Crashing the Internet

Something interesting happened on the Internet Friday that you may not have noticed. For a fairly brief period of time, about 2% of sites lost their connection (they couldn’t communicate with the rest of the Internet). It was caused by a latent bug in Cisco routers that was triggered by an experiment run by researchers at Duke University with RIPE NCC, Europe’s Internet management organization:

The problem started just before 9 a.m. Greenwich Mean Time Friday and lasted less than half an hour. It was kicked off when RIPE NCC (Reseaux IP Europeens Network Coordination Centre) and Duke ran an experiment that involved the Border Gateway Protocol (BGP) — used by routers to know where to send their traffic on the Internet. RIPE started announcing BGP routes that were configured a little differently from normal because they used an experimental data format. RIPE’s data was soon passed from router to router on the Internet, and within minutes it became clear that this was causing problems.

BGP is the protocol that advertises routes between the 40,000 networks that make up the Internet, and the specification permits advertisements to carry optional elements called “attributes.” One use of attributes is to convey Quality of Service characteristics, but there are others. The attribute that the Duke/RIPE folks were testing was valid enough, but it was something that Cisco routers hadn’t seen before and therefore didn’t know how to process. In a case like this, the router has a couple of choices: it can pass the attribute on unchanged or it can ignore it. The Cisco code did the one thing it shouldn’t have done: it corrupted the attribute and then passed it on. This caused networks downstream from the offending router to break off communication with the network that sent them the bad route. That broke routing to about 2% of sites.
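To make the "pass it on or ignore it" choice concrete, here is a minimal sketch of how a router might treat an attribute it doesn't recognize. It assumes the flag bit layout from the BGP specification (RFC 4271); the function name and the tuple representation of an attribute are my own inventions for illustration, not anything from Cisco's code.

```python
# Attribute flag bits as laid out in RFC 4271 (high-order bits of the flags octet).
OPTIONAL = 0x80
TRANSITIVE = 0x40
PARTIAL = 0x20


def forward_unrecognized(flags: int, type_code: int, value: bytes):
    """Decide what to do with an attribute this router does not understand.

    Returns the (flags, type_code, value) tuple to forward, or None to drop it.
    The value bytes are never modified: they are passed through or discarded.
    """
    if not flags & OPTIONAL:
        # Every router must understand well-known attributes, so this is an error.
        raise ValueError("unrecognized well-known attribute")
    if flags & TRANSITIVE:
        # Pass it along untouched, marking that a router on the path skipped it.
        return (flags | PARTIAL, type_code, value)
    # Optional and non-transitive: quietly ignore it.
    return None
```

For example, `forward_unrecognized(0xC0, 99, b"\x01\x02")` hands back the same two value bytes with only the Partial bit added to the flags. The type code 99 here is just a placeholder for "something this router has never seen."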

The researchers suspended their experiment, which restored order, and Cisco quickly issued the following Security Advisory: Cisco IOS XR Software Border Gateway Protocol Vulnerability

Cisco IOS XR Software contains a vulnerability in the Border Gateway Protocol (BGP) feature. The vulnerability manifests itself when a BGP peer announces a prefix with a specific, valid but unrecognized transitive attribute. On receipt of this prefix, the Cisco IOS XR device will corrupt the attribute before sending it to the neighboring devices. Neighboring devices that receive this corrupted update may reset the BGP peering session.

Affected devices running Cisco IOS XR Software corrupt the unrecognized attribute before sending to neighboring devices, but neighboring devices may be running operating systems other than Cisco IOS XR Software and may still reset the BGP peering session after receiving the corrupted update. This is per standards defining the operation of BGP.

Cisco developed a fix that addresses this vulnerability and will be releasing free software maintenance upgrades (SMU) progressively starting 28 August 2010. This advisory will be updated accordingly as fixes become available.

This advisory is posted at http://www.cisco.com/warp/public/707/cisco-sa-20100827-bgp.shtml.

This will happen again.

BGP is not experiment-friendly, as it lacks a robust data format for experimental attributes. The normal way to represent network features that aren’t widely recognized is to house them in a software container that allows systems to skip the entire feature while parsing a stream of bytes. The most common way to do this – and a standard since the days of the OSI protocol work in the 1980s – is with a Type-Length-Value (TLV) envelope. The “Length” part enables a software parser to skip to the next element cleanly when it doesn’t know what to do with the “Type” element, which will be the case in experimental situations.
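Here is a rough sketch of the skipping behavior a TLV envelope makes possible. It assumes a made-up wire format with a one-byte type and a one-byte length, which is not the encoding of any particular protocol; the function name and the sample bytes are hypothetical.

```python
def parse_tlvs(data: bytes, known_types: set):
    """Walk a byte stream of TLV elements (1-byte type, 1-byte length, value).

    Returns two lists of (type, value) pairs: the elements whose type is in
    known_types, and the elements that were skipped intact via their length.
    """
    recognized, skipped = [], []
    offset = 0
    while offset < len(data):
        if offset + 2 > len(data):
            raise ValueError("truncated TLV header")
        tlv_type = data[offset]
        length = data[offset + 1]
        end = offset + 2 + length
        if end > len(data):
            raise ValueError("TLV value runs past end of buffer")
        value = data[offset + 2:end]
        # The length field lets us step over a type we don't understand
        # without touching or misreading its contents.
        if tlv_type in known_types:
            recognized.append((tlv_type, value))
        else:
            skipped.append((tlv_type, value))
        offset = end
    return recognized, skipped


# Three elements: types 1 and 2 are known, type 99 is "experimental".
stream = bytes([1, 2, 0xAA, 0xBB, 99, 3, 1, 2, 3, 2, 1, 0xCC])
recognized, skipped = parse_tlvs(stream, {1, 2})
```

The parser never needs to know what type 99 means. It reads the length, jumps three bytes ahead, and picks up cleanly at the next element, leaving the unknown bytes intact for anyone downstream who does understand them.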

So the exercise boils down to this: researchers attempting to enrich one of the Internet’s fundamental behaviors, routing, encountered a bug in one of the Internet’s fundamental elements, Cisco software, brought about by the poor specification of one of the Internet’s fundamental protocols, BGP. I hope network operators will apply the patch and that the researchers will try again, but the example illustrates some of the inertia in large systems such as the Internet. They can be changed, but slowly and carefully, with steps forward and backward in the process.