Facebook: Downtime was due to a configuration error on backbone routers
The hours-long downtime that Facebook and its services experienced Monday night was caused by a configuration change on the backbone routers for its data centers. There had already been indications that the cause was a BGP update from Facebook itself.
Facebook did not provide details about the configuration changes that led to the large-scale problems. The company does say that the disruption had a snowball effect that made its services inaccessible. “The root cause of the outage also impacted many of the internal tools and systems we use in our day-to-day work, making it difficult to diagnose and fix the issue quickly,” the company said. Facebook apologized to users.
The statement confirms accounts that appeared online soon after Monday’s downtime, including those from a Reddit user named Ramenporn, who is believed to have been working with the team investigating the disruption. In posts that have since been deleted, he claimed that the cause was a set of configuration changes related to BGP peering.
According to him, this made the Facebook networks unreachable via the routing tables, so administrators could no longer access the routers remotely to fix the problem. Someone therefore had to reach the routers physically to make changes, but the people on site reportedly lacked the knowledge to do so, and the lack of communication made knowledge transfer difficult. According to a New York Times reporter, an additional obstacle was that employees could not enter the data centers because their badges no longer worked.
Facebook uses BGP in its own way and on a large scale in its data center networks, as the company’s engineers previously described in a paper titled Running BGP in Data Centers at Scale. The engineers state that this allows them to implement ‘fast incremental updates’, among other things.
Cloudflare explains in an analysis that Facebook was indeed pushing BGP updates to its networks just before the issues began. BGP stands for Border Gateway Protocol and is the protocol that networks, known as Autonomous Systems, use to exchange routing information with each other. Via BGP, networks advertise the prefixes they hold to the rest of the Internet, so that other networks can add them to their routing tables and reach them. With the update to its backbone routers, Facebook stopped announcing the routes to its own name servers, so DNS resolvers worldwide could no longer answer queries for Facebook and its services. This, in turn, led to further problems as clients worldwide kept retrying their lookups of Facebook’s domain names, resulting in a deluge of DNS traffic that could overload DNS resolvers. Traffic to other services, such as Twitter, also increased. The outage lasted about six hours.
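The mechanism described above can be illustrated with a small sketch. This is a toy model, not Facebook's or Cloudflare's actual setup: the RoutingTable class, the resolve function, the prefix 192.0.2.0/24, the address 192.0.2.53 and AS number 64500 are all hypothetical and taken from ranges reserved for documentation. It only shows why withdrawing a prefix leaves a resolver without a route to the authoritative name server, so that lookups fail.

```python
import ipaddress


class RoutingTable:
    """Toy routing table mapping announced prefixes to their origin AS."""

    def __init__(self):
        self.routes = {}

    def announce(self, prefix, origin_as):
        # A BGP announcement tells other networks how to reach a prefix.
        self.routes[ipaddress.ip_network(prefix)] = origin_as

    def withdraw(self, prefix):
        # A BGP withdrawal removes the prefix from the table again.
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        # Longest-prefix match; returns the origin AS, or None if no route exists.
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


def resolve(table, nameserver_ip):
    # Simplified assumption: a resolver can only answer if it has a route
    # to the authoritative name server; otherwise the lookup fails.
    if table.lookup(nameserver_ip) is None:
        return "SERVFAIL"
    return "NOERROR"


NAMESERVER_IP = "192.0.2.53"   # hypothetical name server address
table = RoutingTable()
table.announce("192.0.2.0/24", 64500)  # hypothetical prefix and AS number

before = resolve(table, NAMESERVER_IP)
table.withdraw("192.0.2.0/24")
after = resolve(table, NAMESERVER_IP)
print(before, after)
```

In this simplified model the name servers themselves keep running the whole time. They simply become unreachable once the route is gone, which matches the situation the analysis describes.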
Availability of the DNS name Facebook.com on Cloudflare’s resolver