August 20, 2017

Steering oceans of content to the world

By: Hyojeong Kim, James Hongyi Zeng

This week at ACM SIGCOMM 2017 in Los Angeles, we are sharing details on Edge Fabric, the system we built to make our edge networking faster, more efficient, and more flexible. Edge Fabric has been in production for the past several years, steering egress user traffic from our PoPs around the world. Our SIGCOMM paper, Engineering Egress with Edge Fabric: Steering Oceans of Content to the World, provides the first public details on Facebook’s connectivity to the internet and how we manage it, including opportunities and challenges we have seen.

Networking is an integral part of Facebook’s infrastructure. Facing the fast-growing demand from both services and people, we embarked several years ago on building in-house software systems to scale our network. We have shared details of the systems we have built that span our entire network, including:

  • In the data center — FBOSS, Wedge, and Backpack are the hardware and software that power the open switches running in our data centers, a key step toward a disaggregated network.
  • In the backbone — Express Backbone and Open/R make up our SD-WAN, a dedicated private backbone network that leverages centralized traffic engineering in a global controller coupled with a new distributed routing system.
  • Cross-domain automation/life cycle — Robotron is a system that handles every part of the network life cycle across all network domains, from data center to backbone and edge.
  • End-to-end monitoring — fbflow, a host-based system, provides end-to-end flow monitoring.

Following Facebook’s open culture, we have shared the details of all of these in many different ways. We regularly publish system designs and operational experience at top academic conferences such as SIGCOMM and NSDI; we have open-sourced several parts of our software stack, such as FBOSS, NetNorad, fbzmq, and fbpush, to the Open Compute Project and contributed switching hardware designs; we regularly talk about our networks at industry conferences such as Networking @Scale, OCP Summit, IETF, and NANOG; and we even share the fbflow data collected from our production data centers.

Edge Fabric

Facebook is one of the largest content providers on the internet. Every day, people and communities rely on Facebook’s global infrastructure to share information, ranging from photos and videos to messages and News Feed. To better serve this content, we have built out a global network of points of presence (PoPs), each connected to tens or hundreds of networks.

To deliver the best user experience, our edge networking team needs to account for the complex connectivity options in each PoP, the preferences of each peer network, and application-level performance. We soon found that traditional routing protocols could not meet these needs. Before we dive into the system details, let’s first set the context.

Challenges

When somebody visits Facebook, our global load-balancing system, Cartographer, assigns them to PoPs via DNS. After arriving at a PoP, user requests either terminate there or are relayed back to our data centers for further processing. Responses are sent back to users via the same PoP.
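
As a rough mental model (the mapping below is a toy example, not Cartographer’s actual logic, which weighs PoP capacity, health, and measured performance), DNS-based PoP selection can be pictured like this:

```python
# Toy sketch of DNS-based PoP selection. Cartographer's real logic considers
# PoP capacity, health, and measured performance; the mapping and addresses
# here are purely illustrative (RFC 5737 documentation addresses).
POP_VIPS = {
    "us-west": "192.0.2.1",
    "eu-central": "192.0.2.2",
    "ap-south": "192.0.2.3",
}

def resolve(resolver_region: str) -> str:
    """Return the virtual IP of the PoP a DNS query would be answered with."""
    # Unknown regions fall back to a default PoP in this sketch.
    return POP_VIPS.get(resolver_region, POP_VIPS["us-west"])

print(resolve("eu-central"))  # -> 192.0.2.2
```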

At each PoP, Facebook can have diverse routing options to reach a particular person. The most common options are:

  • Private peering — a dedicated link to the end user’s network.
  • Public peering — a connection to the end user’s network over the shared fabric of an internet exchange point.
  • Transit provider — another network that promises to deliver the content to the end user’s network eventually.

Most networks prefer these options in that order. The preference is configured statically and expressed through the Border Gateway Protocol (BGP), the protocol networks use to exchange routing information.
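
As a rough sketch of how such a static preference is typically expressed, a BGP import policy can map peer type to local preference so that best-path selection favors private peering first. The numeric values below are illustrative, not Facebook’s actual policy:

```python
# Illustrative local-preference values only; not Facebook's actual policy.
# Higher local preference wins in BGP best-path selection.
LOCAL_PREF = {
    "private_peering": 300,
    "public_peering": 200,
    "transit": 100,
}

def pick_bgp_best(routes):
    """Mimic the first tie-breaker of BGP best-path selection: highest local preference."""
    return max(routes, key=lambda r: LOCAL_PREF[r["peer_type"]])

routes = [
    {"next_hop": "transit-provider-a", "peer_type": "transit"},
    {"next_hop": "ixp-peer-b", "peer_type": "public_peering"},
    {"next_hop": "private-peer-c", "peer_type": "private_peering"},
]
print(pick_bgp_best(routes)["next_hop"])  # -> private-peer-c
```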

We found that BGP on its own could not easily overcome two important limitations:

  • Peering capacity is limited, but BGP is not capacity-aware. A link’s capacity may not be able to deliver all traffic that Facebook would like to send over it. Rapid growth in demand can quickly overwhelm the capacity of an existing interconnection. Short-term spikes in demand (perhaps due to an event or a holiday) or reductions in capacity due to failures can cause volatility in demand and available capacity. In any of these cases, BGP’s decision won’t change, as it is not capacity-aware.
  • BGP decisions can hurt performance. Facebook’s BGP policy favors paths that are likely to optimize network traffic performance. However, because BGP itself is not performance-aware, this policy relies on attributes such as path length that serve as imperfect heuristics for proximity and performance. Our paper gives several examples of where these heuristics can go wrong.

System architecture

In 2013, we started to develop Edge Fabric to overcome these challenges. Edge Fabric consists of several loosely coupled microservices.

Data collection

Edge Fabric first needs to know all the routes from a PoP to a destination and which routes traffic will traverse if it does not intervene. It also needs the volume of traffic per destination prefix and the capacity of each egress interface in the PoP. We built a BMP (BGP Monitoring Protocol) collector and a traffic collector to gather this information.
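
Concretely, the state these collectors hand to the controller can be modeled roughly as follows. This is a minimal sketch; the type and field names are assumptions rather than Edge Fabric’s actual schema:

```python
# Minimal sketch of per-PoP state assembled by the collectors; names are
# hypothetical, not Edge Fabric's actual schema.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Route:
    prefix: str          # destination prefix, e.g. "203.0.113.0/24"
    interface: str       # egress interface this route uses
    peer_type: str       # "private_peering", "public_peering", or "transit"
    bgp_preferred: bool  # would BGP pick this route without any override?

@dataclass
class PopState:
    # Routes learned via the BMP collector, keyed by destination prefix.
    routes: Dict[str, List[Route]] = field(default_factory=dict)
    # Current traffic rate per prefix in bits per second, from the traffic collector.
    prefix_traffic_bps: Dict[str, float] = field(default_factory=dict)
    # Capacity of each egress interface in bits per second.
    interface_capacity_bps: Dict[str, float] = field(default_factory=dict)
```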

Controller

The heart of the controller is the allocator. It projects what the utilization of every egress interface in the PoP would be if no overrides were injected, assigning each prefix to the route(s) preferred by its emulation of BGP path selection; the BGP best-path computation may return multiple equivalent routes.

Every 30 seconds, the allocator generates overrides to shift traffic away from interfaces that it projects will be overloaded. For each overloaded interface, it identifies the prefixes projected to traverse the interface and, for each prefix, the available alternate routes. It then identifies the single ⟨prefix, alternate route⟩ pairing that it prefers to shift first from the interface, applying a series of rules in order until a single preference emerges.

Once a pairing has been selected, the allocator records the decision and updates its projections, removing the prefix’s traffic from the original interfaces and placing all of the PoP’s traffic for the prefix onto the interface of the selected alternate route.
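
Building on the PopState sketch above, one allocator round can be reduced to something like the following. This is deliberately simplified: it ignores splitting across equivalent routes and collapses the paper’s ordered preference rules into a single “shift the largest prefix first” heuristic, with a hypothetical utilization threshold.

```python
# Simplified sketch of one allocator round; the real allocator handles
# equal-cost splitting and applies several ordered preference rules.

def project_utilization(state: PopState) -> Dict[str, float]:
    """Project per-interface load assuming each prefix follows its BGP-preferred route."""
    load = {ifc: 0.0 for ifc in state.interface_capacity_bps}
    for prefix, routes in state.routes.items():
        preferred = next((r for r in routes if r.bgp_preferred), None)
        if preferred is not None:
            load[preferred.interface] += state.prefix_traffic_bps.get(prefix, 0.0)
    return load

def allocate(state: PopState, threshold: float = 0.95):
    """Return (prefix, alternate_route) overrides for interfaces projected to overload."""
    load = project_utilization(state)
    overrides = []
    shifted = set()  # prefixes already moved in this round
    for ifc, capacity in state.interface_capacity_bps.items():
        while load[ifc] > threshold * capacity:
            candidates = []
            for prefix, routes in state.routes.items():
                if prefix in shifted:
                    continue
                on_ifc = any(r.bgp_preferred and r.interface == ifc for r in routes)
                alternates = [r for r in routes if r.interface != ifc]
                if on_ifc and alternates:
                    traffic = state.prefix_traffic_bps.get(prefix, 0.0)
                    # Illustrative heuristic only: prefer shifting the largest prefix.
                    candidates.append((traffic, prefix, alternates[0]))
            if not candidates:
                break  # nothing left that can be shifted off this interface
            traffic, prefix, alt = max(candidates, key=lambda c: c[0])
            overrides.append((prefix, alt))
            shifted.add(prefix)
            # Update projections: all of the prefix's traffic now uses the alternate route.
            load[ifc] -= traffic
            load[alt.interface] += traffic
    return overrides
```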

Enacting overrides

In each round, the allocator generates a set of BGP updates for Edge Fabric’s overrides and assigns each update a very high local preference. The allocator passes the BGP updates to the BGP Injector service, which maintains a BGP connection with every router in the PoP and enacts the overrides by announcing the BGP updates to the target routers. The injector service then withdraws any overrides that are no longer valid in the current allocation.
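
In spirit, that step amounts to the loop below. The injector stub and the local-preference value are hypothetical; the real BGP Injector service is a BGP speaker peered with every router in the PoP.

```python
# Hypothetical sketch of enacting a round of overrides; the announce/withdraw
# callbacks stand in for the real BGP Injector's sessions with PoP routers.
HIGH_LOCAL_PREF = 65000  # illustrative value, chosen to beat any policy-assigned preference

def enact(current_overrides, previously_injected, announce, withdraw):
    """Announce this round's overrides and withdraw the ones that are no longer valid.

    current_overrides: dict mapping prefix -> route chosen by the allocator this round
    previously_injected: dict mapping prefix -> route injected in the previous round
    announce / withdraw: callbacks into the injector's BGP sessions
    """
    # Each override is announced with a very high local preference so the
    # routers prefer it over any route learned from peers or transit providers.
    for prefix, route in current_overrides.items():
        announce(prefix, egress=route.interface, local_pref=HIGH_LOCAL_PREF)

    # Withdraw overrides from the previous round that are absent from this one.
    for prefix in set(previously_injected) - set(current_overrides):
        withdraw(prefix)

    # The current allocation becomes the baseline for the next round.
    return dict(current_overrides)
```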

These microservices together make up the Edge Fabric system, and the effect is that Edge Fabric is able to augment BGP with measurement and control mechanisms to overcome BGP’s lack of congestion awareness and performance awareness. We designed it to be simple and scalable, taking advantage of centralized control and existing support in vendor software and hardware.

The paper has more details about these microservices, as well as how we incorporate application-level performance metrics such as bandwidth and latency into the decision process. We also share a brief history of the system’s evolution and our operational experience.

Looking ahead

A large share of today’s internet traffic is delivered by a small number of big content providers. How they interact with other networks largely shapes inter-domain routing around the world.

BGP will be the internet’s inter-domain routing standard for the foreseeable future. By sharing our experience engineering our egress traffic, including a detailed look at the opportunities and challenges presented by the internet connectivity of today’s large content providers, we hope that the limitations of BGP can be better understood and the experience of everybody on the internet can be improved.

Edge Fabric complements the systems Facebook has previously shared, adding detail on the edge of our network alongside our backbone, data center networks, and management systems. Overall, Facebook has taken a software-centric approach to the future of our network, and we look forward to continuing to share and discuss these network systems with both industry and academia.

Acknowledgements

Many people from the Facebook networking team contributed to this project over the past several years, including Brandon Schlinker, Timothy Cui, James Quinn, Saif Hasan, Petr Lapukhov, Niky Riga, Varghese Vaidhyan, Xintong Hu, Steve Shaw, Manikandan Somasundaram, Lisa Guo, Bolek Kulbabinski, Alexander Kramarov, Vincent Mauge, Jimmy Ottosson, Callahan Warlick.