Building Switch Software at Facebook Scale

This week at ACM SIGCOMM 2018 in Budapest, Hungary, we are sharing details on FBOSS, the system running on Facebook-designed network switches powering our global data centers. Our SIGCOMM paper, Building Switch Software at Facebook Scale, provides the details on how we design, implement and operate one of the world’s largest open source switch platforms at scale.

Every day, people and communities use Facebook’s infrastructure to share information, from messages and News Feed posts to images and videos. To support these user-facing features, numerous products and research projects require large amounts of data to be processed and transferred between groups of machines within Facebook. Over the years, we have built multiple data centers with a large amount of compute and network infrastructure. To give a sense of how fast the network is growing, our FBOSS deployments in our data centers increased by 30x over a period of two years, as seen below:

As the network grew, we soon found that traditional methods of building and deploying switch software did not fully address our needs. We therefore decided to build and deploy our own switch hardware, Wedge and Backpack, along with open source switch software called FBOSS, which stands for Facebook Open Switching System, to run on this hardware. Before we dive into the details of FBOSS, let’s first set the context for why we needed our own switching software.

Challenges

One of the main technical challenges in running large networks is managing the complexity of excess networking features. Most switch vendors understandably try their best to build common software that can meet the needs of their entire customer base; thus their software includes the union of all features requested by all customers over the lifetime of the product. However, more features lead to more code, which can lead to increased bugs, security holes, operational complexity and downtime. We wanted to build software that implements only a carefully selected subset of networking features that we absolutely need.

Further, scaling a large network requires a high rate of innovation while maintaining network stability. Vendors prioritize changes and features by how well they correlate across all of their customers. We found instances where our feature needs did not correlate well across the vendors’ other customers.

Finally, another challenge we faced when using vendor switches and switch software is the difficulty in integrating the software into existing infrastructure. Facebook already has infrastructure in place for managing, monitoring and deploying general software services. However, since this software is built in-house at Facebook, switch vendors do not have full access to the code. Therefore, we had to spend additional effort to integrate vendor switch software with existing infrastructure.

These challenges motivated us to venture into the world of building our own open source switching software that can be built in an incremental fashion with easy integration with existing Facebook infrastructure. We have covered some basic concepts in our blog post before. Now let’s discuss some details of the FBOSS architecture. For more information, please check out the source in the FBOSS GitHub repo.

FBOSS Architecture

FBOSS was designed with the following two design principles in mind.

  • Switch-as-a-server: We wanted to design switch software as if we are building a large-scale software service. Facebook has infrastructure in place to build, deploy and monitor large-scale software services. Therefore, we wanted to leverage this infrastructure to build our own switch software.
  • Deploy-early-and-iterate: We built FBOSS in an incremental fashion, starting the project by building the minimal set of networking features possible and deploying it into production. We then fixed bugs exposed by various failures, iterated on new versions, and quickly added small features as needed. This design principle allowed us to launch the slimmest version of our software and then iterate quickly on new features.

With these design principles in mind, we built FBOSS from the following components, which interact with one another as shown in the following diagram.

Let’s now dive into each component in detail. More details of each component are shared in our paper.

  • Switch SDK: A switch SDK is ASIC vendor-provided software that exposes APIs for interacting with low-level ASIC functions. These APIs include ASIC initialization, installing forwarding table rules, and listening to event handlers.
  • HwSwitch: The HwSwitch represents an abstraction of the switch hardware. The interfaces of HwSwitch provide generic abstractions for configuring switch ports, sending and receiving packets to these ports, and registering callbacks for state changes on the ports and packet input/output events that occur on these ports.
  • Hardware Abstraction Layer: FBOSS allows users to easily add an implementation that supports a specific ASIC by extending the HwSwitch interface. This also allows easy support for multiple ASICs without changes to the main FBOSS code base. The custom implementation must support the minimal set of functionality specified in the HwSwitch interface.
  • State Observers: SwSwitch makes it possible to implement low-level control protocols such as ARP, NDP, LACP and LLDP by keeping protocols apprised of state changes. The protocols are notified of state changes via a mechanism called state observation. Specifically, any object at the time of its initialization may register itself as a State Observer. By doing so, every future state change invokes a callback provided by the object. The callback provides the state change in question, allowing the object to react accordingly.
  • SwSwitch: The SwSwitch provides the hardware independent logic for switching and routing packets, and interfaces with the HwSwitch to transfer the commands down to the switch ASIC.
  • Thrift Management Interface: We run our networks in a split control configuration. Each FBOSS instance contains a local control plane, running protocols such as BGP or OpenR, on a microserver that communicates with a centralized network management system through a Thrift Management Interface. Given that the interfaces can be modified to fit our needs, Thrift provides us with a simple and flexible way to manage and operate the network, leading to increased stability and high availability.
  • QSFP Service: The QSFP service manages a set of QSFP ports. This service detects QSFP insertion or removal, reads QSFP product information (e.g., manufacturer), controls QSFP hardware functions (e.g., changing the power configuration), and monitors the QSFPs.
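The layering described above (a hardware-independent SwSwitch driving a per-ASIC HwSwitch, with protocols subscribing to state changes as observers) can be sketched in a few lines. The following is a minimal Python sketch with invented names; FBOSS itself is written in C++ and its actual interfaces differ.

```python
from abc import ABC, abstractmethod


class HwSwitch(ABC):
    """Illustrative stand-in for the HwSwitch hardware abstraction.

    A real implementation would wrap a vendor switch SDK; the method
    names here are invented for the sketch."""

    @abstractmethod
    def init_ports(self, config): ...

    @abstractmethod
    def send_packet(self, port, packet): ...


class SwSwitch:
    """Hardware-independent layer: holds switch state and fans every
    state change out to registered observers (the state observation
    mechanism used by protocols such as ARP, NDP, LACP, and LLDP)."""

    def __init__(self, hw: HwSwitch):
        self.hw = hw
        self.state = {}
        self._observers = []

    def register_state_observer(self, observer):
        # Any object may register at initialization time.
        self._observers.append(observer)

    def update_state(self, new_state):
        old_state, self.state = self.state, new_state
        for obs in self._observers:
            # Each observer receives the state delta and reacts.
            obs.state_updated(old_state, new_state)


class NeighborUpdater:
    """Toy observer standing in for an ARP/NDP-style protocol handler."""

    def __init__(self):
        self.seen = []

    def state_updated(self, old, new):
        self.seen.append((old, new))
```

The key design point is that protocol handlers never poll the hardware; they react only to published state deltas, which keeps the protocol logic independent of any particular ASIC.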

FBOSS Testing and Deployment

Switch software is conventionally developed and released by switch vendors to a large customer base. A new release of the switch software can therefore take months, with extended development and QA test cycles. In addition, because software update cycles are infrequent, an update usually contains a large number of changes that can introduce new bugs. In contrast, typical Facebook software deployments are much more frequent and thus contain a smaller set of changes per update. Furthermore, feature deployments are coupled with automated and incremental testing mechanisms to quickly catch and fix bugs. Our outage records from a representative month of network operational data, shown in the figure below, indicate that about 40% of the outages were hardware-related and the other 60% were software-related. This led us to develop a suite of software responsible for testing and deploying features in an agile fashion.

Instead of using an existing automated software deployment framework such as Chef or Jenkins, FBOSS employs its own deployment software, fbossdeploy, which is purpose-built to maintain a tight feedback loop with existing external monitors such as Beringei and Scuba. Using fbossdeploy, FBOSS follows a three-stage deployment practice, which includes testing on a subset of production switches of lower importance. The three stages of deployment are as follows.

  • Continuous Canary: Automatically deploys every commit continuously to a few production switches for each type of switch in our network. Monitors check for immediate failures with the commits and reverts the changes if any abnormalities are detected. Continuous canary is able to quickly catch errors related to switch initialization, such as issues with warm boot, configuration errors, and unpredictable race conditions.
  • Daily Canary: Daily canary runs once a day, as its name suggests, and automatically deploys all of a single day’s commits to 10 to 20 production switches for each type of switch in our network. Monitors check for bugs that may surface over a longer period of time, such as memory leaks or performance regressions in critical threads, and revert the changes when such abnormalities are detected.
  • Staged Deployment: Once daily canary completes, a human operator intervenes to push the latest code to all of the switches in production. This process is performed once every two weeks for increased reliability. If the number of failed switches exceeds a preset threshold, usually around 0.5% of the entire switch fleet, the deployment script stops and asks the operator to investigate the issues and take appropriate action. We keep the final step manual for two reasons. First, a single server is fast enough to deploy the code to all of the switches in the data center, so the deployment process is not bottlenecked by one machine deploying the code. Second, it allows the operator to perform fine-grained monitoring for unexpected bugs that the existing monitors may not catch automatically.
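The halt-for-operator logic in the staged deployment step can be sketched as follows. This is a hypothetical Python illustration of the threshold behavior described above, not fbossdeploy’s actual code; the function and field names are invented.

```python
def staged_deploy(switches, deploy_fn, failure_threshold=0.005):
    """Push new code to each switch in turn, stopping for human
    investigation once failures exceed ~0.5% of the fleet.

    `deploy_fn(switch)` stands in for the per-switch push and
    returns True on success; it is an invented placeholder."""
    failed = []
    # Allowed failures before halting: ~0.5% of the fleet, at least 1.
    limit = max(1, int(len(switches) * failure_threshold))
    for switch in switches:
        if not deploy_fn(switch):
            failed.append(switch)
            if len(failed) > limit:
                # Stop and hand off to the operator.
                return {"status": "halted_for_operator", "failed": failed}
    return {"status": "complete", "failed": failed}
```

In practice the interesting design choice is the hand-off itself: the script does not try to auto-remediate past the threshold, because at that point the failures are, by definition, something the monitors did not anticipate.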

We are continually striving to improve our testing and deployment infrastructure. We aim to increase the frequency of our feature deployment, while not negatively affecting our reliability.

FBOSS Management

As mentioned in the introduction, FBOSS is integrated into the existing network management system (Robotron). We now discuss the details of the integration.

Robotron is Facebook’s main network management system. It is responsible for generating, storing and disseminating configurations for FBOSS. Robotron contains the centralized configuration database, which FBOSS draws its configuration data from. The configuration of network devices is highly standardized in data center environments. Given a specific topology, each device is automatically configured by using templates and auto-generated configuration data. For example, the IP address configuration for a switch is determined by the type of the switch (e.g., ToR or aggregation), and its upstream/downstream neighbors in the cluster.
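The template-driven configuration idea can be illustrated with a small sketch: given a switch’s role and its position in the topology, the configuration is fully determined. The role names, address scheme, and ASN convention below are invented for illustration and are not Robotron’s actual templates.

```python
def generate_config(switch):
    """Derive a switch configuration purely from its role and its
    place in the topology (hypothetical scheme for illustration)."""
    role = switch["role"]              # e.g. "tor" or "agg"
    pod, index = switch["pod"], switch["index"]
    # Per-role address templates; placeholders filled from topology.
    loopback_templates = {
        "tor": "10.{pod}.{index}.1/24",
        "agg": "10.{pod}.254.{index}/24",
    }
    return {
        "hostname": f"{role}{index}.pod{pod}",
        "loopback": loopback_templates[role].format(pod=pod, index=index),
        "bgp_asn": 65000 + pod,        # per-pod private ASN, illustrative
    }
```

Because every field is a pure function of topology data, two switches in the same role and position in different pods get structurally identical configurations, which is what makes automated generation and auditing tractable.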

Once an active configuration has been generated and distributed, Robotron can instruct FBOSS to use different versions of the configuration. To quickly and safely change between configurations, FBOSS stages all prior configurations in its own database. If there is a need to revert to a prior configuration, FBOSS can simply reuse the staged copy. Robotron uses other monitoring infrastructure to store device states reported by FBOSS and decides whether to use a certain version of the configuration.
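The local staging behavior can be sketched as an append-only version history, so a revert never depends on reaching the central database. This is a toy Python illustration with invented names, not FBOSS’s actual storage code.

```python
class ConfigStore:
    """Toy sketch of staging prior configurations locally so that a
    revert can reuse a staged copy (names invented for illustration)."""

    def __init__(self):
        self.versions = []   # append-only history of applied configs
        self.active = None

    def stage_and_apply(self, config):
        self.versions.append(config)
        self.active = config

    def revert(self):
        """Drop the current version and re-apply the previous one."""
        if len(self.versions) < 2:
            raise RuntimeError("no prior configuration staged")
        self.versions.pop()
        self.active = self.versions[-1]
```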

In addition to managing configurations, Robotron monitors FBOSS operational states and performance via Thrift interfaces and Linux system logs. Traditionally, data center operators use standardized network management protocols, such as SNMP, to collect switch statistics, such as CPU/memory utilization, link load, packet loss and miscellaneous system health, from the vendor network devices. However, the Thrift interface on FBOSS allows us to define our own data collection specifications and change them whenever we need. Also, the Thrift monitoring system is faster and can be optimized to reduce collection time. Finally, Linux logs provide detailed lower-level logs for our engineers to use that allow them to further analyze and improve the system.
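The flexibility described above, defining our own collection specification rather than being limited to a fixed SNMP MIB, can be sketched as follows. The spec contents, the `fetch` method, and the client object are all invented placeholders standing in for a Thrift client; FBOSS’s real service definitions differ.

```python
# Hypothetical self-defined collection spec: which counter groups to
# poll and which counters within each group. Editing this dict is the
# sketch's analogue of "change the specification whenever we need."
COLLECTION_SPEC = {
    "port_counters": ["in_bytes", "out_bytes", "in_errors"],
    "system": ["cpu_util", "mem_util"],
}


def collect(client, spec=COLLECTION_SPEC):
    """Poll each counter group through a Thrift-style client.

    `client.fetch(group, counters)` is an invented call returning one
    value per requested counter."""
    result = {}
    for group, counters in spec.items():
        values = client.fetch(group, counters)
        result[group] = dict(zip(counters, values))
    return result
```

The point of the sketch is that adding or removing a counter is a one-line change to the spec on the collection side, with no dependence on a standardized MIB being updated.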

Moving Forward

Data center networks are quickly evolving and growing at a rapid rate. Many large data center operators are building their own white-box switches and deploying their own software on them—and FBOSS is one such project. Overall, Facebook has taken a software-centric approach to the future of switch software, and by sharing our design and experiences, we hope that we influence upcoming changes in network systems in both industry and academia.

Acknowledgement

Many people in the networking team at Facebook have contributed to FBOSS over the years and toward this paper. In particular, Adam Simpkins, Tian Fang and Jasmeet Bagga are among the initial team who architected the FBOSS software design. Rob Sherwood, Alex Eckert, Boris Burkov, Saman Kazemkhani and Ying Zhang contributed heavily to the SIGCOMM paper.
