Hi, I am Linpeng Tang, a PhD student from Princeton University. During the past summer I worked closely with the Video Infrastructure team on the load testing framework for Streaming Video Engine (SVE), which processes uploaded videos in parallel before serving them to users. I also collaborated with the Facebook Content Distribution Network (FBCDN) team to put RIPQ (FAST’15 paper, FB Research Blog), a novel caching framework we designed previously, into production. Here I want to share several things I have learned during this process (especially from the view of a graduate student) in three different aspects: (1) the technical part, how to build a production-quality system; (2) the non-technical part, how to thrive in an open environment like Facebook; and (3) the relationship between production and research.
How to build a production-quality system
Building a system from scratch until it runs on thousands of servers and serves billions of users is highly rewarding, but also much more complicated than, say, creating a prototype in a research lab. So what are some important guidelines?
Tests, and more tests. What better way to ensure a system’s reliability than testing it in a realistic environment? However, there is an art to testing as well. Take SVE for example: we have a production tier handling user requests. In addition, a test tier is set up to run load tests and try out new features without endangering the whole system. And finally, each engineer can launch a tiny dev tier directly from their development box for small, quick experiments. The three tiers decrease in size, but each allows progressively more agility for fast iteration.
Logs and graphs. Once written, software will usually stay in production for quite some time, so it’s important to make it easily maintainable. Aside from good code design and clear comments, this also involves logging events and statistics and aggregating them to generate real-time graphs. In fact, inside Facebook (and, from what I hear, at other big tech companies as well), whole teams are dedicated to making these tools. They help us quickly pinpoint the issue when something goes wrong, and also provide an invaluable instrument for monitoring the system’s performance characteristics for better understanding and improvement.
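To make the idea concrete, here is a minimal sketch (not Facebook’s actual tooling; the class and event names are hypothetical) of the pattern described above: log raw events with timestamps, then aggregate them into per-interval counts that a dashboard could plot as a time series.

```python
import time
from collections import defaultdict


class MetricsLogger:
    """Sketch of event logging plus aggregation for real-time graphs.

    Events are recorded as (timestamp, name) pairs; aggregate() buckets
    them into fixed intervals so a graphing tool can draw counts over time.
    """

    def __init__(self, interval_sec=60):
        self.interval = interval_sec
        self.events = []  # list of (timestamp, event_name)

    def log(self, name, ts=None):
        # Record one event; ts defaults to "now".
        self.events.append((ts if ts is not None else time.time(), name))

    def aggregate(self):
        # Bucket events by interval start so each (bucket, name) pair
        # becomes one point on a per-event-type time series.
        buckets = defaultdict(int)
        for ts, name in self.events:
            bucket_start = int(ts // self.interval) * self.interval
            buckets[(bucket_start, name)] += 1
        return dict(buckets)


m = MetricsLogger(interval_sec=60)
m.log("upload_ok", ts=0)
m.log("upload_ok", ts=30)
m.log("upload_err", ts=70)
print(m.aggregate())
# {(0, 'upload_ok'): 2, (60, 'upload_err'): 1}
```

A production pipeline would of course ship these events to a central service rather than keep them in memory, but the shape of the data — raw events in, bucketed counters out — is the same.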
Learning to fail. Intriguingly, a crucial way to better understand a system’s behavior and ensure its reliability in the long run is to push it to the limit and observe how it fails. Part of my internship was to design and implement a load testing framework for SVE that generates different kinds of workloads, then applies higher and higher load until the system breaks. These load tests helped the team identify multiple issues and performance bottlenecks, and ultimately accelerated the launch. Inside Facebook, some teams even shift user traffic around to stress test the whole backend for the same purposes.
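The ramp-until-failure loop at the heart of such a load test can be sketched as follows. This is a simplified illustration, not the actual SVE framework; `fake_system` and all the parameter names are made up for the example.

```python
def find_breaking_point(system, start_rps=100, step=100, max_error_rate=0.01):
    """Ramp request rate (requests per second) until the error rate
    exceeds the threshold; return the last healthy rate observed."""
    rps = start_rps
    last_healthy = 0
    while True:
        # In a real framework this would drive actual traffic and
        # measure failures; here `system` models that measurement.
        error_rate = system(rps)
        if error_rate > max_error_rate:
            return last_healthy
        last_healthy = rps
        rps += step


# Hypothetical system model: healthy up to 800 rps, then failures spike.
def fake_system(rps):
    return 0.001 if rps <= 800 else 0.25


print(find_breaking_point(fake_system))  # 800
```

Knowing that breaking point tells the team both how much headroom production has and which component gives out first under pressure.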
How to thrive in an open environment
Facebook boasts an open culture with a lot of self-management and collaboration between teams, while management takes more of a service role. Below are some of the qualities I have observed in my exemplary teammates, which I think are important for thriving in such an open environment.
Reach out. One could probably stay comfortably within the realm of one’s own team and just do the assigned work, but that would waste the great advantage of working at Facebook. Reach out to other teams, see what they are doing, hear what problems they are facing, and think about how to modify existing tools, or even build new ones, to solve those problems—new projects often stem from this process.
And if you feel excited about the work some other team is doing, why not join them? We hear stories of whole departments at some big companies being laid off or transitioned to other positions because they no longer create value. I often joke that such incidents are unlikely to occur at Facebook, because the engineers would have left for more interesting projects themselves long before then!
Know other people’s perspective. Collaboration can be both rewarding and frustrating. Any form of communication takes valuable time away from the work people are officially assigned to, so we must always respect their commitments. Know their perspective and goals in the company so you can understand their responses much better, and see how the collaboration can benefit their work—then it can be mutually beneficial and long-lasting.
Think about other teams when making decisions. The whole Facebook backend is connected through hardware and software dependencies, and one team’s decision has an impact on other teams’ systems. For example, when deciding how many machines are needed for a service, we not only need to consider how much bandwidth each machine can provide, but also whether using 100% of each machine’s bandwidth would saturate the rack’s bandwidth and affect other teams’ services in the same rack. While developing our system, we were affected on multiple occasions by issues in other systems, and they are usually quite difficult to trace and debug—so we must try not to cause such trouble for other teams ourselves.
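The rack-bandwidth reasoning above is simple arithmetic, sketched below with entirely hypothetical numbers (the function name and all figures are illustrative, not real Facebook hardware specs).

```python
def safe_utilization(machines_per_rack, nic_gbps, rack_uplink_gbps, headroom=0.2):
    """If every machine in a rack pushed its NIC to 100%, would the shared
    rack uplink saturate? Return the per-machine utilization cap that still
    leaves the requested headroom on the uplink for other teams' services."""
    total_nic_capacity = machines_per_rack * nic_gbps
    usable_uplink = rack_uplink_gbps * (1 - headroom)
    # Cap at 1.0: if the uplink comfortably exceeds total NIC capacity,
    # machines may run at full bandwidth.
    return min(1.0, usable_uplink / total_nic_capacity)


# Hypothetical rack: 40 machines with 10 Gbps NICs behind a 160 Gbps uplink.
cap = safe_utilization(machines_per_rack=40, nic_gbps=10, rack_uplink_gbps=160)
print(round(cap, 2))  # 0.32
```

In this made-up example the machines can collectively offer 400 Gbps, far more than the 160 Gbps uplink, so each machine should be capped well below its NIC limit—exactly the kind of cross-team constraint the paragraph describes.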
Keep a sense of direction. We need to overcome many hurdles when developing a new system, and we are constantly constrained by our time and energy. Good engineers not only solve problems quickly; they also know which problems are important to solve. This usually involves analyzing the logs, finding the root issues, designing proper solutions, and prioritizing the tasks. Keeping a sense of direction while taking care of the endless engineering details is critical to moving fast in this rapidly changing and ever more complex world.
Production and research
I have spent more than three years in my PhD now. All my previous internships had been more or less research oriented, and this was my first internship where I worked closely with production systems. We designed and implemented a load testing framework for SVE and ran a series of tests that greatly accelerated the progress toward finally launching SVE. We implemented RIPQ, a flash-based caching framework, in the Facebook CDN; previously it was only an academic prototype, and now it has been rolled out in production. As I put more effort into production systems and started to appreciate all the engineering work involved, I gained a deeper understanding of research as well, and of the characteristics and strengths of both worlds.
Developing a production system is a comprehensive effort, and we need to get everything working before the whole system can run smoothly. There are often many alternative techniques for implementing a given piece of functionality, and most of the time no single technique is crucial to the success of the whole system—a suboptimal technique might be slower, require more hardware, or be a quick hack that doesn’t generalize to all situations, but it would still work.
Research, on the other hand, often focuses on one specific issue seen in production. We find an interesting problem, formulate a clean abstraction of it, and based on that abstraction try to understand and solve a general class of problems really well. Another way of doing research is to start from ideas. In systems, for example, good ideas like consistent hashing, Bloom filters, and lineage provide solutions for an array of different problems.
Production and research seem to be two different worlds, but can they be reconciled? We must look at the big picture. Although usually no single technique is critical to the success of one production system, when we inspect all of its aspects, we find that they often stem from research projects. The contribution of one project to one system might be small, but the aggregate effect on the whole community can be great.
I’m also glad to see Facebook’s open attitude toward research. Many engineers, some without formal research training, take an interest in research projects and devote part of their time to research activities. The desire to discover and share new knowledge is deeply rooted in all of us, after all.