In this monthly interview series, we turn the spotlight on members of the academic community and the important research they do — as thought partners, collaborators, and independent contributors.
For June, we nominated Balaji Prabhakar, a professor at Stanford University. Through the Stanford Platform Lab, Prabhakar has been an active member of the Facebook Research community for over a decade, along with several other Stanford faculty members.
In this Q&A, Prabhakar shares what the early days of research at Facebook looked like and where he sees their collaboration going in the future. He also explains his most recent project with Facebook, On-Ramp (NSDI 2021).
Q: Tell us about your role at Stanford and the type of research you specialize in.
Balaji Prabhakar: I’m a professor of electrical engineering and computer science; my title is VMware Founders Professor of Computer Science. I also have a courtesy appointment in the Graduate School of Business at Stanford. My research area is mainly data centers and cloud computing. Early in my career I worked on network theory, algorithms, and protocols; now I do much more systems work, building larger-scale systems rather than algorithms that sit inside particular subsystems.
I have also worked with Facebook on road traffic reduction. This was with Facebook’s transportation lead circa 2011. We were discussing the use of incentives to “nudge” Facebook commuters to travel off-peak or take shuttles — basically, to reduce road traffic congestion, a different type of congestion than computer network congestion.
Q: When and how did your relationship with Facebook start?
BP: It started over a decade ago. Facebook had an office in downtown Palo Alto — which was probably Facebook’s first office, after the house where Facebook started. At the time, Facebook was growing fast, and the social graph was growing almost 10x every few months — a gigantic growth rate. And so the question was, what sort of systems (databases, compute, network infrastructure) will be required to support this growth? What will the back end look like at scale? The initial networks were all one gigabit per second, but what happens when we move to higher speeds, like 10 or 40 Gbps? That was the original conversation.
At the time, we had the Stanford Experimental Data Center Lab, of which I was the faculty director. This evolved into the Stanford Platform Lab (SPL). We were talking with Jeff Rothschild, who at the time was overseeing much of the technology work at Facebook. I think Mike Schroepfer had just been hired, so it was very early days. On the Stanford side, my colleague John Ousterhout was beginning to formulate what became the RAMCloud project. Our colleagues in the lab at that time were Mendel Rosenblum and Chris Kozyrakis. Chris is the faculty director of SPL as of this year.
So that was the very first conversation that led to our first collaboration, and rather quickly it led to a couple of PhD students being interviewed. I remember Berk Atikoglu, my former PhD student, received an offer to work at Facebook. What he learned during that experience later became the basis for his thesis work.
Read the paper: Workload Analysis of a Large-Scale Key-Value Store
At the time, Berk and his mentor were looking at ways of improving memcached performance, which is important to Facebook’s infrastructure. Luckily for Berk, Facebook’s office on Cambridge Ave. was close to his dorm room, so he could easily go to work at Facebook while staying connected to Stanford. His thesis, which he published as a conference paper, was on the analysis of memcached.
We’ve had fruitful interactions with Facebook like this one on several fronts, since the early days. Because of its size, and the breadth and depth of the technical problems the company deals with, many of my faculty colleagues also have ongoing collaborations with Facebook.
Q: Tell us about your most recent collaboration with Facebook, On-Ramp.
BP: On-Ramp is described in the USENIX NSDI 2021 paper “Breaking the transience-equilibrium nexus: A new approach to datacenter packet transport.” The project with Facebook got going after the paper was accepted at NSDI 2021 but before it was published. We worked with Facebook to test our solution in a real-world testbed; the results were included in the final paper. There was a fair amount of work to do on the Facebook Engineering team’s end, to bring in something from our side and try it, for which we are really grateful.
Read the paper: Breaking the Transience-Equilibrium Nexus: A New Approach to Datacenter Packet Transport
So what is On-Ramp? It’s a way of reducing congestion in the network by using very accurately synchronized clocks, something we had developed at Stanford. With synchronized clocks, you can measure exactly how long a data packet has spent in the network. This gave us a very accurate way of measuring congestion, down to the sub-microsecond level. The challenge is that as Facebook’s networks grow faster and faster, the amount of time you have to react to congestion decreases.
It’s basically the same concept on the road: The faster vehicles drive, the shorter the reaction time for drivers to stop when they need to. The problem in data networks is exacerbated by the fact that as network speeds go up, the buffer space in switches is actually shrinking because it’s too resource-intensive to have a lot of buffering at those very high speeds. It’s sort of a double whammy — speed’s gone up and space for storing packets in switch buffers has gone down. So congestion must be detected quickly and accurately, and brakes must be applied immediately.
Because On-Ramp is so quick and accurate at measuring network path congestion, it can stop traffic from entering the network once it detects congestion, hence the name On-Ramp. This is like the metering at highway on-ramps: The stoplights let cars onto the highway more slowly if the congestion on the highway is high, and if there’s less traffic, then they’ll let them in faster. It’s pretty much the same idea: If I can very quickly measure network path congestion because I have accurate clocks, I can stop traffic at the edge of the network or let it go in, depending on whether there is congestion.
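The metering idea can be sketched in a few lines. This is an illustrative simplification, not the actual On-Ramp implementation: assuming sender and receiver clocks are tightly synchronized, the one-way delay of a packet is simply its receive timestamp minus its send timestamp, and the sender pauses new traffic at the edge whenever that delay exceeds a threshold. The function names and the threshold value below are hypothetical.

```python
# Illustrative sketch of the On-Ramp metering idea (not the real implementation).
# Assumes sender and receiver clocks are tightly synchronized, so a one-way
# delay can be computed directly from timestamps carried in each packet.

THRESHOLD_US = 50.0  # hypothetical congestion threshold, in microseconds


def one_way_delay_us(tx_timestamp_us: float, rx_timestamp_us: float) -> float:
    """One-way network delay; meaningful only because clocks are synchronized."""
    return rx_timestamp_us - tx_timestamp_us


def should_pause(tx_timestamp_us: float, rx_timestamp_us: float) -> bool:
    """Gate new traffic at the network edge, like a highway on-ramp meter."""
    return one_way_delay_us(tx_timestamp_us, rx_timestamp_us) > THRESHOLD_US
```

For example, a packet sent at t=1000 us and received at t=1080 us saw an 80 us one-way delay, above the 50 us threshold, so the sender would hold new traffic at the edge; at 20 us of delay it would let traffic in.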
Initial testing was done by Facebook engineers for the paper, and the project was encouraging to the point where now an intern is coming to Facebook this summer to work with Facebook’s network engineering team. He’s going to run trials in different scenarios and in different parts of the Facebook infrastructure, study On-Ramp’s effectiveness, and figure out how to solve some of the thorny problems that come up in practical deployments.
Facebook is very, very good at taking academic ideas and trying them out at scale — and, if they work, implementing them. Many times, students go to companies and come back with really interesting ideas, but not necessarily a path to deployment like they may get at Facebook.
Q: How do you see Facebook and Stanford working together to solve challenges for future data center needs?
BP: In the beginning, the relationship was mainly about challenges for designing infrastructure at scale. Then, as we came up with some ideas, Facebook became willing to let us try things out — even in their production network, which was great. And then Facebook joined our industry affiliates program and funded our research. Internal champions are critical for university collaborations to really take hold and be impactful in both directions, so I want to acknowledge folks like Jay Parikh and Omar Baldonado. Manoj Wadekar kicked off the recent engagement between us. Without a direct relationship between the engineering teams at Facebook and researchers at universities, the glue that binds a collaboration can come loose.
In general, Stanford is in a fortunate situation, given where it’s located. We get to work with a number of companies. We’re also fortunate that some of our former students are now working at these companies. In the future, I’m interested in seeing Facebook engineers spend time at Stanford on a part-time or sabbatical-like basis. Recently, Omar Baldonado from Facebook’s Networking team spent a few weeks at Stanford, and this was a great success. Given that Stanford is so close to Facebook HQ, I’m hoping this will be easy to do. Facebook folks can join group meetings and discussions at Stanford, describe major next-generation challenges that Stanford researchers could tackle, collaborate with them, and potentially take successful outcomes back to Facebook for implementation.
In terms of new projects to work on in the future, a lot of the infrastructure challenges are going either in the machine learning or the data direction. Several of my colleagues work on problems related to building systems for machine learning workloads or, conversely, using machine learning to build better systems. These will be big topics in the near future, and I think they’re interesting both to universities and to companies like Facebook.
The pandemic has also revealed several other things, one of which is that video communication is here to stay. This is the future, and as a large tech company, Facebook has a role to play in this future. After all, getting people to connect and communicate is a core component of Facebook’s mission, and if communication shifts to video, then I think it makes sense for Facebook to focus on that. Large-scale video communication at a group level presents some very interesting technical challenges.
Q: Where can people learn more about you and your work?