December 12, 2017

Building Facebook’s platform for large-scale AR experiences

By: Alvaro Collet, Krishnan Ramnath

Facebook is taking another step toward bringing augmented reality (AR) into the everyday experiences of people around the globe. Earlier this year we announced the concept of AR Studio, a tool that enables developers to build animated frames, face masks, and interactive effects. Today we opened AR Studio to everyone, and as part of this, we're enabling world effect technology. This lets creators build effects where people can place 3D objects in their surroundings and interact with them in real time.

While AR effects like selfie masks and interactive filters are not new, their distribution and use have been siloed, limited to specific hardware devices and computing capabilities. What's new and exciting for us about the launch of world effects is how our research team made AR experiences run across a wide range of mobile devices, platforms, and bandwidths.

A big challenge the engineering team faced was in deploying a great experience for everyone using Facebook. The spectrum of mobile devices used to access the Facebook family of apps is very heterogeneous, ranging from high-end phones with high-quality cameras (e.g., iPhone X, Pixel 2 XL) to older devices with very limited computing power and resources. To ensure everyone using Facebook has a similar experience regardless of device, the team designed a scalable system built from an ensemble of tracking algorithms. Depending on the capabilities of the device, we choose a subset of trackers to run, with parameters tuned to that device's performance, memory, and quality constraints.

Understanding device capabilities

The key to deploying AR experiences on a worldwide scale is the graceful tailoring of algorithms to different mobile device capabilities. The considerations while designing the tracking algorithms are threefold: first, modeling camera characteristics such as rolling shutter artifacts, lens distortion, and motion blur; second, fusing different sensing modalities such as camera and inertial sensors; and finally, adapting to the device's memory, processing power, and storage.

To address the variation in camera characteristics, the team calibrated a variety of devices to compensate for rolling shutter and lens distortion. The calibration process involves both an offline component to estimate the camera intrinsics and an online component that actively compensates for rolling shutter. Motion blur is detected through the inertial sensors, and when it becomes excessive we switch to inertial tracking. To provide seamless transitions between camera and inertial tracking, these sensors also have to be calibrated and kept synchronized with the camera. While some high-end phones already provide this synchronization in hardware, many older devices do not, so the online calibration process keeps the sensor streams in sync.
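At its simplest, the motion-blur fallback described above can be illustrated by thresholding the gyroscope's angular speed. This is a minimal sketch under our own assumptions (the threshold value and function names are hypothetical, not Facebook's actual logic, which would also account for exposure time and per-device calibration):

```python
import math

# Hypothetical threshold: angular speed (rad/s) above which we assume
# the camera image is too blurred for reliable visual tracking.
BLUR_ANGULAR_SPEED_THRESHOLD = 2.0

def is_motion_blurred(gyro_sample):
    """Guess whether the current frame is motion-blurred, using only
    the magnitude of the gyroscope reading (wx, wy, wz) in rad/s."""
    wx, wy, wz = gyro_sample
    angular_speed = math.sqrt(wx * wx + wy * wy + wz * wz)
    return angular_speed > BLUR_ANGULAR_SPEED_THRESHOLD

def choose_tracker(gyro_sample):
    """Fall back to inertial tracking when visual input is unreliable."""
    return "inertial" if is_motion_blurred(gyro_sample) else "visual"
```

The appeal of this kind of check is that it costs almost nothing per frame, so even the lowest-end devices can afford to run it continuously.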

Since world effects will run on devices with different capabilities, we knew that a one-size-fits-all approach would result in compromised quality. Our solution was to design a framework that allows for both graceful degradation and switching between algorithms to account for variations in device capability. Optimizing across the different axes of processing power, memory, and quality is a challenging task, and we needed to provide the highest quality of tracking possible within those limits. To achieve this, we created four different tracking algorithms with complementary modes of operation: Simultaneous Localization and Mapping (SLAM)-based tracking, appearance-based region tracking, inertial tracking, and face geometry-based tracking.
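To make the capability-based selection concrete, one could imagine each tracker declaring minimum requirements and the framework enabling only the trackers a device can afford. The table and function below are a hypothetical sketch of that idea; the numbers and names are illustrative, not the real system's profiles, which would come from measured benchmarks:

```python
# Hypothetical per-tracker requirements (illustrative values only).
TRACKER_REQUIREMENTS = {
    "slam":       {"min_cores": 4, "min_ram_mb": 2048, "needs_imu": True},
    "appearance": {"min_cores": 1, "min_ram_mb": 512,  "needs_imu": False},
    "inertial":   {"min_cores": 1, "min_ram_mb": 256,  "needs_imu": True},
    "face":       {"min_cores": 1, "min_ram_mb": 512,  "needs_imu": False},
}

def select_trackers(cores, ram_mb, has_imu):
    """Pick the subset of trackers a given device can run."""
    return [name for name, req in TRACKER_REQUIREMENTS.items()
            if cores >= req["min_cores"]
            and ram_mb >= req["min_ram_mb"]
            and (has_imu or not req["needs_imu"])]
```

A high-end phone would pass every check and get the full ensemble, while an older phone without a usable IMU might be limited to the appearance-based and face trackers.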

SLAM is an algorithm that simultaneously estimates a map of the environment and the camera's position in real time. Having a SLAM system capable of running at 60 Hz on mobile devices is hard. For example, every 16 milliseconds your phone has to capture an image, find hundreds of interesting key points, match them with the same points in the previous frame, and then use trigonometry to determine where each of these points is in 3D space. This required extensive fine-grained optimization and rethinking of how these algorithms operate.
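The "trigonometry" step for a single key point can be sketched with the classic midpoint method: given the camera center and viewing ray for the same point in two frames, find where the two rays come closest in 3D space. This pure-Python sketch is our own illustration (not Facebook's implementation) and omits everything a real SLAM pipeline adds, such as outlier rejection and bundle adjustment:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def triangulate_midpoint(c1, d1, c2, d2):
    """Triangulate a 3D point observed from two camera centers c1, c2
    along ray directions d1, d2, by finding the closest point on each
    ray and returning their midpoint."""
    r = [b - a for a, b in zip(c1, c2)]        # vector from c1 to c2
    a11, a12, a22 = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    b1, b2 = dot(r, d1), dot(r, d2)
    denom = a11 * a22 - a12 * a12              # ~0 when rays are parallel
    t1 = (b1 * a22 - b2 * a12) / denom         # distance along ray 1
    t2 = (a12 * b1 - a11 * b2) / denom         # distance along ray 2
    p1 = [c + t1 * d for c, d in zip(c1, d1)]
    p2 = [c + t2 * d for c, d in zip(c2, d2)]
    return [(x + y) / 2 for x, y in zip(p1, p2)]
```

Doing this for hundreds of points, every 16 milliseconds, alongside feature detection and matching, is what makes the mobile compute budget so tight.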

While SLAM uses features to estimate scene geometry, a complementary technology is the appearance-based tracker, which makes no assumptions about scene geometry and tracks purely based on the appearance of a particular region of pixels. SLAM is more capable in static scenes, while the appearance-based tracker is more robust to image artifacts and changes in the scene. The appearance-based tracker operates without calibration and has wide applicability, as it can also track moving objects. It allows us to trade off speed against accuracy, and hence can run on low-end devices with some loss in accuracy.
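One way to picture the speed-versus-accuracy trade-off is a template search: track a pixel region by scanning a neighborhood of its last position for the best match, where a coarser scan step is faster but less precise. The sketch below, under our own simplified assumptions (grayscale frames as nested lists, sum-of-squared-differences matching), illustrates the idea rather than the actual tracker:

```python
def track_region(frame, template, prev_xy, search_radius=4, step=1):
    """Locate `template` (a small 2D patch) in `frame` near the
    previous position (px, py), minimizing sum-of-squared differences.
    A larger `step` scans fewer candidates: faster, less accurate."""
    th, tw = len(template), len(template[0])
    px, py = prev_xy
    best, best_xy = float("inf"), prev_xy
    for y in range(max(0, py - search_radius),
                   min(len(frame) - th, py + search_radius) + 1, step):
        for x in range(max(0, px - search_radius),
                       min(len(frame[0]) - tw, px + search_radius) + 1, step):
            ssd = sum((frame[y + i][x + j] - template[i][j]) ** 2
                      for i in range(th) for j in range(tw))
            if ssd < best:
                best, best_xy = ssd, (x, y)
    return best_xy
```

On a low-end device one might set `step=2` and a smaller search radius, accepting coarser tracking in exchange for a fraction of the per-frame cost.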

The inertial tracker relies on attitude readings from the inertial sensors (gyroscope, accelerometer and magnetometer) to provide accurate rotational phone motion, and is useful as a fallback in cases where we have less accurate pixel information such as with motion blur or surfaces with little texture. Not all devices have the hardware support for inertial tracking, but for supported devices inertial tracking is extremely fast.
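At its core, inertial rotation tracking is dead reckoning: integrating angular velocity over time. The one-axis sketch below is a deliberately minimal illustration (our own, not the production tracker, which fuses all three gyroscope axes with the accelerometer and magnetometer to correct drift):

```python
def integrate_yaw(gyro_z_samples, dt):
    """Dead-reckon rotation about a single axis by integrating
    angular-velocity samples (rad/s) over a fixed time step dt (s).
    Each sample contributes wz * dt radians of rotation."""
    yaw = 0.0
    for wz in gyro_z_samples:
        yaw += wz * dt
    return yaw
```

Because this is a handful of multiply-adds per sensor sample, it is cheap enough to run at full rate even on the weakest supported hardware, which is what makes it such a useful fallback.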

Finally, specialized trackers can also be used to track objects of known geometry. One such tracker is the same face tracker that powers Facebook’s selfie mask effects, which is already highly optimized to run on low-end devices.

Having multiple trackers provides the ability to handle a variety of different situations and use cases. But it also means that we have to be able to efficiently decide which combinations to use to deliver the best results. The system offers the flexibility to adjust to each user’s mobile device.

Placing things in the real world

The world effects framework is the umbrella interface that combines the different tracking algorithms to "place things in the world." It contains the logic that runs a combination of algorithms and dynamically switches between them as the scene and processing requirements vary. The framework can also switch off individual algorithms when memory pressure on the device is high. As a result, high-end devices can run the full ensemble of algorithms with the best tracking quality, while on lower-end devices tracking still delivers a reasonable experience within the limits of the hardware.
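The runtime switching described above can be sketched as a priority list: prefer the richest tracker whose memory cost still fits the current budget, and fall down the list as pressure rises. The priorities and costs below are hypothetical placeholders of our own, not the real framework's numbers:

```python
# Hypothetical priority order and per-tracker memory cost (MB).
TRACKER_PRIORITY = ["slam", "appearance", "inertial"]
TRACKER_COST_MB = {"slam": 150, "appearance": 40, "inertial": 5}

def active_tracker(enabled, free_mb):
    """Return the best enabled tracker whose memory cost fits the
    current free-memory budget; fall back down the priority list as
    memory pressure rises. Returns None if nothing fits."""
    for name in TRACKER_PRIORITY:
        if name in enabled and TRACKER_COST_MB[name] <= free_mb:
            return name
    return None
```

Under this scheme the same device can move between trackers frame to frame: full SLAM when memory is plentiful, appearance-based tracking under moderate pressure, and inertial-only tracking as a last resort.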

This launch is a milestone in our journey to change the way we experience and share the world around us. By putting the power of AR in the hands of all creators, and enabling them to deploy across Facebook’s global community, we hope to accelerate the creation of amazing and fun new AR experiences for everyone on Facebook.