January 25, 2018

Enabling full body AR with Mask R-CNN2Go

By: Amit Jindal, Andrew Tulloch, Ben Sharma, Bram Wasti, Fei Yang, Georgia Gkioxari, Jaeyoun Kim, Jason Harrison, Jerry Zhang, Kaiming He, Orion Reblitz-Richardson, Peizhao Zhang, Peter Vajda, Piotr Dollar, Pradheep Elango, Priyam Chatterjee, Rahul Nallamothu, Ross Girshick, Sam Tsai, Su Xue, Vincent Cheung, Yanghan Wang, Yangqing Jia, Zijian He

The Facebook AI Camera Team is working on various computer vision technologies and creative tools to help people express themselves. For example, with real-time “style transfer”, you can give your photos or videos the look of a Van Gogh painting. With a real-time face tracker, you can add makeup or even replace your face with an avatar. So what if you could replace your entire body with an avatar?

To replace the entire body with an avatar, we need to accurately detect and track body movements in real time. This is a very challenging problem due to large variations in poses and identities. A person might be sitting, walking or running. She or he might be wearing a long coat or shorts. And a person is often obstructed by other people or objects. All of these factors dramatically increase the difficulty of building a robust body tracking system.

We recently developed a new technology that can accurately detect body poses and segment a person from their background. Our model is still in the research phase at the moment, but it is only a few megabytes and can run on smartphones in real time. Someday, it could enable many new applications, such as creating body masks, using gestures to control games, or de-identifying people.

Mask R-CNN2Go architecture

Our human detection and segmentation model is based on the Mask R-CNN framework, a conceptually simple, flexible, and general framework for object detection and segmentation. It can efficiently detect objects in an image while simultaneously predicting key points and generating a segmentation mask for each object. The Mask R-CNN framework won the best paper award at ICCV 2017. To run Mask R-CNN models in real time on mobile devices, researchers and engineers from the Camera, FAIR and AML teams worked together to build an efficient and lightweight framework: Mask R-CNN2Go.

The Mask R-CNN2Go model consists of five major components.

  1. The trunk model contains multiple convolutional layers, and generates deep feature representations of the input image.
  2. A region proposal network (RPN) proposes candidate objects at predefined scales and aspect ratios (anchors). A ROI-Align layer extracts features from each object bounding box and sends them to the detection head.
  3. The detection head contains a set of convolution, pooling, and fully-connected layers. For each candidate box, it predicts how likely the object is to be a person. The detection head also refines the bounding box coordinates, groups neighboring boxes with non-max suppression, and generates a final bounding box for each person in the image.
  4. With the bounding box of each person, we use a second ROI-Align layer to extract features, which are the inputs to the key point head and the segmentation head.
  5. The key point head has an architecture similar to that of the segmentation head. It predicts a mask for each predefined key point on the body. A single sweep for the maximum of each mask is used to generate the final coordinates.
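
To make a few of these steps concrete, here is a minimal sketch in plain Python of three of the ideas above: placing anchors at predefined scales and aspect ratios, grouping neighboring boxes with greedy non-max suppression, and reading a key point location from the maximum of its mask. This is an illustration only, not the Mask R-CNN2Go implementation. The function names, the box layout (x1, y1, x2, y2), the height-to-width definition of aspect ratio, and the 0.5 overlap threshold are all assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def generate_anchors(cx: float, cy: float, scales: Sequence[float],
                     ratios: Sequence[float]) -> List[Box]:
    """Anchors centered at (cx, cy); each has area scale**2 and height/width = ratio."""
    anchors: List[Box] = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes: Sequence[Box], scores: Sequence[float],
                        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


def keypoint_from_mask(mask: Sequence[Sequence[float]]) -> Tuple[int, int]:
    """Single pass over a key point mask; returns (x, y) of its maximum."""
    best, best_xy = float("-inf"), (0, 0)
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value > best:
                best, best_xy = value, (x, y)
    return best_xy
```

In this sketch, `non_max_suppression` returns the indices of the boxes that survive, so the caller can look up the matching scores and features, and `keypoint_from_mask` returns coordinates in the mask's own grid, which would still need to be mapped back to image coordinates using the person's bounding box.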


Lightweight model optimized for mobile devices

Unlike modern GPU servers, mobile phones have limited computational power and storage. The original Mask R-CNN model is based on ResNet, which is too big and too slow to run on mobile phones. To solve this problem, we developed a very efficient model architecture optimized for mobile devices.

We applied several approaches to reduce the model size. We optimized the number of convolution layers and the width of each layer, because convolution is the most time-consuming part of processing. To ensure a large enough receptive field, we used a combination of kernel sizes including 1×1, 3×3 and 5×5. We also used weight pruning to reduce the size. Our final model is only a few megabytes and is very accurate.
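
The trade-offs in this paragraph can be illustrated with some simple arithmetic. The sketch below, in plain Python, counts the parameters of a convolution layer as a function of its width and kernel size, computes the receptive field of a stack of convolutions, and zeroes the smallest weights in a list. It is not our model code: the function names are made up for the example, and magnitude-based pruning is assumed here only because it is a common form of weight pruning, not because it is the specific method used in Mask R-CNN2Go.

```python
from typing import List, Sequence, Tuple


def conv_params(in_channels: int, out_channels: int, kernel: int) -> int:
    """Weights plus biases of a standard 2D convolution layer."""
    return in_channels * out_channels * kernel * kernel + out_channels


def receptive_field(layers: Sequence[Tuple[int, int]]) -> int:
    """Receptive field of stacked convolutions given (kernel, stride) pairs."""
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride                # strides multiply the step between outputs
    return field


def prune_by_magnitude(weights: Sequence[float], fraction: float) -> List[float]:
    """Zero out the given fraction of weights with the smallest magnitude."""
    n_prune = int(len(weights) * fraction)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:n_prune]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned
```

For example, `conv_params(64, 64, 3)` is 36,928 while `conv_params(32, 32, 3)` is 9,248, so halving the width of a layer cuts its parameters to roughly a quarter. Likewise, `receptive_field([(1, 1), (3, 1), (5, 1)])` is 7: the 5×5 kernel contributes most of the coverage, while the 1×1 kernel adds none.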

Modular design improves computation speed

To run deep learning algorithms in real time, we leverage and optimize our core framework, Caffe2, together with NNPack, SNPE and Metal. By using mobile CPU and GPU libraries, including NNPack, SNPE and Metal, we are able to significantly improve the mobile computation speed. All of this is done with a modular design, without changing the general model definition. As a result, we get both a small model size and a fast runtime, and avoid potential incompatibilities.
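
The idea of a modular design can be sketched with a toy example: the model definition names operators only, and each backend supplies its own implementation of those operators. The plain Python code below is purely hypothetical. It does not use the Caffe2, NNPack, SNPE or Metal APIs, and the names `MODEL_DEFINITION`, `REFERENCE_BACKEND`, `OPTIMIZED_BACKEND` and `run` are invented for illustration.

```python
from typing import Callable, Dict, List, Sequence

Op = Callable[[List[float]], List[float]]

# Hypothetical model definition: an ordered list of operator names.
# It says nothing about which kernels execute them.
MODEL_DEFINITION: List[str] = ["scale", "relu"]

# Two hypothetical backends that implement the same operators differently.
REFERENCE_BACKEND: Dict[str, Op] = {
    "scale": lambda xs: [2.0 * x for x in xs],
    "relu": lambda xs: [x if x > 0 else 0.0 for x in xs],
}
OPTIMIZED_BACKEND: Dict[str, Op] = {
    "scale": lambda xs: [x + x for x in xs],      # same result, different kernel
    "relu": lambda xs: [max(x, 0.0) for x in xs],
}


def run(definition: Sequence[str], backend: Dict[str, Op],
        inputs: Sequence[float]) -> List[float]:
    """Execute the same model definition with whichever backend is supplied."""
    values = list(inputs)
    for name in definition:
        values = backend[name](values)
    return values
```

Because `run` looks up each operator by name, swapping `REFERENCE_BACKEND` for `OPTIMIZED_BACKEND` changes how the work is done without touching `MODEL_DEFINITION`, which is the property the paragraph above describes.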

Facebook AI Research (FAIR) recently published the Mask R-CNN research platform (Detectron). We have open-sourced the implementations of the Caffe2 operators (GenerateProposalsOp, BBoxTransformOp, BoxWithNMSLimit, and RoIAlignOp) and the model conversion code needed for model inference, for the community to use.

What’s next

Developing computer vision models for mobile devices is a challenging task. A mobile model has to be small, fast and accurate without large memory requirements. We will continue exploring new model architectures that lead to more efficient models. We will also explore models that better fit mobile GPUs and DSPs, which have the potential to save both battery and computational power.