ACM SIGGRAPH’s 46th International Conference and Exhibition on Computer Graphics and Interactive Techniques is taking place this year from July 28 to August 1 in Los Angeles. Facebook researchers in AI and AR/VR are presenting their work in oral spotlights and poster sessions.
On Monday, July 29, Facebook’s Director of Computational Photography, Michael F. Cohen, is being presented with the Steven A. Coons Award for Outstanding Creative Contributions to Computer Graphics. Read more about Cohen’s background, career, and current research in our Q&A.
Facebook research being presented at SIGGRAPH
Neural Volumes: Learning Dynamic Renderable Volumes from Images
Stephen Lombardi, Tomas Simon, Jason Saragih, Gabe Schwartz, Andreas Lehrmann, and Yaser Sheikh
Modeling and rendering of dynamic scenes is challenging, as natural scenes often contain complex phenomena such as thin structures, evolving topology, translucency, scattering, occlusion, and biological motion. Mesh-based reconstruction and tracking often fail in these cases, and other approaches (e.g., light field video) typically rely on constrained viewing conditions, which limit interactivity. We circumvent these difficulties by presenting a learning-based approach to representing dynamic objects inspired by the integral projection model used in tomographic imaging. The approach is supervised directly from 2D images in a multiview capture setting and does not require explicit reconstruction or tracking of the object. Our method has two primary components: an encoder-decoder network that transforms input images into a 3D volume representation, and a differentiable ray-marching operation that enables end-to-end training. By virtue of its 3D representation, our construction extrapolates better to novel viewpoints compared to screen-space rendering techniques. The encoder-decoder architecture learns a latent representation of a dynamic scene that enables us to produce novel content sequences not seen during training. To overcome memory limitations of voxel-based representations, we learn a dynamic irregular grid structure implemented with a warp field during ray-marching. This structure greatly improves the apparent resolution and reduces grid-like artifacts and jagged motion. Finally, we demonstrate how to incorporate surface-based representations into our volumetric-learning framework for applications where the highest resolution is required, using facial performance capture as a case in point.
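The ray-marching step described above can be illustrated with a minimal sketch of generic front-to-back compositing along a single ray. This is not the paper's implementation: the function `march_ray`, the callable `sample_volume`, the step count, and the step size are all hypothetical names and values chosen for illustration, and the sketch omits the encoder-decoder network and the warp field entirely.

```python
def march_ray(sample_volume, origin, direction, num_steps=64, step_size=0.05):
    """Accumulate color front to back along one ray.

    sample_volume(point) is a hypothetical callable that returns
    (rgb, alpha), where alpha in [0, 1] is the opacity added by one step.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet blocked
    for i in range(num_steps):
        t = (i + 0.5) * step_size  # sample at the middle of each step
        point = tuple(o + t * d for o, d in zip(origin, direction))
        rgb, alpha = sample_volume(point)
        weight = transmittance * alpha
        color = [c + weight * s for c, s in zip(color, rgb)]
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:  # ray is effectively opaque, stop early
            break
    return color, 1.0 - transmittance


def red_slab(point):
    """Toy volume: empty space in front of an opaque red slab at z > 1."""
    if point[2] > 1.0:
        return (1.0, 0.0, 0.0), 1.0
    return (0.0, 0.0, 0.0), 0.0


color, opacity = march_ray(red_slab, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
# color is [1.0, 0.0, 0.0] and opacity is 1.0
```

Every operation in the loop is a sum or a product, so the same accumulation written in an automatic-differentiation framework would pass gradients from the rendered pixel back to the volume contents, which is the property that makes end-to-end training possible.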
Learning to Optimize Halide with Tree Search and Random Programs
Andrew Adams, Karima Ma, Luke Anderson, Riyadh Baghdadi, Tzu-Mao Li, Michaël Gharbi, Benoit Steiner, Steven Johnson, Kayvon Fatahalian, Frédo Durand, and Jonathan Ragan-Kelley
We present a new algorithm to automatically schedule Halide programs for high-performance image processing and deep learning. We significantly improve upon the performance of previous methods, which considered a limited subset of schedules. We define a parameterization of possible schedules much larger than prior methods and use a variant of beam search to search over it. The search optimizes runtime predicted by a cost model based on a combination of new derived features and machine learning. We train the cost model by generating and featurizing hundreds of thousands of random programs and schedules. We show that this approach operates effectively with or without autotuning. It produces schedules that are on average almost twice as fast as the existing Halide autoscheduler without autotuning, or more than twice as fast with it, and is the first automatic scheduling algorithm to significantly outperform human experts on average.
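As a rough illustration of searching a schedule space with beam search guided by a predicted cost, here is a minimal sketch in plain Python. It is textbook beam search, not the paper's variant, and it does not use Halide. The tile-size choices, the `IDEAL` table, and the quadratic `predicted_cost` are invented stand-ins for a learned cost model.

```python
import heapq


def beam_search(initial, expand, predicted_cost, is_complete, beam_width=4):
    """Keep only the beam_width cheapest partial schedules at each step."""
    beam = [initial]
    while not all(is_complete(s) for s in beam):
        candidates = []
        for state in beam:
            if is_complete(state):
                candidates.append(state)
            else:
                candidates.extend(expand(state))
        beam = heapq.nsmallest(beam_width, candidates, key=predicted_cost)
    return min(beam, key=predicted_cost)


# Toy problem (all values hypothetical): pick one tile size per pipeline stage.
TILE_CHOICES = (1, 8, 32, 64)
IDEAL = (8, 32, 8)  # made-up best tile size for each of three stages


def expand(schedule):
    """Extend a partial schedule with every choice for the next stage."""
    return [schedule + (tile,) for tile in TILE_CHOICES]


def predicted_cost(schedule):
    """Stand-in for a learned cost model: squared distance from IDEAL."""
    return sum((tile - ideal) ** 2 for tile, ideal in zip(schedule, IDEAL))


def is_complete(schedule):
    return len(schedule) == len(IDEAL)


best = beam_search((), expand, predicted_cost, is_complete, beam_width=3)
# best is (8, 32, 8)
```

The beam width trades search time against the risk of discarding a partial schedule that looks expensive early but would have led to a fast complete schedule.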
VR Facial Animation via Multiview Image Translation
Shih-En Wei, Jason Saragih, Tomas Simon, Adam W. Harley, Stephen Lombardi, Michal Perdoch, Alexander Hypes, Dawei Wang, Hernan Badino, and Yaser Sheikh
A key promise of virtual reality (VR) is the possibility of remote social interaction that is more immersive than any prior telecommunication media. However, existing social VR experiences are mediated by inauthentic digital representations of the user (i.e., stylized avatars). These stylized representations have limited the adoption of social VR applications in precisely those cases where immersion is most necessary (e.g., professional interactions and intimate conversations). In this work, we present a bidirectional system that can animate avatar heads of both users’ full likeness using consumer-friendly headset-mounted cameras (HMC). There are two main challenges in doing this: unaccommodating camera views and the image-to-avatar domain gap. We address both challenges by leveraging constraints imposed by multiview geometry to establish precise image-to-avatar correspondences, which are then used to learn an end-to-end model for real-time tracking. We present designs for a training HMC, aimed at data collection and model building, and a tracking HMC for use during interactions in VR. Correspondence between the avatar and the HMC-acquired images is automatically found through self-supervised multiview image translation, which does not require manual annotation or one-to-one correspondence between domains. We evaluate the system on a variety of users and demonstrate significant improvements over prior work.
Synthetic Defocus and Look-Ahead Autofocus for Casual Videography
Kevin Matzen, Cecilia Zhang, Dillon Yao, Ren Ng, Vivien Nguyen, and You Zhang
In cinema, large camera lenses create beautiful shallow depth of field (DOF) but make focusing difficult and expensive. Accurate cinema focus usually relies on a script and a person to control focus in real time. Casual videographers often crave cinematic focus but fail to achieve it. We either sacrifice shallow DOF, as in smartphone videos, or we struggle to deliver accurate focus, as in videos from larger cameras. This paper is about a new approach in the pursuit of cinematic focus for casual videography. We present a system that synthetically renders refocusable video from a deep DOF video shot with a smartphone, and analyzes future video frames to deliver context-aware autofocus for the current frame. To create refocusable video, we extend recent machine learning methods designed for still photography, contributing a new data set for machine training, a rendering model better suited to cinema focus, and a filtering solution for temporal coherence. To choose focus accurately for each frame, we demonstrate autofocus that looks at upcoming video frames and applies AI-assist modules such as motion, face, audio, and saliency detection. We also show that autofocus benefits from machine learning and a large-scale video data set with focus annotation, where we use our RVR-LAAF GUI to create this sizable data set efficiently. We deliver, for example, a shallow DOF video where the autofocus transitions onto each person before she begins to speak. This is impossible for conventional camera autofocus because it would require seeing into the future.
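The look-ahead idea, choosing focus for the current frame from frames that come later, can be sketched with a deliberately simplified rule. The function `lookahead_focus`, its `lookahead` parameter, and the per-frame speaker labels are hypothetical. The system described above combines several detectors and learned components that this sketch does not model.

```python
def lookahead_focus(speaker_by_frame, lookahead=2):
    """Choose a focus target per frame by peeking at upcoming frames.

    speaker_by_frame[i] is a hypothetical detector output: the id of the
    person speaking in frame i, or None if nobody is speaking.
    """
    targets = []
    current = None  # keep the last target while nobody is about to speak
    for i in range(len(speaker_by_frame)):
        window = speaker_by_frame[i : i + lookahead + 1]
        upcoming = next((s for s in window if s is not None), None)
        if upcoming is not None:
            current = upcoming
        targets.append(current)
    return targets


speakers = [None, None, None, "A", "A", None, None, "B", "B"]
targets = lookahead_focus(speakers, lookahead=2)
# targets is [None, "A", "A", "A", "A", "B", "B", "B", "B"]
```

In this toy run the focus moves to person A at frame 1, two frames before A starts speaking at frame 3, and to person B at frame 5, two frames before B speaks. A live camera cannot do this, because the later frames do not exist yet when the current frame is captured.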
DeepFovea: Universal Neural Reconstruction for Foveated Rendering and Video Compression Using Learned Statistics of Natural Videos
Anton Kaplanyan, Anton Sochenov, Gizem Küçükoglu, Mikhail Okunev, and Todd Goodall
In order to provide an immersive visual experience, modern displays require head mounting, high image resolution, low latency, as well as high refresh rate. This poses a challenging computational problem. On the other hand, the human visual system can consume only a tiny fraction of this video stream due to drastic acuity loss in the peripheral vision. Foveated rendering and compression can save computations by reducing the image quality in the peripheral vision. However, this can cause noticeable artifacts in the periphery, or, if done conservatively, would provide only modest savings. In this work, we explore a novel foveated reconstruction method that employs the recent advances in generative adversarial neural networks. We reconstruct a plausible peripheral video out of a small fraction of pixels provided per frame. Our novel approach projects a sparse input stream of pixels onto the learned manifold of natural videos. Our method is both more efficient than state-of-the-art foveated rendering and on par with modern video compression in a foveated scenario, while providing a visual experience with no noticeable quality degradation in either scenario. We conducted a user study to validate our reconstruction method and compare it against two existing foveated rendering and video compression techniques. Our method is fast enough to drive gaze-contingent head-mounted displays in real time on modern hardware. We plan to publish the trained network to establish a new quality bar for foveated rendering and compression as well as to encourage follow-up research.
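To make the notion of a sparse input stream concrete, the sketch below builds a per-frame sampling mask whose density falls off with distance from the gaze point. It covers only the input sparsification step. The density formula, the `falloff` and `floor` parameters, and the function names are assumptions made for illustration, and the neural reconstruction described above is not shown.

```python
import math
import random


def sample_density(x, y, gaze, falloff=0.05, floor=0.01):
    """Hypothetical keep-probability: 1.0 at the gaze point, lower far away."""
    dist = math.hypot(x - gaze[0], y - gaze[1])
    return max(floor, 1.0 / (1.0 + falloff * dist) ** 2)


def sparse_mask(width, height, gaze, seed=0):
    """Return a height x width grid of booleans marking the pixels to keep."""
    rng = random.Random(seed)  # fixed seed so the mask is reproducible
    return [
        [rng.random() < sample_density(x, y, gaze) for x in range(width)]
        for y in range(height)
    ]


gaze = (32, 32)
mask = sparse_mask(64, 64, gaze)
kept = sum(sum(row) for row in mask)
fraction_kept = kept / (64 * 64)
```

Only the pixels marked in `mask` would be rendered or transmitted. A reconstruction step would then have to fill in the rest, which is the role the learned model plays in the method described above.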
Other activities at SIGGRAPH
Talk: Machine Learning for Multiple Scattering
Contributors: Feng Xie (speaker), Anton Kaplanyan, Warren Hunt, and Pat Hanrahan
Course: Capture4VR: From VR Photography to VR Video
Lecturers: Peter Hedman, Ryan S. Overbeck, Brian Cabral, Robert Konrad, and Steve Sullivan