OpenEDS 2020 Challenge

About the challenge

With the advent of consumer virtual reality (VR) products, immersive technology in the form of VR and augmented reality (AR) is gaining mainstream attention. However, from a market adoption perspective, immersive technology is still in its infancy. Both users and developers are devising the right recipe for the technology to garner mass appeal. We posit that eye tracking, a technology that measures where an individual is looking and can enable inference of user attention, is a key driver of mass appeal for these immersive technologies.

To that end, we organized the first workshop and an accompanying competition on Eye Tracking for VR and AR at ICCV 2019. The purpose was to increase awareness of AR/VR eye-tracking challenges in the broader computer vision and machine learning researcher communities. Following the overwhelmingly positive feedback and challenge participation, Facebook Reality Labs (FRL), in collaboration with our academic partners from team-GAZE, will organize a joint full-day workshop at ECCV 2020 in Glasgow, UK. In addition, we are hosting a second iteration of the OpenEDS-Eye Tracking for VR and AR Challenge, which is designed to bridge the scale and generalization aspects of developing eye tracking systems. For the challenge, we are releasing a data set of temporal sequences of synchronized eye images and gaze vectors captured using a VR headset. The paper describing the data set will be made available by April 30.

We invite machine learning and computer vision researchers to participate in this challenge.

Performance Tracks

Track-1 Gaze Prediction Challenge: Various applications for eye tracking, such as foveated rendering (FR) and gaze-based interaction, benefit from low-latency gaze estimates. FR is a technique that presents a high-quality picture at the point where a user is looking, while reducing the quality of the picture in the periphery. FR is a critical application for VR/AR platforms because it allows for a substantial reduction in power consumption of the graphical pipeline without reducing the perceptual quality of the generated picture. However, fast eye movements present a challenge for FR due to the transmission and processing delays present in the eye tracking and graphical pipelines. If the pipelines do not compensate for the delays, fast eye movements can take the user’s gaze to areas of the image that are rendered with low quality, thus degrading the user’s experience. Remedies include reducing the delays, which is not always possible; predicting future gaze locations to compensate for the delays; or a combination of both.

Prediction of future gaze locations can be based on previously estimated gaze points, on an understanding of the content of the presented scene (i.e., visual saliency), or on a combination of both. Considering the real-time requirements of FR and its goal of reducing power consumption, predicting future gaze points from a short subsequence of already-estimated gaze locations is considered the most fruitful path of exploration. If future gaze locations can be predicted with high accuracy, an FR method could be implemented that closely matches the human visual acuity function. As a result, it could encode only a very small part of the image at a high-quality resolution, providing the highest level of energy savings.

Due to the above-mentioned considerations, the challenge calls for the following:

  1. Predicting future gaze locations based on the previously estimated gaze vectors
  2. Predicting future gaze locations based on the previously estimated gaze vectors while additionally leveraging spatio-temporal information encoded in the sequence of previously recorded eye images
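
As an illustration of the first item, the following minimal sketch predicts future gaze vectors by linear (constant-velocity) extrapolation from the last two estimated gaze vectors. It is only a naive baseline for comparison, not a method prescribed by the challenge; the function names and the choice to renormalize each prediction to unit length are assumptions made for this example.

```python
import math


def normalize(v):
    """Scale a 3D vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [c / norm for c in v]


def extrapolate_gaze(history, n_future=5):
    """Predict n_future gaze vectors by linear extrapolation.

    history: list of 3D gaze vectors, oldest first (at least two).
    The per-frame velocity is estimated from the last two frames only.
    """
    if len(history) < 2:
        raise ValueError("need at least two past gaze vectors")
    prev, last = history[-2], history[-1]
    velocity = [b - a for a, b in zip(prev, last)]
    predictions = []
    for step in range(1, n_future + 1):
        raw = [p + step * v for p, v in zip(last, velocity)]
        # Renormalize so that each prediction is a unit direction vector.
        predictions.append(normalize(raw))
    return predictions
```

A steady gaze yields the same vector for every future frame, while a gaze in motion is carried forward along its most recent direction of change. Fast eye movements are not linear, so a learned model would be expected to improve on this.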

Track-2 Sparse Temporal Semantic Segmentation Challenge: Many eye-tracking solutions require accurate estimation of eye features from 2D eye images. Typically, this is done via per-pixel segmentation (also referred to as semantic segmentation) of the key eye regions: the sclera, the iris, and the pupil. For per-pixel segmentation models to generalize to unseen eye images from a diverse population under different eye states (fully open, half-open, closed) and different make-up conditions, the training stage requires large-scale, hand-annotated training data, which is costly and time-consuming to produce. However, it is easy to obtain a data-acquisition setup that captures medium-to-short duration (5–25 seconds) video sequences of eye images and to manually label a handful of images (~5%) for each sequence.

The challenge, then, is to solve label propagation with a limited number of labeled samples per sequence. Solving this problem would yield a large set of annotations without spending large amounts of resources on human annotation. We posit that the small fraction of hand-annotated labels can be accurately propagated to the rest of the images in the sequence by leveraging temporal information along with the geometric and photometric consistencies arising from eye images of the same person. Such approaches call for innovative algorithms to leverage the aforementioned cues. Some promising directions could be the following:

  • Temporal co-segmentation using deep learning
  • Temporal few-shot learning framework
  • Learning and respecting the natural representation, including the geometry, of human eyes for temporal label propagation
  • Leveraging synthetic data generation if and where appropriate (such as UnityEyes, NVGaze)
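
To make the label propagation problem concrete, here is a deliberately naive sketch that assigns every frame the mask of its nearest hand-labeled frame. It is not one of the directions listed above and ignores eye motion entirely; it only illustrates the input and output of the task and could serve as a lower bound for comparison. The function name and the dictionary-based representation of labeled frames are assumptions made for this example.

```python
def propagate_nearest(num_frames, labeled):
    """Assign each frame the mask of the closest labeled frame.

    labeled: dict mapping frame index -> mask (any object).
    When two labeled frames are equally close, the earlier one is used.
    """
    if not labeled:
        raise ValueError("need at least one labeled frame")
    labeled_idx = sorted(labeled)
    result = []
    for frame in range(num_frames):
        nearest = min(labeled_idx, key=lambda i: (abs(i - frame), i))
        result.append(labeled[nearest])
    return result
```

For example, with six frames and labels on frames 0 and 4, frames 0 to 2 receive the first mask and frames 3 to 5 receive the second. A method that exploits temporal, geometric, and photometric consistency would instead adapt the mask to each frame.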

Data set description

OpenEDS2020 is a data set of eye image sequences captured using a VR headset with two synchronized eye-facing cameras at a frame rate of 100 Hz under controlled illumination. The data set is divided into two subsets, one for each performance track.

Track-1 Gaze Prediction Challenge

  • 8,960 video sequences gathered from 80 participants, with 55 to 100 frames per sequence, for a total of 550,400 frames.
  • Each sequence contains eye images at a resolution of 400×640 pixels and respective 3D ground truth gaze vectors for each frame. For the evaluation stage, participants are requested to predict the gaze vector of the last five frames of a 55-frame sequence using up to 50 previous eye images. Thus, all ground truth gaze vectors and the last five eye images of each test sequence are hidden from the participants.
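
The evaluation setup described above (50 past frames in, five future frames out) suggests one way to prepare training examples from the longer sequences: slide a 55-frame window over each sequence and split it into inputs and targets. This is a sketch under that assumption; the challenge does not prescribe how the training data should be windowed, and the function name and `stride` parameter are hypothetical.

```python
def make_windows(sequence, n_input=50, n_target=5, stride=1):
    """Split one sequence into (input, target) pairs of consecutive frames.

    sequence: list of per-frame items (gaze vectors, image paths, ...).
    Returns an empty list if the sequence is shorter than one window.
    """
    window = n_input + n_target
    pairs = []
    for start in range(0, len(sequence) - window + 1, stride):
        inputs = sequence[start:start + n_input]
        targets = sequence[start + n_input:start + window]
        pairs.append((inputs, targets))
    return pairs
```

A 55-frame sequence produces exactly one pair, whose targets are frames 50 to 54, matching the evaluation task; a 100-frame sequence produces 46 pairs at a stride of 1.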

Track-2 Sparse Temporal Semantic Segmentation Challenge

  • 200 video sequences sampled at 5 Hz from the original data, with ~150 frames per sequence, for a total of 29,476 frames.
  • Around five percent of each sequence’s frames contain the semantic segmentation label, for a total of 1,605 semantic segmentation masks. Five frames from each sequence, whose indices are hidden from the participants, will be used for evaluation.
  • The sequences are gathered from 74 participants.

Participation

In order to access the OpenEDS data set and participate in the challenge, please do the following:

1. Read the Official Rules for the Sparse Semantic Segmentation Challenge or the Gaze Prediction Challenge. The Official Rules are a binding contract that governs your use of OpenEDS and are linked below:

2. Submit the following information to openeds2020@fb.com to request access to the data set (OpenEDS):

Name:
Job title:
Institution:
Contact email:
Members of your team (if applicable):

By submitting your request to access OpenEDS, you agree to the Official Rules for the challenge that you are participating in.

3. Create an account at evalAI.cloudcv.org and register your team for one or both of the following two challenges:

  • Gaze Prediction Challenge
  • Sparse Temporal Semantic Segmentation Challenge

4. Develop your algorithm with the help of the training data and validation data available as part of OpenEDS.

5. Generate the submission JSON file.

For the Gaze Prediction Challenge, generate a JSON file for the results produced by your model as applied to the data set. The result for each sequence consists of a five-by-three (5x3) array containing the three-dimensional prediction vector of each future frame. Create the JSON file with the model-generated gaze estimates as follows:

{
  "sequence_ID1": {
    "50": [0.3, 0.2, -0.98],
    "51": [0.2, 0.21, -0.8],
    "52": [...],
    "53": [...],
    "54": [...]
  },
  "sequence_ID2": {
    "50": [0.3, 0.2, -0.98],
    "51": [0.2, 0.1, -0.91],
    "52": [...],
    "53": [...],
    "54": [...]
  },
  ...
}

Note that each "sequence_ID*" key must exactly match the naming used in the ground truth data format. The keys of the future frames must be named "50", "51", "52", "53", and "54".
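
The format above can be produced with the Python standard library. The following sketch checks that each sequence has five three-component vectors before writing the file. The function names and the sequence ID used in the usage note are hypothetical; real sequence IDs must be taken from the data set.

```python
import json

# Frame keys required by the submission format.
FUTURE_FRAME_KEYS = ["50", "51", "52", "53", "54"]


def build_submission(predictions):
    """Convert predictions into the submission dictionary.

    predictions: dict mapping sequence ID -> list of five [x, y, z] vectors,
    ordered from frame 50 to frame 54.
    """
    submission = {}
    for seq_id, vectors in predictions.items():
        if len(vectors) != len(FUTURE_FRAME_KEYS):
            raise ValueError(f"{seq_id}: expected 5 vectors, got {len(vectors)}")
        for vec in vectors:
            if len(vec) != 3:
                raise ValueError(f"{seq_id}: each gaze vector needs 3 components")
        submission[seq_id] = {
            key: [float(c) for c in vec]
            for key, vec in zip(FUTURE_FRAME_KEYS, vectors)
        }
    return submission


def write_submission(predictions, path):
    """Write the submission dictionary to a JSON file at path."""
    with open(path, "w") as f:
        json.dump(build_submission(predictions), f)
```

For instance, `write_submission({"seq_a": five_vectors}, "submission.json")` writes one entry with keys "50" through "54", where `seq_a` and `five_vectors` stand in for a real sequence ID and its predicted gaze vectors.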

For the Sparse Semantic Segmentation Challenge, participants are required to generate the semantic segmentation masks for all the images provided in the data set, except for the images whose labels were already provided. The labels are Pupil (3), Iris (2), Sclera (1), and Background (0). The generated semantic segmentation masks should be saved as uint8 arrays in NumPy .npy binary files, one file per semantic segmentation mask.

The participants are required to save their generated masks in <OUTPUT_FOLDER>, following the same sub-folder structure and naming convention in which the input data set was provided, and to generate a file named <OUTPUT_FOLDER>/output.txt that has one output file name per line. For example, the generated semantic segmentation mask for S_1/3.png should be saved as output/S_1/3.npy, and the output.txt file should have an entry S_1/3.npy.
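
The file listing can be generated automatically once the masks are on disk. The sketch below assumes the masks have already been written under the output folder (for example with `numpy.save` on a uint8 array) and writes output.txt with one relative path per line. The function names are hypothetical, and the sorted ordering is a choice made for this example, since the required order of entries is not stated.

```python
from pathlib import Path


def mask_name_for(image_rel_path):
    """Map an input image path such as S_1/3.png to S_1/3.npy."""
    return Path(image_rel_path).with_suffix(".npy").as_posix()


def write_output_list(output_folder):
    """List every .npy mask under output_folder in output.txt.

    Paths are written relative to output_folder, one per line,
    using forward slashes regardless of the operating system.
    """
    root = Path(output_folder)
    rel_paths = sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*.npy")
    )
    (root / "output.txt").write_text("".join(p + "\n" for p in rel_paths))
    return rel_paths
```

Calling `write_output_list("output")` after saving output/S_1/3.npy produces an output.txt containing the line S_1/3.npy.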

Once the generated semantic segmentation files are ready for submission, the participants are required to run the provided script:

python create_json_ss.py --root_folder <OUTPUT_FOLDER> --submission-json <SUBMISSION JSON>

6. Upload the JSON file to the evalAI challenge portal. The scores will be made available on the leaderboard after evaluation.

Prizes

The following is a summary. Please see the Official Rules for full details.

Winners will be notified on August 10, 2020. Prize money will be distributed to winners (or winning teams) at the OpenEyes Workshop, currently scheduled for <TBD> at the ECCV conference site in Glasgow, UK.

The first place winners will receive $5,000 USD, plus travel for one person (either an Individual Entrant or the Representative of the winning Team, as defined in the Official Rules) to attend and present their model at the Facebook-organized ECCV Workshop.

The second place winners will receive $3,000 USD.

The third place winners will receive $2,000 USD.

Timeline

  • Challenge participation deadline: July 31, 2020
  • Notifications to winners: August 10, 2020
  • Winner announcement and prize distribution: TBD

People

Robert Cavin
Facebook Reality Labs

Jixu Chen
Facebook

Alexander Fix
Facebook Reality Labs

Elias Guestrin
Facebook Reality Labs

Oleg Komogortsev, Visiting Professor
Facebook Reality Labs

Kapil Krishnakumar
Facebook

Tarek Hefny
Facebook Reality Labs

Karsten Behrendt
Facebook

Cristina Palmero, Intern
Facebook Reality Labs

Abhishek Sharma
Facebook Reality Labs

Yiru Shen
Facebook Reality Labs

Sachin S. Talathi
Facebook Reality Labs