AI experts from around the world are gathering in Vancouver BC this week for ICLR 2018, the sixth International Conference on Learning Representations, to present the latest advances in AI Learning Representations. Research from Facebook will be presented in peer-reviewed publications and posters. Our researchers and engineers will also be leading and presenting numerous workshops throughout the week.
Join us LIVE from ICLR 2018
For the first time, we’ll be hosting a Facebook LIVE, and streaming many of the ICLR Main sessions from the ICLR Facebook page, starting Monday at 9:00am PST. Be sure to tune in if you can’t be there in person.
Facebook Research at ICLR 2018:
Mastering the Dungeon: Grounded Language Learning by Mechanical Turker Descent
Zhilin Yang, Saizheng Zhang, Jack Urbanek, Will Feng, Alexander Miller, Arthur Szlam, Douwe Kiela, Jason Weston
Contrary to most natural language processing research, which makes use of static datasets, humans learn language interactively, grounded in an environment. In this work we propose an interactive learning procedure called Mechanical Turker Descent (MTD) that trains agents to execute natural language commands grounded in a fantasy text adventure game. In MTD, Turkers compete to train better agents in the short term, and collaborate by sharing their agents’ skills in the long term. This results in a gamified, engaging experience for the Turkers and a better quality teaching signal for the agents compared to static datasets, as the Turkers naturally adapt the training data to the agent’s abilities.
Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play
Sainbayar Sukhbaatar, Zeming Lin, Ilya Kostrikov, Gabriel Synnaeve, Arthur Szlam, Rob Fergus
We describe a simple scheme that allows an agent to learn about its environment in an unsupervised manner. Our scheme pits two versions of the same agent, Alice and Bob, against one another. Alice proposes a task for Bob to complete; and then Bob attempts to complete the task. In this work we will focus on two kinds of environments: (nearly) reversible environments and environments that can be reset. Alice will “propose” the task by doing a sequence of actions and then Bob must undo or repeat them, respectively. Via an appropriate reward structure, Alice and Bob automatically generate a curriculum of exploration, enabling unsupervised training of the agent. When Bob is deployed on an RL task within the environment, this unsupervised training reduces the number of supervised episodes needed to learn, and in some cases converges to a higher reward.
We let agents who speak different languages play a referential game where they have to pick out the correct image. They learn to translate each other’s language to their own “native language” as they get better at playing the game. This kind of translation, without any parallel data from professional translators, is an interesting new research direction that has recently been gaining traction in the community. Our approach works really well, considering that there is no aligned data and that our training objective is a referential game, rather than directly learning to translate. Learning to translate like this is natural, as it is similar to how humans learn other languages, i.e., by communicating with each other in a shared (visual) environment. In addition to showing that we can achieve this kind of “emergent” translation in multi-agent settings, we highlight several interesting use-cases, such as learning to translate “alien” languages by showing them pictures, and building and examining linguistic communities.
Emergent Communication in a Multi-Modal, Multi-Step Referential Game
Katrina Evtimova, Andrew Drozdov, Douwe Kiela, Kyunghyun Cho
Inspired by previous work on emergent communication in referential games, we propose a novel multi-modal, multi-step referential game, where the sender and receiver have access to distinct modalities of an object, and their information exchange is bidirectional and of arbitrary duration. The multi-modal multi-step setting allows agents to develop an internal communication significantly closer to natural language, in that they share a single set of messages, and that the length of the conversation may vary according to the difficulty of the task. We examine these properties empirically using a dataset consisting of images and textual descriptions of mammals, where the agents are tasked with identifying the correct object. Our experiments indicate that a robust and efficient communication protocol emerges, where gradual information exchange informs better predictions and higher communication bandwidth improves generalization.
We are able to accurately match between sets of images that come from domains that look very different but are uniquely related, without supervision. Unlike prior work in unsupervised visual analogy learning, we focus on exact matching rather than regression. For example: given a set of images of building facades and a set of colored segmentation maps of the same buildings, we retrieve the matching segmentation map for every building photo and vice versa, without ever seeing a supervised pair of photo and segmentation map. This is achieved by introducing exemplar-based constraints that encourage accurate exact matching between the image sets. Our method performs significantly better than state-of-the-art methods at matching. This can be used in some cases to significantly outperform previous methods at unsupervised translation. We also outperform the state of the art on large-angle point-cloud matching.
VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop
Yaniv Taigman, Lior Wolf, Adam Polyak, Eliya Nachmani
We present a new neural text-to-speech (TTS) method that is able to transform text to speech in voices that are sampled “in the wild”. Unlike other systems, our solution is able to deal with unconstrained voice samples. The network architecture is simpler than those in the existing literature and is based on a novel shifting buffer working memory. The speakers are represented by a short vector that can be fitted to new identities, even with only a few samples. Variability in the generated speech is achieved by priming the buffer prior to generating the audio. Experimental results on several datasets demonstrate convincing capabilities, making TTS accessible to a wider range of applications. In order to promote reproducibility, we release our source code and models on GitHub.
We study mathematical conditions under which the models of temporal data currently used in machine learning are insensitive to changes of rhythm such as accelerations and decelerations in the signals.
Countering Adversarial Images using Input Transformations
Chuan Guo, Mayank Rana, Moustapha Cisse, Laurens van der Maaten
This paper investigates strategies that defend against adversarial-example attacks on image-classification systems by transforming the inputs before feeding them to the system. Specifically, we study applying image transformations such as bit-depth reduction, JPEG compression, total variance minimization, and image quilting before feeding the image to a convolutional network classifier. Our experiments on ImageNet show that total variance minimization and image quilting are very effective defenses in practice, in particular, when the network is trained on transformed images. The strength of those defenses lies in their non-differentiable nature and their inherent randomness, which makes it difficult for an adversary to circumvent the defenses. Our best defense eliminates 60% of strong white-box and 90% of strong black-box attacks by a variety of major attack methods.
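Of the transformations listed above, bit-depth reduction is the simplest to illustrate. The following is a minimal sketch, assuming 8-bit pixel values in a flat list; the function name `reduce_bit_depth` and the default of 3 bits are illustrative assumptions, not details taken from the paper.

```python
def reduce_bit_depth(pixels, bits=3):
    """Quantize 8-bit pixel values (0..255) to 2**bits evenly spaced levels.

    Assumption: `pixels` is a flat list of integers; a real pipeline would
    apply the same mapping to every channel of an image array.
    """
    levels = (1 << bits) - 1  # index of the highest quantization level
    quantized = []
    for p in pixels:
        level = round(p / 255 * levels)               # snap to nearest level
        quantized.append(round(level / levels * 255))  # map back to 0..255
    return quantized


one_bit = reduce_bit_depth([100, 128], bits=1)
three_bit_values = set(reduce_bit_depth(list(range(256)), bits=3))
```

The transformed pixels would then be passed to the classifier in place of the original input, so that small adversarial perturbations are rounded away before the network sees them.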
Training on interpolations between training points and their labels leads to better generalization, more robustness to adversarial examples and corrupt labels, and better stability in adversarial training.
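The interpolation idea can be sketched in a few lines. This is a minimal illustration, assuming one-hot labels and plain Python lists; the helper names `mixup_pair` and `sample_lambda`, the Beta-distributed mixing weight, and the `alpha` default are assumptions for the sketch rather than details stated above.

```python
import random


def mixup_pair(x_a, y_a, x_b, y_b, lam):
    """Return the convex combination of two examples and of their labels."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


def sample_lambda(alpha=0.2, rng=random):
    """Draw a mixing weight in [0, 1].

    Assumption: the weight follows Beta(alpha, alpha); alpha is a
    hypothetical hyperparameter value chosen for illustration.
    """
    return rng.betavariate(alpha, alpha)


# Mix a class-0 example with a class-1 example using a fixed weight.
x_mix, y_mix = mixup_pair([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.75)
```

The model is then trained on `(x_mix, y_mix)` instead of the raw examples, so the target label is soft whenever the two inputs come from different classes.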
State-of-the-art methods for learning cross-lingual word embeddings have relied on bilingual dictionaries or parallel corpora. Recent studies showed that the need for parallel data supervision can be alleviated with character-level information. While these methods showed encouraging results, they are not on par with their supervised counterparts and are limited to pairs of languages sharing a common alphabet. In this work, we show that we can build a bilingual dictionary between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way. Without using any character information, our model even outperforms existing supervised methods on cross-lingual tasks for some language pairs. Our experiments demonstrate that our method works very well also for distant language pairs, like English-Russian or English-Chinese. We finally describe experiments on the English-Esperanto low-resource language pair, on which there only exists a limited amount of parallel data, to show the potential impact of our method in fully unsupervised machine translation. Our code, embeddings and dictionaries are publicly available on GitHub.
Machine translation has recently achieved impressive performance thanks to recent advances in deep learning and the availability of large-scale parallel corpora. There have been numerous attempts to extend these successes to low-resource language pairs, yet these still require tens of thousands of parallel sentences. In this work, we take this research direction to the extreme and investigate whether it is possible to learn to translate even without any parallel data. We propose a model that takes sentences from monolingual corpora in two different languages and maps them into the same latent space. By learning to reconstruct in both languages from this shared feature space, the model effectively learns to translate without using any labeled data. We demonstrate our model on two widely used datasets and two language pairs, reporting BLEU scores up to 32.8, without using even a single parallel sentence at training time.
Multi-Scale Dense Networks for Resource Efficient Image Classification
Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, Kilian Weinberger
In this paper we investigate image classification with computational resource limits at test time. Two such settings are: 1. anytime classification, where the network’s prediction for a test example is progressively updated, facilitating the output of a prediction at any time; and 2. budgeted batch classification, where a fixed amount of computation is available to classify a set of examples that can be spent unevenly across “easier” and “harder” inputs. In contrast to most prior work, such as the popular Viola and Jones algorithm, our approach is based on convolutional neural networks. We train multiple classifiers with varying resource demands, which we adaptively apply during test time. To maximally re-use computation between the classifiers, we incorporate them as early-exits into a single deep convolutional neural network and inter-connect them with dense connectivity. To facilitate high quality classification early on, we use a two-dimensional multi-scale network architecture that maintains coarse and fine level features throughout the network. Experiments on three image-classification tasks demonstrate that our framework substantially improves the existing state-of-the-art in both settings.
A theoretical paper on the reasons behind the empirical success of certain unsupervised learning methods that have been shown to be successful in mapping between visual domains.
We present graph attention networks (GATs), novel neural network architectures that operate on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. By stacking layers in which nodes are able to attend over their neighborhoods’ features, we enable (implicitly) specifying different weights to different nodes in a neighborhood, without requiring any kind of costly matrix operation (such as inversion) or depending on knowing the graph structure upfront. In this way, we address several key challenges of spectral-based graph neural networks simultaneously, and make our model readily applicable to inductive as well as transductive problems. Our GAT models have achieved or matched state-of-the-art results across four established transductive and inductive graph benchmarks: the Cora, Citeseer and Pubmed citation network datasets, as well as a protein-protein interaction dataset (wherein test graphs remain unseen during training).
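The masked attention step can be sketched as follows. This is a simplified, single-head illustration in plain Python: it assumes node features have already been linearly transformed, omits the output nonlinearity, and uses hypothetical names (`gat_layer`, `a_src`, `a_dst`) for the attention parameters.

```python
import math


def leaky_relu(value, slope=0.2):
    return value if value > 0 else slope * value


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gat_layer(h, neighbors, a_src, a_dst):
    """One simplified attention layer over a graph.

    h:         dict mapping node id -> feature list (assumed pre-transformed)
    neighbors: dict mapping node id -> list of node ids it may attend to;
               only these nodes receive a weight (the "masking")
    a_src, a_dst: hypothetical attention vectors for the two nodes of an edge
    """
    out = {}
    for i, nbrs in neighbors.items():
        # Unnormalized attention score for every neighbor j of node i.
        scores = [leaky_relu(dot(a_src, h[i]) + dot(a_dst, h[j])) for j in nbrs]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]  # softmax over the neighborhood
        # New feature of node i: attention-weighted sum of neighbor features.
        out[i] = [
            sum(alpha * h[j][d] for alpha, j in zip(alphas, nbrs))
            for d in range(len(h[i]))
        ]
    return out


features = {0: [1.0, 0.0], 1: [0.0, 1.0]}
graph = {0: [0, 1], 1: [0, 1]}
uniform = gat_layer(features, graph, a_src=[0.0, 0.0], a_dst=[0.0, 0.0])
```

With all-zero attention vectors every neighbor receives the same weight, so the layer reduces to plain neighborhood averaging; learned vectors let each node weight its neighbors differently without any global matrix operation.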
We study the properties of common loss surfaces through their Hessian matrix. In particular, in the context of deep learning, we empirically show that the spectrum of the Hessian is composed of two parts: (1) the bulk centered near zero, and (2) outliers away from the bulk.
Parametric adversarial divergences are good task losses for generative modeling
Pascal Vincent, Hugo Bérard, Ahmed Touati
We show why a recently proposed family of criteria for training systems to be capable of imagination is better suited for that task than traditionally used criteria.
Residual Connections Encourage Iterative Inference
Residual networks (Resnets) have become a prominent architecture in deep learning. However, a comprehensive understanding of Resnets is still a topic of ongoing research. A recent view argues that Resnets perform iterative refinement of features. We attempt to further expose properties of this aspect. To this end, we study Resnets both analytically and empirically. We formalize the notion of iterative refinement in Resnets by showing that residual architectures naturally encourage features to move along the negative gradient of loss during the feedforward phase. In addition, our empirical analysis suggests that Resnets are able to perform both representation learning and iterative refinement. In general, a Resnet block tends to concentrate representation learning behavior in the first few layers while higher layers perform iterative refinement of features. Finally, we observe that sharing residual layers naively leads to representation explosion and hurts generalization performance, and show that simple existing strategies can help alleviate this problem.
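One way to read the gradient claim above is through a first-order Taylor expansion. The following is a sketch in illustrative notation (the symbols $h_i$ for the features entering block $i$, $F_i$ for the residual function of that block, and $\mathcal{L}$ for the loss are assumptions introduced here, not notation from the summary):

```latex
h_{i+1} = h_i + F_i(h_i), \qquad
\mathcal{L}(h_{i+1}) \approx \mathcal{L}(h_i) + F_i(h_i)^{\top} \, \nabla_{h_i} \mathcal{L}(h_i)
```

Under this approximation, the loss decreases from one block to the next when $F_i(h_i)$ has a negative inner product with the gradient, that is, when the residual update points roughly along the negative gradient direction.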
We consider the problem of learning a one-hidden-layer neural network with Gaussian inputs. We design a non-convex objective function $G(\cdot)$ with nice landscape properties: 1. All local minima of $G$ are also global minima. 2. All global minima of $G$ correspond to the ground truth parameters. 3. The value and gradient of $G$ can be estimated using samples. With these properties, we show that stochastic gradient descent on $G$ can converge to the global minimum and learn the ground-truth parameters. We also prove a finite-sample complexity result and report simulations.
Social dilemmas, where mutual cooperation can lead to high payoffs but participants face incentives to cheat, are ubiquitous in multi-agent interaction. We wish to construct agents that cooperate with pure cooperators, avoid exploitation by pure defectors, and incentivize cooperation from the rest. However, often the actions taken by a partner are (partially) unobserved or the consequences of individual actions are hard to predict. We show that in a large class of games good strategies can be constructed by conditioning one’s behavior solely on outcomes (i.e., one’s past rewards). We call this consequentialist conditional cooperation. We show how to construct such strategies and show, both analytically and experimentally, that they are effective in social dilemmas beyond simple matrix games. We also show the limitations of relying purely on consequences and discuss the need for understanding both the consequences of and the intentions behind an action.
This paper analyzes the dynamics of gradient descent in a one-layered network with non-Gaussian input, and shows that gradient descent can still converge, opening the door to a novel way of proving convergence for non-convex models such as neural networks.
This paper proposes House3D, an interactive environment that allows a virtual agent to interact with rich and diverse 3D scenes with fully labeled objects. The environment is built from the SUNCG dataset, and is efficient in terms of simulation and flexible in design. Agents trained to navigate to a high-level concept (e.g., kitchen) using multiple baseline approaches and House3D have shown generalization capability over unseen environments (i.e., can move to a room labeled by a given concept in an unseen environment).
We present a method, NAM, to map images across domains without using cycles or GANs (between the domains). NAM requires as input a pre-trained unconditional generator for the target domain. Our method works well on unsupervised image translation tasks, and is able to output a variety of correctly mapped images for each input image. The visual quality of NAM results is typically better than that of competing methods, and it also has advantages in terms of training speed, stability and sample complexity.
We consider the problem of detecting out-of-distribution images in neural networks. We propose ODIN, a simple and effective method that does not require any change to a pre-trained neural network. Our method is based on the observation that using temperature scaling and adding small perturbations to the input can separate the softmax score distributions between in- and out-of-distribution images, allowing for more effective detection. We show in a series of experiments that ODIN is compatible with diverse network architectures and datasets. It consistently outperforms the baseline approach (Hendrycks & Gimpel, 2017) by a large margin, establishing a new state-of-the-art performance on this task. For example, ODIN reduces the false positive rate from the baseline 34.7% to 4.3% on the DenseNet (applied to CIFAR-10 and Tiny-ImageNet) when the true positive rate is 95%.
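The two ingredients named above, temperature scaling and a small input perturbation, can be sketched on a toy model. This is a minimal illustration in plain Python, assuming a linear classifier as a hypothetical stand-in for the pre-trained network (so the input gradient has a closed form); the function names and the `temperature` and `epsilon` defaults are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def linear_logits(weights, x):
    """Hypothetical stand-in for a pre-trained network: one weight row per class."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def sign(value):
    return (value > 0) - (value < 0)


def odin_score(weights, x, temperature=1000.0, epsilon=0.001):
    """Max temperature-scaled softmax probability after perturbing the input."""
    probs = softmax(linear_logits(weights, x), temperature)
    pred = probs.index(max(probs))
    # Gradient of log p[pred] with respect to x; closed form for a linear model.
    grad = [
        (weights[pred][i] - sum(p * row[i] for p, row in zip(probs, weights)))
        / temperature
        for i in range(len(x))
    ]
    # Nudge the input in the direction that raises the predicted-class score.
    perturbed = [v + epsilon * sign(g) for v, g in zip(x, grad)]
    return max(softmax(linear_logits(weights, perturbed), temperature))


weights = [[1.0, 0.0], [0.0, 1.0]]
x = [2.0, 1.0]
baseline = max(softmax(linear_logits(weights, x), 1000.0))
score = odin_score(weights, x)
```

An input would be flagged as out-of-distribution when its score falls below a threshold chosen on held-out data. With a real network, the gradient would come from backpropagation rather than the closed form used here.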
We investigate mathematical tricks to be able to train an artificial intelligence system in less time.
In this paper, we propose a general framework to discover the causal direction between two entities of any type, for example “virus” and “death”, or a painting and its forgery. We then run experiments with practical implementations of our method, with good empirical results.