Publication

Learning Word Vectors for 157 Languages

International Conference on Language Resources and Evaluation (LREC)


Abstract

Distributed word representations, or word vectors, have recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient to the successful application of these representations is to train them on very large corpora, and use these pre-trained models in downstream tasks. In this paper, we describe how we trained such high quality word representations for 157 languages. We used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the common crawl project. We also introduce three new word analogy datasets to evaluate these word vectors, for French, Hindi and Polish. Finally, we evaluate our pre-trained word vectors on 10 languages for which evaluation datasets exists, showing very strong performance compared to previous models.

Related Publications

All Publications

Robust Market Equilibria with Uncertain Preferences

Riley Murray, Christian Kroer, Alex Peysakhovich, Parikshit Shah

AAAI - February 12, 2020

Weak-Attention Suppression For Transformer Based Speech Recognition

Yangyang Shi, Yongqiang Wang, Chunyang Wu, Christian Fuegen, Frank Zhang, Duc Le, Ching-Feng Yeh, Michael L. Seltzer

Interspeech - October 26, 2020

Machine Learning in Compilers: Past, Present, and Future

Hugh Leather, Chris Cummins

FDL - September 14, 2020

Unsupervised Cross-Domain Singing Voice Conversion

Adam Polyak, Lior Wolf, Yossi Adi, Yaniv Taigman

Interspeech - August 8, 2020

To help personalize content, tailor and measure ads, and provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Facebook through cookies. Learn more, including about available controls: Cookies Policy