Mining 135M parallel sentences in 1620 language pairs from Wikipedia

WikiMatrix is the corpus of parallel sentences from the project described in the paper "WikiMatrix: Mining 135M parallel sentences in 1620 language pairs from Wikipedia". The goal of the project is to mine parallel sentences from the textual content of Wikipedia for all possible language pairs. The paper presents an approach based on multilingual sentence embeddings that automatically extracts parallel sentences from Wikipedia articles in 85 languages, including several dialects and low-resource languages.

We use LASER’s bitext mining approach and its encoder for 93 languages [2,3]. We do not rely on the inter-language links provided by Wikipedia; instead, we search over all Wikipedia articles of each language. To handle the computational challenge of mining almost 600 million sentences, we use fast indexing and similarity search with FAISS. Before mining parallel sentences, we perform sentence segmentation, deduplication, and language identification.
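
To make the mining step more concrete, here is a minimal sketch of margin-based nearest-neighbor scoring over sentence embeddings, the general idea behind this style of bitext mining. It is an illustration under stated assumptions, not the project's pipeline: the embeddings are hand-made toy vectors rather than LASER encoder output, a brute-force pure-Python search stands in for a FAISS index, only the source-to-target direction is scored, and the names `mine_pairs`, `knn_mean`, `k`, and `threshold` (including their default values) are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


def knn_mean(query, candidates, k):
    """Average similarity between `query` and its k most similar candidates."""
    sims = sorted((cosine(query, c) for c in candidates), reverse=True)
    top = sims[:k]
    return sum(top) / len(top)


def mine_pairs(src_emb, tgt_emb, k=2, threshold=1.04):
    """Return (src_index, tgt_index, score) for source sentences whose best
    target candidate reaches `threshold` under a ratio-margin score.

    Assumption: the score is cos(x, y) divided by the mean of the average
    k-nearest-neighbor similarities of x and y. The brute-force loops are a
    stand-in for an approximate index; at hundreds of millions of sentences
    an exhaustive search like this would be far too slow. The sketch also
    assumes the neighborhood averages are positive, as they are for the toy
    data below.
    """
    src = [normalize(v) for v in src_emb]
    tgt = [normalize(v) for v in tgt_emb]

    # Neighborhood averages in both directions, used to normalize raw cosines.
    src_avg = [knn_mean(s, tgt, k) for s in src]
    tgt_avg = [knn_mean(t, src, k) for t in tgt]

    pairs = []
    for i, s in enumerate(src):
        best_j, best_score = -1, float("-inf")
        for j, t in enumerate(tgt):
            score = cosine(s, t) / ((src_avg[i] + tgt_avg[j]) / 2)
            if score > best_score:
                best_j, best_score = j, score
        if best_score >= threshold:
            pairs.append((i, best_j, best_score))
    return pairs


if __name__ == "__main__":
    # Toy 3-dimensional "embeddings"; real sentence embeddings are much larger.
    src_emb = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
    tgt_emb = [[0.1, 0.0, 0.9], [0.9, 0.2, 0.0], [0.0, 0.8, 0.2]]
    for i, j, score in mine_pairs(src_emb, tgt_emb):
        print(f"source {i} <-> target {j} (margin score {score:.3f})")
```

Dividing by the neighborhood averages penalizes sentences that are close to many candidates at once, which a fixed cosine threshold cannot do. Raising `threshold` trades recall for precision.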