LASER is a library to calculate multilingual sentence embeddings.

Currently, we include an encoder which supports nine European languages:

  • Germanic languages: English, German, Dutch, Danish
  • Romance languages: French, Spanish, Italian, Portuguese
  • Uralic languages: Finnish

All these languages are encoded by the same BLSTM encoder, so there is no need to specify the input language (although tokenization is language-specific). In our experience, the sentence encoder also supports code-switching, i.e. the same sentence can contain words in several different languages.

We also have some evidence that the encoder generalizes to some extent to other languages of the Germanic and Romance language families (e.g. Swedish, Norwegian, Afrikaans, Catalan or Corsican), although no data in these languages was used during training.

We showcase several applications of multilingual sentence embeddings.
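
As an illustration of why a shared embedding space is useful, the sketch below matches English sentences to their closest French counterparts by cosine similarity. It does not call LASER itself: the three-dimensional vectors are made-up stand-ins for real sentence embeddings, and the helper names `cosine` and `nearest` are hypothetical, chosen only for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, candidates):
    """Return (sentence, score) for the candidate most similar to `query`."""
    scored = ((sent, cosine(query, vec)) for sent, vec in candidates.items())
    return max(scored, key=lambda item: item[1])


# Toy vectors only: real embeddings would come from the sentence encoder
# and have far more dimensions. These values are invented for illustration.
english = {
    "The cat sleeps.": [0.90, 0.10, 0.20],
    "It is raining.": [0.10, 0.80, 0.30],
}
french = {
    "Le chat dort.": [0.88, 0.15, 0.18],
    "Il pleut.": [0.12, 0.79, 0.33],
}

# Because both languages live in one space, no language tag is needed:
# each English vector is compared directly against every French vector.
pairs = {src: nearest(vec, french)[0] for src, vec in english.items()}

for src, tgt in pairs.items():
    print(f"{src} -> {tgt}")
```

With real embeddings, the same nearest-neighbour comparison would apply unchanged across any of the supported languages. An exhaustive scan like this is only practical for small collections.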

License

The LASER source code is licensed under the license found in the LICENSE file in the root directory of the source tree on GitHub.

References

Holger Schwenk and Matthijs Douze, Learning Joint Multilingual Sentence Representations with Neural Machine Translation, ACL workshop on Representation Learning for NLP, 2017.

Holger Schwenk and Xian Li, A Corpus for Multilingual Document Classification in Eight Languages, LREC, pages 3548-3551, 2018.

Holger Schwenk, Filtering and Mining Parallel Data in a Joint Multilingual Space, ACL, July 2018.