May 7, 2018
A Corpus for Multilingual Document Classification in Eight Languages
Language Resources and Evaluation Conference (LREC)
In this paper, we propose a new subset of the Reuters corpus with balanced class priors for eight languages. By adding Italian, Russian, Japanese and Chinese, we cover languages that differ substantially in syntax, morphology, and other respects. We provide strong baselines for all language transfer directions, using multilingual word embeddings and multilingual sentence embeddings. Our goal is to offer a freely available framework for evaluating cross-lingual document classification, and we hope that it will foster research in this important area.
By: Holger Schwenk, Xian Li