Models

The following models are currently available (some of them can also be queried via our web interface):

| Persistent identifier | Download size | Corpus | Corpus size | Vocabulary size | Frequency threshold | Tagset | Algorithm | Vector size | Window size | SimLex965 (Spearman ρ) | Google Analogies (accuracy) | Creation date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| taiga_upos_skipgram_300_2_2018 | 331 MB | Taiga | nearly 5 billion words | 237 255 | 200 | Universal Tags | Continuous Skipgram | 300 | 2 | 0.41 | 0.53 | June 2018 |
| ruscorpora_upos_skipgram_300_5_2018 | 191 MB | RNC | 250 million words | 195 071 | 20 | Universal Tags | Continuous Skipgram | 300 | 5 | 0.37 | 0.58 | January 2018 |
| ruwikiruscorpora_upos_skipgram_300_2_2018 | 376 MB | RNC and Wikipedia (December 2017) | 600 million words | 384 764 | 40 | Universal Tags | Continuous Skipgram | 300 | 2 | 0.34 | 0.70 | January 2018 |
| news_upos_cbow_600_2_2018 | 547 MB | Russian news, September 2013 to November 2016 | nearly 5 billion words | 289 191 | 200 | Universal Tags | Continuous Bag-of-Words | 600 | 2 | 0.35 | 0.46 | January 2018 |
| araneum_upos_skipgram_300_2_2018 | 192 MB | Araneum | about 10 billion words | 196 620 | 400 | Universal Tags | Continuous Skipgram | 300 | 2 | 0.38 | 0.65 | January 2018 |
| araneum_none_fasttextcbow_300_5_2018 | 1 GB | Araneum | about 10 billion words | 195 782 | 400 | None | fastText CBOW (3- to 5-grams) | 300 | 5 | 0.35 | 0.73 | March 2018 |
| araneum_none_fasttextskipgram_300_5_2018 | 675 MB | Araneum | about 10 billion words | 195 782 | 400 | None | fastText Skipgram (3-grams) | 300 | 5 | 0.36 | 0.71 | January 2018 |
| ruwikiruscorpora-nobigrams_upos_skipgram_300_5_2018 | 385 MB | RNC and Wikipedia (December 2017, no bigram merging) | 600 million words | 394 332 | 40 | Universal Tags | Continuous Skipgram | 300 | 5 | 0.31 | 0.72 | January 2018 |
| ruwikiruscorpora-superbigrams_skipgram_300_2_2018 | 735 MB | RNC and Wikipedia (December 2017, unlimited bigram merging) | 600 million words | 746 695 | 40 | Universal Tags | Continuous Skipgram | 300 | 2 | 0.28 | 0.71 | January 2018 |
| ruscorpora_upos_skipgram_300_10_2017 | 200 MB | RNC | 250 million words | 184 973 | 10 | Universal Tags | Continuous Skipgram | 300 | 10 | 0.35 | 0.65 | January 2017 |
| ruwikiruscorpora_upos_cbow_300_20_2017 | 420 MB | RNC and Wikipedia (November 2016) | 600 million words | 392 339 | 15 | Universal Tags | Continuous Bag-of-Words | 300 | 20 | 0.33 | 0.70 | January 2017 |
| web_upos_cbow_300_20_2017 | 290 MB | Web corpus (December 2014) | 900 million words | 267 540 | 30 | Universal Tags | Continuous Bag-of-Words | 300 | 20 | 0.32 | 0.67 | January 2017 |
| news_upos_cbow_300_2_2017 | 130 MB | Russian news, September 2013 to November 2016 | nearly 5 billion words | 194 058 | 200 | Universal Tags | Continuous Bag-of-Words | 300 | 2 | 0.33 | 0.42 | February 2017 |
| araneum_upos_skipgram_600_2_2017 | 419 MB | Araneum | about 10 billion words | 196 465 | 400 | Universal Tags | Continuous Skipgram | 600 | 2 | 0.41 | 0.59 | June 2017 |
| ruscorpora_upos_skipgram_600_10_2017 | 371 MB | RNC | 250 million words | 173 816 | 10 | Universal Tags | Continuous Skipgram | 600 | 10 | 0.43 | 0.29 | June 2017 |
| ruscorpora_mystem_cbow_300_2_2015 | 303 MB | RNC | 107 million words | 281 776 | 3 | Mystem | Continuous Bag-of-Words | 300 | 2 | 0.38 | 0.27 | December 2015 |
| ruwikiruscorpora_mystem_cbow_500_2_2015 | 1 100 MB | RNC and Wikipedia (2015) | 280 million words | 604 043 | 5 | Mystem | Continuous Bag-of-Words | 500 | 2 | 0.38 | 0.38 | March 2015 |
| web_mystem_skipgram_500_2_2015 | 630 MB | Web corpus (December 2014) | 660 million words | 353 608 | 30 | Mystem | Continuous Skipgram | 500 | 2 | 0.31 | 0.52 | November 2015 |
| news_mystem_skipgram_1000_20_2015 | 525 MB | Russian news, September 2013 to October 2015 | 2.5 billion words | 147 358 | 200 | Mystem | Continuous Skipgram | 1000 | 20 | 0.32 | 0.58 | December 2015 |
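
A downloaded model can be queried locally, for example with gensim. The sketch below assumes the archive contains the vectors in the standard word2vec binary format under the name model.bin (both the format and the file name are assumptions here; the fastText models would instead need gensim's FastText loader):

```python
# A minimal sketch of querying one of the models with gensim.
# Assumption: the downloaded archive contains the vectors in the
# standard word2vec binary format as "model.bin".
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format(
    "ruwikiruscorpora_upos_skipgram_300_2_2018/model.bin", binary=True
)

# Tokens are lemmas with Universal PoS tags attached:
for word, similarity in model.most_similar("печь_NOUN", topn=5):
    print(f"{word}\t{similarity:.3f}")
```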

Corpora

  1. RNC: the full Russian National Corpus;
  2. Wikipedia: the Russian Wikipedia dump for the corresponding date;
  3. News: a news stream from 1 500 primarily Russian-language news sites (about 30 million documents in total in the recent models);
  4. Araneum Russicum Maximum: a large web corpus of Russian compiled by Vladimir Benko in 2016;
  5. Taiga: an open, structured Russian web corpus with morphological and syntactic annotation (poetry subcorpus excluded), compiled by Tatyana Shavrina;
  6. Web: random Russian web pages crawled in December 2014 (9 million documents in total).

Corpus preprocessing

Prior to training, all the corpora were tokenized, split into sentences, lemmatized, and PoS-tagged using UDPipe (in the case of Taiga, these steps were performed by the corpus creators). PoS tags follow the Universal PoS Tags format (for instance, "печь_NOUN"). The table for converting Mystem tags to Universal tags is available here. Stop words (conjunctions, pronouns, prepositions, particles, etc.) were removed.

Corpus preprocessing is described in detail in our tutorial (in Russian). You can also use ready-made Python scripts that reproduce our preprocessing workflow on arbitrary raw Russian text: a UDPipe variant and a Mystem variant.
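
As a rough illustration of that workflow (not the exact scripts linked above), here is a minimal sketch using the ufal.udpipe bindings. The UDPipe model file name is a placeholder, and the stop-word PoS list is an assumption based on the description above:

```python
# A minimal sketch of the lemma_POS preprocessing with ufal.udpipe.
# The model file name is a placeholder (any Russian UDPipe model will
# do), and the stop-word PoS list is an assumption; the ready-made
# scripts linked above are more thorough.
from ufal.udpipe import Model, Pipeline

STOP_POS = {"ADP", "AUX", "CCONJ", "DET", "PART", "PRON", "SCONJ", "PUNCT"}

udpipe_model = Model.load("russian-ud.udpipe")  # placeholder path
pipeline = Pipeline(udpipe_model, "tokenize",
                    Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")

def preprocess(text):
    """Turn raw Russian text into lemma_POS tokens, dropping stop words."""
    tokens = []
    for line in pipeline.process(text).splitlines():
        if not line or line.startswith("#"):
            continue  # skip CoNLL-U comments and sentence breaks
        columns = line.split("\t")  # CoNLL-U columns: ID FORM LEMMA UPOS ...
        if "-" in columns[0] or "." in columns[0]:
            continue  # skip multiword tokens and empty nodes
        lemma, upos = columns[2].lower(), columns[3]
        if upos not in STOP_POS:
            tokens.append(f"{lemma}_{upos}")
    return tokens

print(preprocess("Печь стояла в углу."))
# expected, roughly: ['печь_NOUN', 'стоять_VERB', 'угол_NOUN']
```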

Some strong two-word collocations (bigrams), especially proper names, were merged into a single token joined with the special delimiter "::", for instance борис::гребенщиков_PROPN. For comparison, you can also download a model with no bigrams merged (ruwikiruscorpora-nobigrams_upos_skipgram_300_5_2018) and a model in which all bigrams belonging to productive types were merged regardless of their frequency (ruwikiruscorpora-superbigrams_skipgram_300_2_2018).
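
This is not our exact procedure, but gensim's Phrases class (gensim 4 API) illustrates the underlying idea of merging statistically strong collocations. Note one difference: Phrases joins the full tokens, tags included, so its merges look slightly different from the tokens in our models:

```python
# Illustration of the idea, not the exact RusVectōrēs procedure.
# Phrases joins full tokens, so a merge here would yield
# 'борис_PROPN::гребенщиков_PROPN' rather than the models'
# 'борис::гребенщиков_PROPN'. The thresholds are arbitrary.
from gensim.models.phrases import Phrases

tagged_sentences = [
    ["борис_PROPN", "гребенщиков_PROPN", "спеть_VERB", "песня_NOUN"],
    # ... many more lemmatized, tagged sentences from the corpus
]
bigrams = Phrases(tagged_sentences, min_count=5, threshold=10.0, delimiter="::")
# Once the bigram statistics in the corpus are strong enough,
# applying the model merges the collocation into one token:
print(bigrams[tagged_sentences[0]])
```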

Evaluation

Evaluating the performance of distributional semantic models is a difficult problem in itself. The best way is arguably to test the models on the downstream tasks for which they will eventually be used (extrinsic evaluation). However, when one trains a general-purpose model (like the ones presented on this service), some intrinsic evaluation method should be used to test the models' general ability to capture natural-language regularities.

We evaluated our models using two well-known approaches:

  1. Spearman rank correlation between model-produced pairwise word similarities and human judgments from a manually annotated test set. We used the RuSimLex965 test set, based on the Multilingual SimLex999 dataset (see our paper about RuSimLex965).
  2. Accuracy in solving analogy problems. For this, we employed the Russian adaptation of the semantic part of the Google Analogies Dataset (translated by Tatyana Kononova; read more about the analogy task).
Download all our test sets (raw or PoS-tagged).
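
If you want to reproduce this kind of evaluation yourself, gensim provides ready-made routines for both tasks. In the sketch below, the test-set file names are placeholders, and the files are assumed to be in gensim's expected formats (tab-separated scored word pairs; Google questions-words-style analogies), with PoS-tagged lemmas matching the model's vocabulary:

```python
# A hedged sketch of both intrinsic evaluations with gensim.
# File names are placeholders; the word-pair file is assumed to be
# tab-separated (word1, word2, score), the analogy file to follow
# the Google questions-words format.
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format("model.bin", binary=True)

# Spearman correlation with human similarity judgments:
pearson, spearman, oov_ratio = model.evaluate_word_pairs("ru_simlex965_tagged.tsv")
print(f"SimLex Spearman: {spearman[0]:.2f}")

# Accuracy on the semantic analogy questions:
accuracy, sections = model.evaluate_word_analogies("ru_analogies_tagged.txt")
print(f"Analogy accuracy: {accuracy:.2f}")
```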

Who uses the RusVectōrēs models?

  1. drafterleo developed a poetic search engine.

  2. Boris Orekhov, associate professor at NRU HSE, created "vector rephrasings" of classic works of Russian literature.

  3. opennota developed a semantic search engine over Pushkin's poems.

  4. Sberbank AI employs our models as a baseline in ClassicAI, their shared task on poetry generation.

  5. ...

Have something to add to this list? Contact us!

RusVectōrēs Team

University of Oslo

Creative Commons License