The following models are currently available (models in bold can also be queried via the web interface):
| Identifier | Download | Corpus | Corpus size | Vocabulary size | Frequency threshold | Tagset | Algorithm | Vector size | Window size | SimLex965 | Google Analogies | Creation date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `taiga_upos_skipgram_300_2_2018` | 331 Mbytes | Taiga | near 5 billion words | 237 255 | 200 | Universal Tags | Continuous Skipgram | 300 | 2 | 0.41 | 0.53 | June 2018 |
| `ruscorpora_upos_skipgram_300_5_2018` | 191 Mbytes | RNC | 250 million words | 195 071 | 20 | Universal Tags | Continuous Skipgram | 300 | 5 | 0.37 | 0.58 | January 2018 |
| `ruwikiruscorpora_upos_skipgram_300_2_2018` | 376 Mbytes | RNC and Wikipedia December 2017 | 600 million words | 384 764 | 40 | Universal Tags | Continuous Skipgram | 300 | 2 | 0.34 | 0.70 | January 2018 |
| `news_upos_cbow_600_2_2018` | 547 Mbytes | Russian news, from September 2013 until November 2016 | near 5 billion words | 289 191 | 200 | Universal Tags | Continuous Bag-of-Words | 600 | 2 | 0.35 | 0.46 | January 2018 |
| `araneum_upos_skipgram_300_2_2018` | 192 Mbytes | Araneum Russicum Maximum | about 10 billion words | 196 620 | 400 | Universal Tags | Continuous Skipgram | 300 | 2 | 0.38 | 0.65 | January 2018 |
| `araneum_none_fasttextcbow_300_5_2018` | 1 Gbyte | Araneum Russicum Maximum | about 10 billion words | 195 782 | 400 | None | fastText CBOW (3–5-grams) | 300 | 5 | 0.35 | 0.73 | March 2018 |
| `araneum_none_fasttextskipgram_300_5_2018` | 675 Mbytes | Araneum Russicum Maximum | about 10 billion words | 195 782 | 400 | None | fastText Skipgram (3-grams) | 300 | 5 | 0.36 | 0.71 | January 2018 |
| `ruwikiruscorpora-nobigrams_upos_skipgram_300_5_2018` | 385 Mbytes | RNC and Wikipedia December 2017 (without merging bigrams) | 600 million words | 394 332 | 40 | Universal Tags | Continuous Skipgram | 300 | 5 | 0.31 | 0.72 | January 2018 |
| `ruwikiruscorpora-superbigrams_skipgram_300_2_2018` | 735 Mbytes | RNC and Wikipedia December 2017 (unlimited bigram merging) | 600 million words | 746 695 | 40 | Universal Tags | Continuous Skipgram | 300 | 2 | 0.28 | 0.71 | January 2018 |
| `ruscorpora_upos_skipgram_300_10_2017` | 200 Mbytes | RNC | 250 million words | 184 973 | 10 | Universal Tags | Continuous Skipgram | 300 | 10 | 0.35 | 0.65 | January 2017 |
| `ruwikiruscorpora_upos_cbow_300_20_2017` | 420 Mbytes | RNC and Wikipedia November 2016 | 600 million words | 392 339 | 15 | Universal Tags | Continuous Bag-of-Words | 300 | 20 | 0.33 | 0.70 | January 2017 |
| `web_upos_cbow_300_20_2017` | 290 Mbytes | Web corpus December 2014 | 900 million words | 267 540 | 30 | Universal Tags | Continuous Bag-of-Words | 300 | 20 | 0.32 | 0.67 | January 2017 |
| `news_upos_cbow_300_2_2017` | 130 Mbytes | Russian news, from September 2013 until November 2016 | near 5 billion words | 194 058 | 200 | Universal Tags | Continuous Bag-of-Words | 300 | 2 | 0.33 | 0.42 | February 2017 |
| `araneum_upos_skipgram_600_2_2017` | 419 Mbytes | Araneum Russicum Maximum | about 10 billion words | 196 465 | 400 | Universal Tags | Continuous Skipgram | 600 | 2 | 0.41 | 0.59 | June 2017 |
| `ruscorpora_upos_skipgram_600_10_2017` | 371 Mbytes | RNC | 250 million words | 173 816 | 10 | Universal Tags | Continuous Skipgram | 600 | 2 | 0.43 | 0.29 | June 2017 |
| `ruscorpora_mystem_cbow_300_2_2015` | 303 Mbytes | RNC | 107 million words | 281 776 | 3 | Mystem | Continuous Bag-of-Words | 300 | 2 | 0.38 | 0.27 | December 2015 |
| `ruwikiruscorpora_mystem_cbow_500_2_2015` | 1100 Mbytes | RNC and Wikipedia 2015 | 280 million words | 604 043 | 5 | Mystem | Continuous Bag-of-Words | 500 | 2 | 0.38 | 0.38 | March 2015 |
| `web_mystem_skipgram_500_2_2015` | 630 Mbytes | Web corpus, December 2014 | 660 million words | 353 608 | 30 | Mystem | Continuous Skipgram | 500 | 2 | 0.31 | 0.52 | November 2015 |
| `news_mystem_skipgram_1000_20_2015` | 525 Mbytes | Russian news, from September 2013 to October 2015 | 2.5 billion words | 147 358 | 200 | Mystem | Continuous Skipgram | 1000 | 20 | 0.32 | 0.58 | December 2015 |
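The downloadable models come in the standard word2vec formats. As a minimal, library-free sketch of querying a model in the word2vec *text* format (the tokens, dimensions, and vector values below are invented for the demo; in practice you would read the downloaded file, e.g. with gensim's `KeyedVectors.load_word2vec_format`):

```python
# Sketch: parse word2vec text format and run a nearest-neighbour query.
# Toy data only; real models have hundreds of thousands of entries.
import numpy as np

def load_word2vec_text(lines):
    """Parse the word2vec text format: a header line "N dim", then one
    "token v1 ... vdim" line per word. Returns {token: unit vector}."""
    it = iter(lines)
    next(it)  # skip the "N dim" header
    vecs = {}
    for line in it:
        parts = line.rstrip().split(" ")
        v = np.array(parts[1:], dtype=float)
        vecs[parts[0]] = v / np.linalg.norm(v)  # normalise for cosine similarity
    return vecs

def most_similar(vecs, token, topn=1):
    """Rank all other tokens by cosine similarity to `token`."""
    q = vecs[token]
    scored = [(t, float(q @ v)) for t, v in vecs.items() if t != token]
    return sorted(scored, key=lambda x: -x[1])[:topn]

toy = [
    "3 3",
    "печь_NOUN 1.0 0.2 0.0",
    "духовка_NOUN 0.9 0.3 0.1",
    "бежать_VERB 0.0 0.1 1.0",
]
vecs = load_word2vec_text(toy)
print(most_similar(vecs, "печь_NOUN"))  # nearest token: духовка_NOUN
```

Note that query words must use the same `lemma_TAG` convention as the model's tagset column above (no tag at all for the `_none_` fastText models).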
Prior to training, all the corpora were tokenized, split into sentences, lemmatized, and PoS-tagged with UDPipe (in the case of Taiga, these steps were performed by the corpus creators). PoS tags comply with the Universal PoS Tags format (for instance, "печь_NOUN"). The table for converting Mystem tags to Universal tags is available here. Stop words (conjunctions, pronouns, prepositions, particles, etc.) were removed.
Corpus preprocessing is described in detail in our tutorial (in Russian). You can also use ready-made Python scripts that reproduce our preprocessing workflow on arbitrary raw Russian text: a UDPipe variant and a Mystem variant.
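The resulting token format can be illustrated with a minimal sketch. This is not the actual pipeline: the tagger output below is hard-coded (a real run would come from UDPipe or Mystem), and the exact set of stop-word PoS classes is an assumption for the demo.

```python
# Sketch: turn (lemma, UPOS) pairs into the "lemma_TAG" tokens the models use,
# dropping functional-word classes the way the published stop-word filter does.
FUNCTIONAL_POS = {"ADP", "CCONJ", "SCONJ", "PART", "PRON", "DET"}  # assumed stop-word classes

def to_model_tokens(tagged_sentence):
    """tagged_sentence: list of (lemma, upos) pairs produced by a tagger."""
    return [f"{lemma}_{pos}" for lemma, pos in tagged_sentence
            if pos not in FUNCTIONAL_POS]

# "она топит печь": the pronoun is removed, content words become tagged lemmas.
tagged = [("она", "PRON"), ("топить", "VERB"), ("печь", "NOUN")]
print(to_model_tokens(tagged))  # ['топить_VERB', 'печь_NOUN']
```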
Some strong two-word collocations (bigrams) were merged into a single token joined by the special delimiter "::" (this is especially common for proper names), for instance борис::гребенщиков_PROPN. For comparison, you can also download a model with no bigrams merged (ruwikiruscorpora-nobigrams_upos_skipgram_300_5_2018) and a model in which all bigrams belonging to productive types were merged regardless of their frequencies (ruwikiruscorpora-superbigrams_skipgram_300_2_2018).
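The merging step can be imitated with a plain frequency threshold. This is only an illustrative sketch: the real models relied on proper collocation statistics, the `min_count` value here is hypothetical, and the rule of keeping the second word's tag is simply chosen to match tokens like борис::гребенщиков_PROPN.

```python
# Sketch: merge frequent adjacent lemma pairs into one "lemma1::lemma2_TAG" token.
from collections import Counter

def merge_bigrams(sentences, min_count=2):
    """sentences: lists of (lemma, pos) pairs. Adjacent lemma pairs seen at
    least `min_count` times are merged into 'lemma1::lemma2_POS' (tag of the
    second word kept); all other words become ordinary 'lemma_POS' tokens."""
    counts = Counter((a[0], b[0]) for s in sentences for a, b in zip(s, s[1:]))
    merged = []
    for s in sentences:
        out, i = [], 0
        while i < len(s):
            if i + 1 < len(s) and counts[(s[i][0], s[i + 1][0])] >= min_count:
                out.append(f"{s[i][0]}::{s[i + 1][0]}_{s[i + 1][1]}")
                i += 2
            else:
                out.append(f"{s[i][0]}_{s[i][1]}")
                i += 1
        merged.append(out)
    return merged

sents = [[("борис", "PROPN"), ("гребенщиков", "PROPN"), ("петь", "VERB")],
         [("слушать", "VERB"), ("борис", "PROPN"), ("гребенщиков", "PROPN")]]
print(merge_bigrams(sents)[0])  # ['борис::гребенщиков_PROPN', 'петь_VERB']
```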
Evaluating the performance of distributional semantic models is a difficult problem in itself. Arguably, the best way is to test the models on the downstream tasks for which they will eventually be used (extrinsic evaluation). However, when one trains a general-purpose model (like the ones presented on this service), some intrinsic evaluation methods should be used to test the models' general ability to capture natural language regularities.
We evaluated our models using two well-known approaches: correlation with human judgements on the SimLex965 semantic similarity dataset and accuracy on the Google Analogies test set (the corresponding scores for each model are given in the table above).
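These two intrinsic tests correspond to the SimLex965 and Google Analogies columns in the table. A toy illustration of both, with all vectors and "human" scores invented for the demo (real evaluation runs over hundreds of items):

```python
# Sketch: SimLex-style rank correlation and a Google-Analogies-style query.
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman(xs, ys):
    """Spearman rank correlation (no tie handling; illustration only)."""
    rank = lambda v: {i: r for r, i in enumerate(sorted(range(len(v)), key=v.__getitem__))}
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    d2 = sum((rx[i] - ry[i]) ** 2 for i in range(n))
    return 1 - 6 * d2 / (n * (n * n - 1))

vecs = {"король_NOUN": np.array([1.0, 1.0, 0.0]),
        "королева_NOUN": np.array([1.0, 0.0, 1.0]),
        "мужчина_NOUN": np.array([0.5, 1.0, 0.0]),
        "женщина_NOUN": np.array([0.5, 0.0, 1.0]),
        "ребёнок_NOUN": np.array([0.2, 0.5, 0.5])}

# SimLex-style test: correlate model similarities with (invented) human scores.
pairs = [("король_NOUN", "королева_NOUN", 8.0),
         ("король_NOUN", "мужчина_NOUN", 6.0),
         ("мужчина_NOUN", "женщина_NOUN", 3.0)]
model = [cos(vecs[a], vecs[b]) for a, b, _ in pairs]
human = [h for _, _, h in pairs]
print(spearman(model, human))

# Analogy test: король - мужчина + женщина should land nearest to королева.
target = vecs["король_NOUN"] - vecs["мужчина_NOUN"] + vecs["женщина_NOUN"]
queries = {"король_NOUN", "мужчина_NOUN", "женщина_NOUN"}
best = max((w for w in vecs if w not in queries),
           key=lambda w: cos(vecs[w], target))
print(best)  # королева_NOUN
```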
Boris Orekhov, associate professor at NRU HSE, created "vector rephrasings" of classic works of Russian literature.