All the models can be downloaded and used for free, under the CC-BY license (models in bold are available for queries via our web interface and API).
|Persistent identifier||Download||Algorithm||Corpus||Corpus size||Tagset||Vector size||RUSSE'18||ParaPhraser||Creation date|
|araneum_lemmas_elmo_2048_2020||1.5 Gbytes||ELMo||Araneum (lemmas)||about 10 billion words||None||2048||0.91||0.56||October 2020|
|tayga_lemmas_elmo_2048_2019||1.7 Gbytes||ELMo||Taiga (lemmas)||near 5 billion words||None||2048||0.93||0.54||December 2019|
|ruwikiruscorpora_tokens_elmo_1024_2019||197 Mbytes||ELMo||RNC and Wikipedia December 2018 (tokens)||989 million words||None||1024||0.88||0.55||August 2019|
|ruwikiruscorpora_lemmas_elmo_1024_2019||197 Mbytes||ELMo||RNC and Wikipedia December 2018 (lemmas)||989 million words||None||1024||0.91||0.57||August 2019|
|Persistent identifier||Download||Corpus||Corpus size||Vocabulary size||Frequency threshold||Tagset||Algorithm||Vector size||Window size||SimLex965||Google Analogies||Creation date|
|ruwikiruscorpora_upos_cbow_300_10_2021||638 Mbytes||RNC and Wikipedia November 2021||1.2 billion words||249 333||5 (max vocab size 250K)||Universal Tags||Continuous Bag-of-Words||300||10||0.30||0.70||December 2021|
|geowac_lemmas_none_fasttextskipgram_300_5_2020||1.4 Gbytes||GeoWAC||2.1 billion words||154 923||150||None||fastText Skipgram (3..5-grams)||300||5||0.37||0.69||October 2020|
|geowac_tokens_none_fasttextskipgram_300_5_2020||1.8 Gbytes||GeoWAC||2.1 billion words||347 295||150||None||fastText Skipgram (3..5-grams)||300||5||0.33||0.58||October 2020|
|ruscorpora_upos_cbow_300_20_2019||462 Mbytes||RNC||270 million words||189 193||5 (max vocab size 250K)||Universal Tags||Continuous Bag-of-Words||300||20||0.36||0.60||January 2019|
|ruwikiruscorpora_upos_skipgram_300_2_2019||608 Mbytes||RNC and Wikipedia December 2018||788 million words||248 978||5 (max vocab size 250K)||Universal Tags||Continuous Skipgram||300||2||0.32||0.72||January 2019|
|tayga_upos_skipgram_300_2_2019||610 Mbytes||Taiga||near 5 billion words||249 565||5 (max vocab size 250K)||Universal Tags||Continuous Skipgram||300||2||0.42||0.58||January 2019|
|tayga_none_fasttextcbow_300_10_2019||2.6 Gbytes||Taiga||near 5 billion words||192 415||5 (max vocab size 250K)||None||fastText CBOW (3..5-grams)||300||10||0.37||0.71||January 2019|
|news_upos_skipgram_300_5_2019||611 Mbytes||Russian News texts||2.6 billion words||249 318||5 (max vocab size 250K)||Universal Tags||Continuous Skipgram||300||5||0.22||0.51||January 2019|
|araneum_none_fasttextcbow_300_5_2018||2.6 Gbytes||Araneum||about 10 billion words||195 782||400||None||fastText CBOW (3..5-grams)||300||5||0.35||0.73||March 2018|
|ruscorpora_none_fasttextskipgram_300_2_2019||2.4 Gbytes||RNC||270 million words||164 996||5 (max vocab size 250K)||None||fastText Skipgram (3..5-grams)||300||2||0.39||0.63||January 2019|
|ruwikiruscorpora-func_upos_skipgram_300_5_2019||606 Mbytes||RNC and Wikipedia December 2018 (including functional words)||788 million words||248 118||5 (max vocab size 250K)||Universal Tags||Continuous Skipgram||300||5||0.32||0.71||January 2019|
|tayga-func_upos_skipgram_300_5_2019||611 Mbytes||Taiga (including functional words)||near 5 billion words||249 946||5 (max vocab size 250K)||Universal Tags||Continuous Skipgram||300||5||0.41||0.57||January 2019|
|taiga_upos_skipgram_300_2_2018||331 Mbytes||Taiga||near 5 billion words||237 255||200||Universal Tags||Continuous Skipgram||300||2||0.41||0.53||June 2018|
|ruscorpora_upos_skipgram_300_5_2018||191 Mbytes||RNC||250 million words||195 071||20||Universal Tags||Continuous Skipgram||300||5||0.37||0.58||January 2018|
|ruwikiruscorpora_upos_skipgram_300_2_2018||376 Mbytes||RNC and Wikipedia December 2017||600 million words||384 764||40||Universal Tags||Continuous Skipgram||300||2||0.34||0.70||January 2018|
|news_upos_cbow_600_2_2018||547 Mbytes||Russian news, from September 2013 until November 2016||near 5 billion words||289 191||200||Universal Tags||Continuous Bag-of-Words||600||2||0.35||0.46||January 2018|
|araneum_upos_skipgram_300_2_2018||192 Mbytes||Araneum||about 10 billion words||196 620||400||Universal Tags||Continuous Skipgram||300||2||0.38||0.65||January 2018|
|araneum_none_fasttextskipgram_300_5_2018||2.5 Gbytes||Araneum||about 10 billion words||195 782||400||None||fastText Skipgram (3-grams)||300||5||0.36||0.71||January 2018|
|ruwikiruscorpora-nobigrams_upos_skipgram_300_5_2018||385 Mbytes||RNC and Wikipedia December 2017 (without merging bigrams)||600 million words||394 332||40||Universal Tags||Continuous Skipgram||300||5||0.31||0.72||January 2018|
|ruwikiruscorpora-superbigrams_skipgram_300_2_2018||735 Mbytes||RNC and Wikipedia December 2017 (unlimited bigram merging)||600 million words||746 695||40||Universal Tags||Continuous Skipgram||300||2||0.28||0.71||January 2018|
|ruscorpora_upos_skipgram_300_10_2017||200 Mbytes||RNC||250 million words||184 973||10||Universal Tags||Continuous Skipgram||300||10||0.35||0.65||January 2017|
|ruwikiruscorpora_upos_cbow_300_20_2017||420 Mbytes||RNC and Wikipedia November 2016||600 million words||392 339||15||Universal Tags||Continuous Bag-of-Words||300||20||0.33||0.70||January 2017|
|web_upos_cbow_300_20_2017||290 Mbytes||Web corpus December 2014||900 million words||267 540||30||Universal Tags||Continuous Bag-of-Words||300||20||0.32||0.67||January 2017|
|news_upos_cbow_300_2_2017||130 Mbytes||Russian news, from September 2013 until November 2016||near 5 billion words||194 058||200||Universal Tags||Continuous Bag-of-Words||300||2||0.33||0.42||February 2017|
|araneum_upos_skipgram_600_2_2017||419 Mbytes||Araneum||about 10 billion words||196 465||400||Universal Tags||Continuous Skipgram||600||2||0.41||0.59||June 2017|
|ruscorpora_upos_skipgram_600_10_2017||371 Mbytes||RNC||250 million words||173 816||10||Universal Tags||Continuous Skipgram||600||2||0.43||0.29||June 2017|
|ruscorpora_mystem_cbow_300_2_2015||303 Mbytes||RNC||107 million words||281 776||3||Mystem||Continuous Bag-of-Words||300||2||0.38||0.27||December 2015|
|ruwikiruscorpora_mystem_cbow_500_2_2015||1100 Mbytes||RNC and Wikipedia 2015||280 million words||604 043||5||Mystem||Continuous Bag-of-Words||500||2||0.38||0.38||March 2015|
|web_mystem_skipgram_500_2_2015||630 Mbytes||Web corpus, December 2014||660 million words||353 608||30||Mystem||Continuous Skipgram||500||2||0.31||0.52||November 2015|
|news_mystem_skipgram_1000_20_2015||525 Mbytes||Russian news, from September 2013 to October 2015||2.5 billion words||147 358||200||Mystem||Continuous Skipgram||1000||20||0.32||0.58||December 2015|
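The identifiers of the static models in the table above appear to end with the algorithm, vector size, window size and year, in that order. The following is a minimal sketch of splitting an identifier into those fields; the names `StaticModelId` and `parse_static_id` are hypothetical, and the sketch does not cover ELMo identifiers, which have no window field.

```python
from typing import NamedTuple


class StaticModelId(NamedTuple):
    prefix: str       # corpus name plus optional tagset, e.g. "tayga_upos"
    algorithm: str    # e.g. "skipgram", "cbow", "fasttextcbow"
    vector_size: int
    window: int
    year: int


def parse_static_id(identifier: str) -> StaticModelId:
    """Split from the right, because the corpus/tagset prefix varies in length."""
    prefix, algorithm, size, window, year = identifier.rsplit("_", 4)
    return StaticModelId(prefix, algorithm, int(size), int(window), int(year))


print(parse_static_id("ruwikiruscorpora_upos_cbow_300_10_2021"))
print(parse_static_id("geowac_lemmas_none_fasttextskipgram_300_5_2020"))
```

Splitting from the right keeps the sketch working for identifiers whose prefix contains extra parts, such as `geowac_lemmas_none`, or no tagset at all.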
Prior to training, all the corpora were tokenized, split into sentences, lemmatized (except for the models containing "_tokens_" in their identifiers) and PoS-tagged using UDPipe (except for fastText and ELMo models). PoS tags comply with the Universal PoS Tags format (for instance, "печь_NOUN"). The table of conversion from the Mystem tags to the Universal tags is available here. Stop words (punctuation, conjunctions, pronouns, prepositions, particles, etc.) were removed, except for the models containing "_tokens_" or "-func" in their identifiers.
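To illustrate the resulting token format, here is a minimal sketch that turns already lemmatized and tagged words into "lemma_UPOS" tokens and drops stop words. The function name `to_model_tokens`, the `STOP_UPOS` set and the lowercasing step are assumptions made for this example; the exact stop-word list used for the published models is not reproduced here.

```python
# Hypothetical set of Universal PoS tags treated as stop words in this sketch.
STOP_UPOS = {"PUNCT", "CCONJ", "SCONJ", "PRON", "ADP", "PART"}


def to_model_tokens(tagged, keep_function_words=False):
    """Turn (lemma, upos) pairs into 'lemma_UPOS' tokens, optionally keeping stop words."""
    tokens = []
    for lemma, upos in tagged:
        if not keep_function_words and upos in STOP_UPOS:
            continue
        tokens.append(f"{lemma.lower()}_{upos}")
    return tokens


# Hand-tagged example; in practice the pairs would come from a tagger such as UDPipe.
sentence = [("печь", "NOUN"), ("в", "ADP"), ("дом", "NOUN"), (".", "PUNCT")]
print(to_model_tokens(sentence))
print(to_model_tokens(sentence, keep_function_words=True))
```

Tokens built this way can be used to query the PoS-tagged models; the fastText and ELMo models expect words without the tag suffix.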
Corpus preprocessing is described in detail in our tutorial (in Russian). You can also use ready-made Python scripts reproducing our pre-processing workflow on arbitrary raw Russian text: UDPipe variant and Mystem variant.
Some strong two-word collocations (bigrams) were merged into one token with the special character sequence "::" (especially common for proper names), for instance, борис::гребенщиков_PROPN. For comparison, you can also download the model with no bigrams merged (ruwikiruscorpora-nobigrams_upos_skipgram_300_5_2018) and the model in which all the bigrams belonging to productive types were merged, regardless of their frequencies (ruwikiruscorpora-superbigrams_skipgram_300_2_2018).
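The merging step can be sketched as follows, assuming a list of collocations is already known. The function `merge_bigrams` and the way collocations are supplied are hypothetical; how the strong collocations were actually selected is not covered by this example.

```python
def merge_bigrams(lemmas, collocations, sep="::"):
    """Greedily join adjacent lemma pairs listed in `collocations` into one token."""
    merged = []
    i = 0
    while i < len(lemmas):
        pair = tuple(lemmas[i:i + 2])
        if len(pair) == 2 and pair in collocations:
            merged.append(sep.join(pair))
            i += 2  # skip both members of the merged pair
        else:
            merged.append(lemmas[i])
            i += 1
    return merged


# Toy collocation set, invented for illustration.
collocations = {("борис", "гребенщиков")}
print(merge_bigrams(["певец", "борис", "гребенщиков", "выступать"], collocations))
```

When querying a model for a merged bigram, the same "::" separator has to be used in the query token.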
Vectors in the models released from 2019 onward are not unit-normed. ELMo models are trained on corpora with minimal pre-processing: tokenization and (optionally) lemmatization using UDPipe. No stop words were removed, and PoS tags were not appended to words.
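Because vectors that are not unit-normed can have arbitrary lengths, a plain dot product between them is not a cosine similarity. A minimal sketch of normalizing vectors before comparing them, using toy two-dimensional vectors and hypothetical helper names:

```python
import math


def unit_norm(vector):
    """Scale a vector to length 1; a zero vector is returned unchanged."""
    length = math.sqrt(sum(x * x for x in vector))
    return list(vector) if length == 0 else [x / length for x in vector]


def cosine(u, v):
    """Cosine similarity computed as the dot product of unit-normed vectors."""
    return sum(a * b for a, b in zip(unit_norm(u), unit_norm(v)))


a = [3.0, 4.0]   # toy vectors, not taken from any model
b = [6.0, 8.0]   # same direction as `a`, twice the length
c = [-4.0, 3.0]  # orthogonal to `a`

print(cosine(a, b))
print(cosine(a, c))
```

Libraries for working with word embeddings usually handle this normalization internally when computing similarities; the sketch only matters when the raw vectors are used directly.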
To work with ELMo models locally, we recommend our easy-to-use Python library.
Evaluating the performance of distributional semantic models is a difficult problem in itself. The best way is arguably to test the models on the downstream tasks for which they will eventually be used (extrinsic evaluation). However, if one trains a general-purpose model (like the ones presented at this service), some intrinsic evaluation methods should be used, testing the general ability of the models to capture natural language regularities.
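One common intrinsic evaluation compares model similarities for word pairs with human similarity judgments. The sketch below assumes a rank correlation (Spearman's) as the score, which is the usual choice for similarity test sets but is an assumption here; the data are invented toy values and the helper names are hypothetical.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    dev_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (dev_x * dev_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


human = [9.1, 7.4, 3.2, 1.0]     # toy human similarity ratings for four word pairs
model = [0.81, 0.40, 0.55, 0.05]  # toy cosine similarities for the same pairs
print(spearman(human, model))
```

Only the ordering of the pairs matters for this score, so it is unaffected by the different scales of human ratings and cosine similarities.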
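We evaluated our static embedding models using two well-known approaches, reported in the SimLex965 and Google Analogies columns of the table above: semantic similarity judgments and analogy solving.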
ELMo models are evaluated on the word sense disambiguation (WSD) task, using the Human-Annotated Sense-Disambiguated Word Contexts for Russian test set (a.k.a. RUSSE'18). The final score is macro-F1 across all the test set words. Read more about evaluating ELMo models in our paper. Additionally, we use a document pair classification task, with the ParaPhraser project corpus as the gold-standard data source.
Download all our test sets (raw or PoS-tagged).
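As a rough illustration of a macro-F1 score aggregated over test words, the sketch below computes an unweighted mean of per-sense F1 scores for each word and then averages over words. Whether the published scores are aggregated in exactly this way is an assumption; the sense labels are invented, and `macro_f1` is a hypothetical helper.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in sorted(set(gold) | set(predicted)):
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy gold and predicted sense labels per target word (invented for illustration).
results = {
    "замок": (["1", "1", "2", "2"], ["1", "2", "2", "2"]),
    "лук": (["1", "2", "2"], ["1", "2", "2"]),
}

per_word = {word: macro_f1(gold, pred) for word, (gold, pred) in results.items()}
overall = sum(per_word.values()) / len(per_word)
print(per_word)
print(overall)
```

Macro-averaging gives every sense and every word the same weight, so rare senses influence the score as much as frequent ones.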
Boris Orekhov, associate professor at NRU HSE, created "vector rephrasings" of classic Russian literature works.