Models

All models can be downloaded and used free of charge under the CC BY license (models in bold are also available for queries via our web interface and API).

Contextualized embeddings

| Persistent identifier | Download | Algorithm | Corpus | Corpus size | Tagset | Vector size | RUSSE'18 | ParaPhraser | Creation date |
|---|---|---|---|---|---|---|---|---|---|
| araneum_lemmas_elmo_2048_2020 | 1.5 Gbytes | ELMo | Araneum (lemmas) | about 10 billion words | None | 2048 | 0.91 | 0.56 | October 2020 |
| tayga_lemmas_elmo_2048_2019 | 1.7 Gbytes | ELMo | Taiga (lemmas) | near 5 billion words | None | 2048 | 0.93 | 0.54 | December 2019 |
| ruwikiruscorpora_tokens_elmo_1024_2019 | 197 Mbytes | ELMo | RNC and Wikipedia December 2018 (tokens) | 989 million words | None | 1024 | 0.88 | 0.55 | August 2019 |
| ruwikiruscorpora_lemmas_elmo_1024_2019 | 197 Mbytes | ELMo | RNC and Wikipedia December 2018 (lemmas) | 989 million words | None | 1024 | 0.91 | 0.57 | August 2019 |

Static embeddings

The table can (and should) be scrolled horizontally!
| Persistent identifier | Download | Corpus | Corpus size | Vocabulary size | Frequency threshold | Tagset | Algorithm | Vector size | Window size | SimLex965 | Google Analogies | Creation date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ukrconll_upos_cbow_200_10_2022 | 170 Mbytes | Ukrainian Wikipedia and CommonCrawl, 2017 | 300 million words | 99 884 | 5 (max vocab size 100K) | Universal Tags | Continuous Bag-of-Words | 200 | 10 | | | July 2022 |
| ruwikiruscorpora_upos_cbow_300_10_2021 | 638 Mbytes | RNC and Wikipedia November 2021 | 1.2 billion words | 249 333 | 5 (max vocab size 250K) | Universal Tags | Continuous Bag-of-Words | 300 | 10 | 0.30 | 0.70 | December 2021 |
| geowac_lemmas_none_fasttextskipgram_300_5_2020 | 1.4 Gbytes | GeoWAC | 2.1 billion words | 154 923 | 150 | None | fastText Skipgram (3..5-grams) | 300 | 5 | 0.37 | 0.69 | October 2020 |
| geowac_tokens_none_fasttextskipgram_300_5_2020 | 1.8 Gbytes | GeoWAC | 2.1 billion words | 347 295 | 150 | None | fastText Skipgram (3..5-grams) | 300 | 5 | 0.33 | 0.58 | October 2020 |
| ruscorpora_upos_cbow_300_20_2019 | 462 Mbytes | RNC | 270 million words | 189 193 | 5 (max vocab size 250K) | Universal Tags | Continuous Bag-of-Words | 300 | 20 | 0.36 | 0.60 | January 2019 |
| ruwikiruscorpora_upos_skipgram_300_2_2019 | 608 Mbytes | RNC and Wikipedia December 2018 | 788 million words | 248 978 | 5 (max vocab size 250K) | Universal Tags | Continuous Skipgram | 300 | 2 | 0.32 | 0.72 | January 2019 |
| tayga_upos_skipgram_300_2_2019 | 610 Mbytes | Taiga | near 5 billion words | 249 565 | 5 (max vocab size 250K) | Universal Tags | Continuous Skipgram | 300 | 2 | 0.42 | 0.58 | January 2019 |
| tayga_none_fasttextcbow_300_10_2019 | 2.6 Gbytes | Taiga | near 5 billion words | 192 415 | 5 (max vocab size 250K) | None | fastText CBOW (3..5-grams) | 300 | 10 | 0.37 | 0.71 | January 2019 |
| news_upos_skipgram_300_5_2019 | 611 Mbytes | Russian News texts | 2.6 billion words | 249 318 | 5 (max vocab size 250K) | Universal Tags | Continuous Skipgram | 300 | 5 | 0.22 | 0.51 | January 2019 |
| araneum_none_fasttextcbow_300_5_2018 | 2.6 Gbytes | Araneum | about 10 billion words | 195 782 | 400 | None | fastText CBOW (3..5-grams) | 300 | 5 | 0.35 | 0.73 | March 2018 |
| ruscorpora_none_fasttextskipgram_300_2_2019 | 2.4 Gbytes | RNC | 270 million words | 164 996 | 5 (max vocab size 250K) | None | fastText Skipgram (3..5-grams) | 300 | 2 | 0.39 | 0.63 | January 2019 |
| ruwikiruscorpora-func_upos_skipgram_300_5_2019 | 606 Mbytes | RNC and Wikipedia December 2018 (including functional words) | 788 million words | 248 118 | 5 (max vocab size 250K) | Universal Tags | Continuous Skipgram | 300 | 5 | 0.32 | 0.71 | January 2019 |
| tayga-func_upos_skipgram_300_5_2019 | 611 Mbytes | Taiga (including functional words) | near 5 billion words | 249 946 | 5 (max vocab size 250K) | Universal Tags | Continuous Skipgram | 300 | 5 | 0.41 | 0.57 | January 2019 |
| taiga_upos_skipgram_300_2_2018 | 331 Mbytes | Taiga | near 5 billion words | 237 255 | 200 | Universal Tags | Continuous Skipgram | 300 | 2 | 0.41 | 0.53 | June 2018 |
| ruscorpora_upos_skipgram_300_5_2018 | 191 Mbytes | RNC | 250 million words | 195 071 | 20 | Universal Tags | Continuous Skipgram | 300 | 5 | 0.37 | 0.58 | January 2018 |
| ruwikiruscorpora_upos_skipgram_300_2_2018 | 376 Mbytes | RNC and Wikipedia December 2017 | 600 million words | 384 764 | 40 | Universal Tags | Continuous Skipgram | 300 | 2 | 0.34 | 0.70 | January 2018 |
| news_upos_cbow_600_2_2018 | 547 Mbytes | Russian news, from September 2013 until November 2016 | near 5 billion words | 289 191 | 200 | Universal Tags | Continuous Bag-of-Words | 600 | 2 | 0.35 | 0.46 | January 2018 |
| araneum_upos_skipgram_300_2_2018 | 192 Mbytes | Araneum | about 10 billion words | 196 620 | 400 | Universal Tags | Continuous Skipgram | 300 | 2 | 0.38 | 0.65 | January 2018 |
| araneum_none_fasttextskipgram_300_5_2018 | 2.5 Gbytes | Araneum | about 10 billion words | 195 782 | 400 | None | fastText Skipgram (3-grams) | 300 | 5 | 0.36 | 0.71 | January 2018 |
| ruwikiruscorpora-nobigrams_upos_skipgram_300_5_2018 | 385 Mbytes | RNC and Wikipedia December 2017 (without merging bigrams) | 600 million words | 394 332 | 40 | Universal Tags | Continuous Skipgram | 300 | 5 | 0.31 | 0.72 | January 2018 |
| ruwikiruscorpora-superbigrams_skipgram_300_2_2018 | 735 Mbytes | RNC and Wikipedia December 2017 (unlimited bigram merging) | 600 million words | 746 695 | 40 | Universal Tags | Continuous Skipgram | 300 | 2 | 0.28 | 0.71 | January 2018 |
| ruscorpora_upos_skipgram_300_10_2017 | 200 Mbytes | RNC | 250 million words | 184 973 | 10 | Universal Tags | Continuous Skipgram | 300 | 10 | 0.35 | 0.65 | January 2017 |
| ruwikiruscorpora_upos_cbow_300_20_2017 | 420 Mbytes | RNC and Wikipedia November 2016 | 600 million words | 392 339 | 15 | Universal Tags | Continuous Bag-of-Words | 300 | 20 | 0.33 | 0.70 | January 2017 |
| web_upos_cbow_300_20_2017 | 290 Mbytes | Web corpus December 2014 | 900 million words | 267 540 | 30 | Universal Tags | Continuous Bag-of-Words | 300 | 20 | 0.32 | 0.67 | January 2017 |
| news_upos_cbow_300_2_2017 | 130 Mbytes | Russian news, from September 2013 until November 2016 | near 5 billion words | 194 058 | 200 | Universal Tags | Continuous Bag-of-Words | 300 | 2 | 0.33 | 0.42 | February 2017 |
| araneum_upos_skipgram_600_2_2017 | 419 Mbytes | Araneum | about 10 billion words | 196 465 | 400 | Universal Tags | Continuous Skipgram | 600 | 2 | 0.41 | 0.59 | June 2017 |
| ruscorpora_upos_skipgram_600_10_2017 | 371 Mbytes | RNC | 250 million words | 173 816 | 10 | Universal Tags | Continuous Skipgram | 600 | 2 | 0.43 | 0.29 | June 2017 |
| ruscorpora_mystem_cbow_300_2_2015 | 303 Mbytes | RNC | 107 million words | 281 776 | 3 | Mystem | Continuous Bag-of-Words | 300 | 2 | 0.38 | 0.27 | December 2015 |
| ruwikiruscorpora_mystem_cbow_500_2_2015 | 1100 Mbytes | RNC and Wikipedia 2015 | 280 million words | 604 043 | 5 | Mystem | Continuous Bag-of-Words | 500 | 2 | 0.38 | 0.38 | March 2015 |
| web_mystem_skipgram_500_2_2015 | 630 Mbytes | Web corpus, December 2014 | 660 million words | 353 608 | 30 | Mystem | Continuous Skipgram | 500 | 2 | 0.31 | 0.52 | November 2015 |
| news_mystem_skipgram_1000_20_2015 | 525 Mbytes | Russian news, from September 2013 to October 2015 | 2.5 billion words | 147 358 | 200 | Mystem | Continuous Skipgram | 1000 | 20 | 0.32 | 0.58 | December 2015 |
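
For illustration, here is a minimal usage sketch (not an official snippet) of loading one of the static models above with gensim and querying it. It assumes the downloaded archive contains a word2vec binary file; the file name model.bin is an assumption, and the fastText models are packaged differently (they are loaded with gensim's native fastText loaders instead).

```python
# A minimal, hedged usage sketch: loading a static model with gensim.
# "model.bin" is an assumed file name inside the downloaded archive.
import gensim

wv = gensim.models.KeyedVectors.load_word2vec_format("model.bin", binary=True)

# Most models expect lemmas with Universal PoS tags appended, e.g. "печь_NOUN".
print(wv.most_similar("печь_NOUN", topn=5))
print(wv.similarity("печь_NOUN", "плита_NOUN"))
```

Note that gensim normalizes vectors on the fly when computing cosine similarities, so the fact that the post-2019 models are not unit-normed does not change the results of such queries.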

Corpora

  1. RNC: full Russian National Corpus;
  2. Wikipedia: Russian Wikipedia dump for the corresponding date;
  3. News: a news stream from 1 500 primarily Russian-language news sites (about 30 million documents in total in the most recent models);
  4. Araneum Russicum Maximum: a large web corpus of Russian compiled by Vladimir Benko in 2016 (the lemmatized and tagged corpus is available for download);
  5. Taiga: an open, structured Russian web corpus with morphological and syntactic annotation (poetry subcorpus excluded), compiled by Tatyana Shavrina;
  6. GeoWAC: a geographically balanced sample of Russian-language documents from a CommonCrawl dump, compiled by Jonathan Dunn and Ben Adams;
  7. Web: random Russian web pages crawled in December 2014, 9 million documents in total.

Corpus preprocessing

Prior to training, all corpora were tokenized, split into sentences, lemmatized (except for the models containing "_tokens_" in their identifiers) and PoS-tagged using UDPipe (except for the fastText and ELMo models). PoS tags follow the Universal PoS Tags format (for instance, "печь_NOUN"). The table of conversion from Mystem tags to Universal tags is available here. Stop words (punctuation, conjunctions, pronouns, prepositions, particles, etc.) were removed, except for the models containing "_tokens_" or "_func" in their identifiers.

Corpus preprocessing is described in detail in our tutorial (in Russian). You can also use ready-made Python scripts that reproduce our preprocessing workflow on arbitrary raw Russian text: a UDPipe variant and a Mystem variant. A minimal sketch of this workflow is shown below.
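
The sketch below is not our exact script (see the links above for those); it only illustrates the "lemma_POS" conversion with the ufal.udpipe Python bindings. The model file name is a placeholder, and the set of functional PoS tags treated as stop words is an illustrative assumption.

```python
# A rough sketch of "lemma_POS" preprocessing with UDPipe (not the official script).
from ufal.udpipe import Model, Pipeline, ProcessingError

# Placeholder path: any Russian UDPipe model file can be used here.
model = Model.load("russian-ud.udpipe")
pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")

# Illustrative set of functional PoS tags treated as stop words.
FUNCTIONAL_POS = {"PUNCT", "CCONJ", "SCONJ", "ADP", "PART", "PRON", "DET", "AUX"}

def preprocess(text):
    """Return one list of "lemma_POS" tokens per sentence, stop words removed."""
    error = ProcessingError()
    conllu = pipeline.process(text, error)
    sentences = []
    for block in conllu.strip().split("\n\n"):
        tokens = []
        for line in block.split("\n"):
            if not line or line.startswith("#"):
                continue
            cols = line.split("\t")
            if len(cols) != 10 or not cols[0].isdigit():
                continue  # skip multi-word token ranges and empty nodes
            lemma, pos = cols[2].lower(), cols[3]
            if pos not in FUNCTIONAL_POS:
                tokens.append(f"{lemma}_{pos}")
        if tokens:
            sentences.append(tokens)
    return sentences

print(preprocess("Бабушка истопила печь."))
# e.g. [['бабушка_NOUN', 'истопить_VERB', 'печь_NOUN']]
```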

Some strong two-word collocations (bigrams), especially proper names, were merged into a single token joined by the special separator "::", for instance борис::гребенщиков_PROPN. For comparison, you can also download a model with no bigrams merged (ruwikiruscorpora-nobigrams_upos_skipgram_300_5_2018) and a model in which all bigrams belonging to productive types were merged regardless of their frequency (ruwikiruscorpora-superbigrams_skipgram_300_2_2018).
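
This kind of merging can be approximated with standard collocation-detection tools. The sketch below uses gensim's Phrases on a toy corpus (gensim 4.x assumed) and is only an illustration: our actual pipeline and the exact token format of the published models (a single PoS tag on the merged token) differ.

```python
# Illustration of frequency-based bigram merging with a "::" separator
# (gensim 4.x assumed; not the exact pipeline behind the published models).
from gensim.models.phrases import Phrases

# Toy corpus in which the bigram occurs often enough to be considered "strong".
sentences = [["борис_PROPN", "гребенщиков_PROPN", "спеть_VERB"]] * 10

bigrams = Phrases(sentences, min_count=1, threshold=0.1, delimiter="::")
print(bigrams[["борис_PROPN", "гребенщиков_PROPN", "спеть_VERB"]])
# Strong bigrams are merged into one token:
# ['борис_PROPN::гребенщиков_PROPN', 'спеть_VERB']
```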

Vectors in the models released from 2019 onwards are not unit-normed. ELMo models are trained on corpora with minimalist preprocessing: tokenization and (optionally) lemmatization with UDPipe; no stop words were removed, and PoS tags were not appended to words.

To work with the ELMo models locally, we recommend our easy-to-use Python library.
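
A minimal sketch of typical usage, assuming the simple_elmo package and a locally downloaded ELMo model (the path is a placeholder):

```python
# A minimal sketch assuming the simple_elmo package; the model path is a placeholder.
from simple_elmo import ElmoModel

model = ElmoModel()
model.load("path/to/elmo_model")  # downloaded ELMo archive or its unpacked directory

# Tokenized (and, for the lemma-based models, lemmatized) sentences.
sentences = [["бабушка", "истопить", "печь"], ["печь", "пирог", "вкусно"]]

# Per-token contextualized vectors: one matrix per sentence.
token_vectors = model.get_elmo_vectors(sentences)

# One averaged vector per sentence.
sentence_vectors = model.get_elmo_vector_average(sentences)
print(token_vectors.shape, sentence_vectors.shape)
```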

Evaluation

Evaluating the performance of distributional semantic models is a difficult problem in itself. The best way is arguably to test the models on the downstream tasks for which they will eventually be used (extrinsic evaluation). However, when one trains a general-purpose model (like the ones presented on this service), some intrinsic evaluation methods should be used to test the models' general ability to capture natural language regularities.

We evaluated our models using two well-known approaches:

  1. Spearman correlation between the pairwise word similarities produced by a model and those in a manually annotated test set. We used the RuSimLex965 test set, based on the Multilingual SimLex999 dataset (see our paper about RuSimLex965).
  2. Accuracy in solving analogy problems. For this, we used the Russian adaptation of the semantic part of the Google Analogies Dataset (translated by Tatyana Kononova; read more about the analogy task). A gensim-based sketch of both evaluations is shown after this list.
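
Below is a minimal sketch of running both evaluations with gensim's built-in helpers; the model and test set file names are placeholders, not the exact names of our published files.

```python
# A sketch of both intrinsic evaluations using gensim's built-in helpers.
# File names below are placeholders for locally saved, PoS-tagged test sets.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("model.bin", binary=True)

# 1. Spearman correlation with human similarity judgements (SimLex-style word pairs).
pearson, spearman, oov_ratio = wv.evaluate_word_pairs("ru_simlex965_tagged.tsv")
print("Spearman rho:", spearman[0])

# 2. Accuracy on semantic analogy questions (Google Analogies format).
accuracy, sections = wv.evaluate_word_analogies("ru_analogies_tagged.txt")
print("Analogy accuracy:", accuracy)
```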

ELMo models are evaluated on the word sense disambiguation (WSD) task, using the Human-Annotated Sense-Disambiguated Word Contexts for Russian test set (a.k.a. RUSSE'18). The final score is macro-F1 across all words in the test set. Read more about evaluating ELMo models in our paper. Additionally, we use a document pair classification task, with the ParaPhraser project corpus as the gold standard.

Download all our test sets (raw or PoS-tagged).

Who uses the RusVectōrēs models?

  1. drafterleo developed a poetic search engine.

  2. Boris Orekhov, associate professor at NRU HSE, created "vector rephrasings" of classic works of Russian literature.

  3. opennota developed a semantic search engine over Pushkin's poems.

  4. Sberbank AI uses our models as a baseline in ClassicAI, their shared task on poetry generation.

  5. Mikhail Kopotev's group at the University of Helsinki automatically extracts constructions based on semantic similarity in the CoCoCo: Colligations, Collocations & Constructions project.

  6. Olga Mitrofanova's group at Saint Petersburg State University uses the distributional semantic calculator to predict Russian collocations with a given set of features.

  7. ...

Have something to add to this list? Contact us!