Models

The following models are currently available (some of them can also be queried via our web interface):

| Persistent identifier | Download size | Corpus | Corpus size | Vocabulary size | Frequency threshold | Tagset | Algorithm | Vector size | Window size | SimLex965 (Spearman ρ) | Google Analogies (accuracy) | Creation date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| taiga_upos_skipgram_300_2_2018 | 331 MB | Taiga | nearly 5 billion words | 237 255 | 200 | Universal Tags | Continuous Skipgram | 300 | 2 | 0.41 | 0.53 | June 2018 |
| ruscorpora_upos_skipgram_300_5_2018 | 191 MB | RNC | 250 million words | 195 071 | 20 | Universal Tags | Continuous Skipgram | 300 | 5 | 0.37 | 0.58 | January 2018 |
| ruwikiruscorpora_upos_skipgram_300_2_2018 | 376 MB | RNC and Wikipedia (December 2017) | 600 million words | 384 764 | 40 | Universal Tags | Continuous Skipgram | 300 | 2 | 0.34 | 0.70 | January 2018 |
| news_upos_cbow_600_2_2018 | 547 MB | Russian news, September 2013 to November 2016 | nearly 5 billion words | 289 191 | 200 | Universal Tags | Continuous Bag-of-Words | 600 | 2 | 0.35 | 0.46 | January 2018 |
| araneum_upos_skipgram_300_2_2018 | 192 MB | Araneum | about 10 billion words | 196 620 | 400 | Universal Tags | Continuous Skipgram | 300 | 2 | 0.38 | 0.65 | January 2018 |
| araneum_none_fasttextcbow_300_5_2018 | 1 GB | Araneum | about 10 billion words | 195 782 | 400 | None | fastText CBOW (3- to 5-grams) | 300 | 5 | 0.35 | 0.73 | March 2018 |
| araneum_none_fasttextskipgram_300_5_2018 | 675 MB | Araneum | about 10 billion words | 195 782 | 400 | None | fastText Skipgram (3-grams) | 300 | 5 | 0.36 | 0.71 | January 2018 |
| ruwikiruscorpora-nobigrams_upos_skipgram_300_5_2018 | 385 MB | RNC and Wikipedia (December 2017, no bigram merging) | 600 million words | 394 332 | 40 | Universal Tags | Continuous Skipgram | 300 | 5 | 0.31 | 0.72 | January 2018 |
| ruwikiruscorpora-superbigrams_skipgram_300_2_2018 | 735 MB | RNC and Wikipedia (December 2017, unlimited bigram merging) | 600 million words | 746 695 | 40 | Universal Tags | Continuous Skipgram | 300 | 2 | 0.28 | 0.71 | January 2018 |
| ruscorpora_upos_skipgram_300_10_2017 | 200 MB | RNC | 250 million words | 184 973 | 10 | Universal Tags | Continuous Skipgram | 300 | 10 | 0.35 | 0.65 | January 2017 |
| ruwikiruscorpora_upos_cbow_300_20_2017 | 420 MB | RNC and Wikipedia (November 2016) | 600 million words | 392 339 | 15 | Universal Tags | Continuous Bag-of-Words | 300 | 20 | 0.33 | 0.70 | January 2017 |
| web_upos_cbow_300_20_2017 | 290 MB | Web corpus (December 2014) | 900 million words | 267 540 | 30 | Universal Tags | Continuous Bag-of-Words | 300 | 20 | 0.32 | 0.67 | January 2017 |
| news_upos_cbow_300_2_2017 | 130 MB | Russian news, September 2013 to November 2016 | nearly 5 billion words | 194 058 | 200 | Universal Tags | Continuous Bag-of-Words | 300 | 2 | 0.33 | 0.42 | February 2017 |
| araneum_upos_skipgram_600_2_2017 | 419 MB | Araneum | about 10 billion words | 196 465 | 400 | Universal Tags | Continuous Skipgram | 600 | 2 | 0.41 | 0.59 | June 2017 |
| ruscorpora_upos_skipgram_600_10_2017 | 371 MB | RNC | 250 million words | 173 816 | 10 | Universal Tags | Continuous Skipgram | 600 | 10 | 0.43 | 0.29 | June 2017 |
| ruscorpora_mystem_cbow_300_2_2015 | 303 MB | RNC | 107 million words | 281 776 | 3 | Mystem | Continuous Bag-of-Words | 300 | 2 | 0.38 | 0.27 | December 2015 |
| ruwikiruscorpora_mystem_cbow_500_2_2015 | 1 100 MB | RNC and Wikipedia (2015) | 280 million words | 604 043 | 5 | Mystem | Continuous Bag-of-Words | 500 | 2 | 0.38 | 0.38 | March 2015 |
| web_mystem_skipgram_500_2_2015 | 630 MB | Web corpus (December 2014) | 660 million words | 353 608 | 30 | Mystem | Continuous Skipgram | 500 | 2 | 0.31 | 0.52 | November 2015 |
| news_mystem_skipgram_1000_20_2015 | 525 MB | Russian news, September 2013 to October 2015 | 2.5 billion words | 147 358 | 200 | Mystem | Continuous Skipgram | 1000 | 20 | 0.32 | 0.58 | December 2015 |
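
A downloaded model can be queried locally, for example with gensim. The sketch below assumes the archive contains the vectors in the standard word2vec binary format under the name model.bin (both the format and the file name are assumptions here; the fastText models would instead need gensim's FastText loader):

```python
# A minimal sketch of querying one of the models with gensim.
# Assumption: the downloaded archive contains the vectors in the
# standard word2vec binary format as "model.bin".
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format(
    "ruwikiruscorpora_upos_skipgram_300_2_2018/model.bin", binary=True
)

# Tokens are lemmas with Universal PoS tags attached:
for word, similarity in model.most_similar("печь_NOUN", topn=5):
    print(f"{word}\t{similarity:.3f}")
```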

Corpora

  1. RNC: the full Russian National Corpus;
  2. Wikipedia: the Russian Wikipedia dump for the corresponding date;
  3. News: a news stream from 1 500 primarily Russian-language news sites (about 30 million documents in total in the recent models);
  4. Araneum Russicum Maximum: a large web corpus of Russian compiled by Vladimir Benko in 2016;
  5. Taiga: an open, structured Russian web corpus with morphological and syntactic annotation (poetry subcorpus excluded), compiled by Tatyana Shavrina;
  6. Web: random Russian web pages crawled in December 2014 (9 million documents in total).

Corpus preprocessing

Prior to training, all the corpora were tokenized, split into sentences, lemmatized, and PoS-tagged using UDPipe (in the case of Taiga, these steps were performed by the corpus creators). PoS tags follow the Universal PoS Tags format (for instance, "печь_NOUN"). The table for converting Mystem tags to Universal tags is available here. Stop words (conjunctions, pronouns, prepositions, particles, etc.) were removed.

Corpus preprocessing is described in detail in our tutorial (in Russian). You can also use ready-made Python scripts that reproduce our preprocessing workflow on arbitrary raw Russian text: a UDPipe variant and a Mystem variant.
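
As a rough illustration of that workflow (not the exact scripts linked above), here is a minimal sketch using the ufal.udpipe bindings. The UDPipe model file name is a placeholder, and the stop-word PoS list is an assumption based on the description above:

```python
# A minimal sketch of the lemma_POS preprocessing with ufal.udpipe.
# The model file name is a placeholder (any Russian UDPipe model will
# do), and the stop-word PoS list is an assumption; the ready-made
# scripts linked above are more thorough.
from ufal.udpipe import Model, Pipeline

STOP_POS = {"ADP", "AUX", "CCONJ", "DET", "PART", "PRON", "SCONJ", "PUNCT"}

udpipe_model = Model.load("russian-ud.udpipe")  # placeholder path
pipeline = Pipeline(udpipe_model, "tokenize",
                    Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")

def preprocess(text):
    """Turn raw Russian text into lemma_POS tokens, dropping stop words."""
    tokens = []
    for line in pipeline.process(text).splitlines():
        if not line or line.startswith("#"):
            continue  # skip CoNLL-U comments and sentence breaks
        columns = line.split("\t")  # CoNLL-U columns: ID FORM LEMMA UPOS ...
        if "-" in columns[0] or "." in columns[0]:
            continue  # skip multiword tokens and empty nodes
        lemma, upos = columns[2].lower(), columns[3]
        if upos not in STOP_POS:
            tokens.append(f"{lemma}_{upos}")
    return tokens

print(preprocess("Печь стояла в углу."))
# expected, roughly: ['печь_NOUN', 'стоять_VERB', 'угол_NOUN']
```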

Some strong two-word collocations (bigrams), especially proper names, were merged into a single token joined with the special delimiter "::", for instance борис::гребенщиков_PROPN. For comparison, you can also download a model with no bigrams merged (ruwikiruscorpora-nobigrams_upos_skipgram_300_5_2018) and a model in which all bigrams belonging to productive types were merged regardless of their frequency (ruwikiruscorpora-superbigrams_skipgram_300_2_2018).
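
This is not our exact procedure, but gensim's Phrases class (gensim 4 API) illustrates the underlying idea of merging statistically strong collocations. Note one difference: Phrases joins the full tokens, tags included, so its merges look slightly different from the tokens in our models:

```python
# Illustration of the idea, not the exact RusVectōrēs procedure.
# Phrases joins full tokens, so a merge here would yield
# 'борис_PROPN::гребенщиков_PROPN' rather than the models'
# 'борис::гребенщиков_PROPN'. The thresholds are arbitrary.
from gensim.models.phrases import Phrases

tagged_sentences = [
    ["борис_PROPN", "гребенщиков_PROPN", "спеть_VERB", "песня_NOUN"],
    # ... many more lemmatized, tagged sentences from the corpus
]
bigrams = Phrases(tagged_sentences, min_count=5, threshold=10.0, delimiter="::")
# Once the bigram statistics in the corpus are strong enough,
# applying the model merges the collocation into one token:
print(bigrams[tagged_sentences[0]])
```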

Evaluation

Evaluating the performance of distributional semantic models is a difficult problem in itself. The best way is arguably to test the models on the downstream tasks for which they will eventually be used (extrinsic evaluation). However, when one trains a general-purpose model (like the ones presented on this service), some intrinsic evaluation method should be used to test the models' general ability to capture natural-language regularities.

We evaluated our models using two well-known approaches:

  1. Spearman rank correlation between model-produced pairwise word similarities and human judgments from a manually annotated test set. We used the RuSimLex965 test set, based on the Multilingual SimLex999 dataset (see our paper about RuSimLex965).
  2. Accuracy in solving analogy problems. For this, we employed the Russian adaptation of the semantic part of the Google Analogies Dataset (translated by Tatyana Kononova; read more about the analogy task).
Download all our test sets (raw or PoS-tagged).
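
If you want to reproduce this kind of evaluation yourself, gensim provides ready-made routines for both tasks. In the sketch below, the test-set file names are placeholders, and the files are assumed to be in gensim's expected formats (tab-separated scored word pairs; Google questions-words-style analogies), with PoS-tagged lemmas matching the model's vocabulary:

```python
# A hedged sketch of both intrinsic evaluations with gensim.
# File names are placeholders; the word-pair file is assumed to be
# tab-separated (word1, word2, score), the analogy file to follow
# the Google questions-words format.
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format("model.bin", binary=True)

# Spearman correlation with human similarity judgments:
pearson, spearman, oov_ratio = model.evaluate_word_pairs("ru_simlex965_tagged.tsv")
print(f"SimLex Spearman: {spearman[0]:.2f}")

# Accuracy on the semantic analogy questions:
accuracy, sections = model.evaluate_word_analogies("ru_analogies_tagged.txt")
print(f"Analogy accuracy: {accuracy:.2f}")
```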

Who uses the RusVectōrēs models?

  1. drafterleo developed a poetic search engine.

  2. Boris Orekhov, associate professor at NRU HSE, created "vector rephrasings" of classic works of Russian literature.

  3. opennota developed a semantic search engine over Pushkin's poems.

  4. Sberbank AI employs our models as a baseline in ClassicAI, their shared task on poetry generation.

  5. ...

Have something to add to this list? Contact us!

RusVectōrēs Team

University of Oslo

Creative Commons License