RusVectōrēs 5: new models in 2019

RusVectōrēs, our web service demonstrating distributional semantic models for Russian, continues to develop! RusVectōrēs is a tool to play with distributional semantic models (a.k.a. word embeddings) right in your browser.

With our service, you can try out models trained on various Russian-language corpora. You can also download them to work locally on your machine. Based on our models, people organize shared tasks in poetic artificial intelligence and detect toxic online behaviour and attitude sentiment, to name just a few cases.

In January 2019, we released a set of new word2vec and fastText models for Russian. You can already download them (their identifiers have the "_2019" suffix), and they are used in our web interface. We updated the models trained on the following corpora:

  1. Russian National Corpus (RNC)

  2. RNC + Russian Wikipedia

  3. Taiga

  4. News corpus

Changelog

  1. All the corpora were processed from scratch ("Taiga" as well), including tokenization, lemmatization and PoS-tagging. In particular, the raw texts were cleaned using a derivative of Tatyana Shavrina's filters. As a result, we got rid of many incorrectly tokenized and lemmatized words (this issue arose after RusVectōrēs moved to UDPipe). In a few days we will update the tutorial accordingly.

  2. To make the models more comparable, we used the same maximum vocabulary size (250 thousand words) for all of them during training. This means a model can have fewer words in its vocabulary, but never more.

  3. The "RNC + Wikipedia" model now uses the updated Russian Wikipedia dump from December 2018.

  4. The news corpus became smaller (2.5 billion words), but more diverse: it now contains the "Taiga" news segment, the RNC newspaper corpus, Russian news pages crawled from the web, and the lenta.ru dataset. Unfortunately, this corpus is still not available for public release (but the model trained on it is).

  5. Merging of proper name sequences (for example, «Василий_PROPN Иванович_PROPN» is transformed into «Василий::Иванович_PROPN») has been made stricter. It now happens only if the proper names agree in case and number.

  6. Most RusVectōrēs models lack words belonging to functional parts of speech (prepositions, conjunctions, punctuation, etc.). However, for some NLP tasks (like neural sentiment analysis) it can be useful to have at least some representations for such entities. Thus, one can now download word embedding models trained on "Taiga" and "RNC + Wikipedia" that preserve function words (tayga-func_upos_skipgram_300_5_2019 and ruwikiruscorpora-func_upos_skipgram_300_5_2019).

  7. fastText models trained on "Taiga" (tayga_none_fasttextcbow_300_10_2019) and RNC (ruscorpora_none_fasttextskipgram_300_2_2019) are published. No more out-of-vocabulary words! :-)

  8. Our previous models stored word vectors unit-normed. For most NLP tasks, this is the most sensible form. In some cases, however, the original vector magnitudes still matter, and normed vectors cannot be transformed back into the original ones. Thus, starting from January 2019, RusVectōrēs models store the vectors without unit normalization by default. Unit normalization in Gensim, for example, is as easy as MODEL.init_sims(replace=True).

  9. New models are now delivered via the infrastructure of the Nordic Language Processing Laboratory. This should make downloading the models faster and more reliable, while also exposing them to a new audience. :-) In practice, it means that after clicking a link to a model, you download a zip archive named with a unique integer identifier (for example, '180.zip'). The archive always contains a meta.json file, which provides structured and standardized information about the model and its training corpus. word2vec models are stored in the archives in two formats at once: binary model.bin (for fast loading in your code) and text-based model.txt (for human inspection). These files can be loaded and processed directly from inside the archives, without explicit decompression. fastText models must first be extracted from the zip archive; afterwards, they can be loaded into Gensim with "gensim.models.KeyedVectors.load('model.model')".

  10. In the coming weeks we expect to release contextualized word embedding models for Russian, based on ELMo.
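To illustrate the stricter proper-name merging from changelog item 5, here is a minimal sketch. It is not the actual RusVectōrēs preprocessing code: the Token structure, the merge_proper_names function and the feature values ("Nom", "Sing", etc.) are assumptions made for this example. It only shows the rule that adjacent PROPN tokens are joined with "::" when they agree in case and number.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    """A lemmatized, tagged token (hypothetical structure for this sketch)."""
    lemma: str
    upos: str
    case: Optional[str] = None
    number: Optional[str] = None


def merge_proper_names(tokens: List[Token]) -> List[str]:
    """Join runs of adjacent PROPN tokens that agree in case and number."""
    merged: List[str] = []
    run: List[Token] = []  # current run of agreeing proper names

    def flush() -> None:
        # Emit the pending run as a single "A::B_PROPN" entry.
        if run:
            merged.append("::".join(t.lemma for t in run) + "_PROPN")
            run.clear()

    for tok in tokens:
        if tok.upos != "PROPN":
            flush()
            merged.append(f"{tok.lemma}_{tok.upos}")
            continue
        # A proper name that disagrees with the previous one starts a new run.
        if run and (tok.case, tok.number) != (run[-1].case, run[-1].number):
            flush()
        run.append(tok)
    flush()
    return merged


# Made-up input: two names agreeing in case and number, a verb, one more name.
agreeing = [
    Token("Василий", "PROPN", "Nom", "Sing"),
    Token("Иванович", "PROPN", "Nom", "Sing"),
    Token("знать", "VERB"),
    Token("Пётр", "PROPN", "Acc", "Sing"),
]
print(merge_proper_names(agreeing))

# Made-up input: adjacent names that differ in case are left separate.
disagreeing = [
    Token("Иван", "PROPN", "Nom", "Sing"),
    Token("Москва", "PROPN", "Gen", "Sing"),
]
print(merge_proper_names(disagreeing))
```

The first call yields a single «Василий::Иванович_PROPN» entry, while the second keeps «Иван_PROPN» and «Москва_PROPN» apart because their cases differ.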
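The point of changelog item 8, that unit normalization discards magnitude information and cannot be undone, can be seen in a few lines of plain Python. This is only an illustration with made-up two-dimensional vectors; the unit_norm helper is hypothetical and is not part of Gensim or RusVectōrēs.

```python
import math
from typing import List, Sequence


def unit_norm(vector: Sequence[float]) -> List[float]:
    """Scale a vector to Euclidean length 1 (zero vectors are returned unchanged)."""
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0.0:
        return list(vector)
    return [x / length for x in vector]


# Two made-up vectors with the same direction but different magnitudes.
short_vec = [3.0, 4.0]     # length 5
long_vec = [30.0, 40.0]    # length 50

# Both collapse to the same unit vector, so the original lengths are lost.
print(unit_norm(short_vec))
print(unit_norm(long_vec))
```

Since both inputs map to the same result, nothing in a normed vector tells you which magnitude it started from. That is why storing the raw vectors and normalizing on demand is the more flexible default.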
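As changelog item 9 notes, the files can be read from inside the archive without unpacking it. Below is a sketch using only the Python standard library. It assumes that model.txt uses the usual word2vec text layout (a header line with the vocabulary size and the dimensionality, then one word and its values per line). The read_model_archive function and its limit parameter are hypothetical names introduced for this example.

```python
import io
import json
import zipfile
from typing import Dict, List, Optional, Tuple


def read_model_archive(
    path: str, limit: Optional[int] = None
) -> Tuple[dict, Dict[str, List[float]]]:
    """Read meta.json and (optionally only the first `limit`) vectors from model.txt."""
    vectors: Dict[str, List[float]] = {}
    with zipfile.ZipFile(path, "r") as archive:
        # meta.json describes the model and its training corpus.
        with archive.open("meta.json") as meta_file:
            meta = json.load(meta_file)

        # Stream model.txt straight from the archive, decoding it as UTF-8.
        with archive.open("model.txt") as raw:
            lines = io.TextIOWrapper(raw, encoding="utf-8")
            header = lines.readline().split()
            dim = int(header[1])  # assumed header layout: "<vocab_size> <dim>"
            for line in lines:
                if limit is not None and len(vectors) >= limit:
                    break
                parts = line.rstrip().split(" ")
                if len(parts) != dim + 1:
                    continue  # skip malformed lines
                vectors[parts[0]] = [float(x) for x in parts[1:]]
    return meta, vectors
```

A call such as read_model_archive('180.zip', limit=1000) would return the metadata plus the first thousand entries, which is handy for a quick look at a model. For real work, Gensim's KeyedVectors.load_word2vec_format should also accept an already opened file object, so the stream returned by archive.open('model.bin') can be passed to it with binary=True.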

We have many plans for further improving RusVectōrēs. Subscribe to our Telegram channel and to our RSS feed and stay tuned!

RusVectōrēs team:
Andrey Kutuzov (University of Oslo)
Elizaveta Kuzmenko (University of Trento)