RusVectōrēs 4: summing up the year 2017

RusVectōrēs, our web service demonstrating distributional semantic models for Russian, continues to develop! We are releasing RusVectōrēs 4 and are happy to share the news with you! This release serves as a tribute to Andrey Zaliznyak: without him, none of this would have been possible.

As you may remember, RusVectōrēs is a tool to play with distributional semantic models (a.k.a. word embeddings) right in your browser. Frameworks and algorithms for training word embedding models (word2vec, GloVe, fastText and others) have revolutionized NLP. These approaches enable computers to approximate the meaning of words from their co-occurrence statistics in large text corpora.
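To make the co-occurrence idea concrete, here is a minimal sketch in plain Python. It is a deliberate simplification: word2vec, GloVe and fastText learn dense vectors rather than raw counts, and the toy corpus and function names below are made up for illustration. Words that appear in similar contexts end up with similar count vectors:

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """For each word, count the words occurring within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus, invented for this example only.
corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases the mouse".split(),
    "the dog chases the cat".split(),
    "we train the model".split(),
]

vectors = cooccurrence_vectors(corpus, window=2)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_model = cosine(vectors["cat"], vectors["model"])
print(sim_cat_dog > sim_cat_model)
```

In this toy corpus «cat» and «dog» share contexts such as «drinks» and «chases», so their vectors are closer to each other than either is to «model».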

With our service, you can try out models trained on various corpora for the Russian language. You can also download them to work locally on your machine. Based on our models, people implement «vector rephrasings» of classic works of Russian literature and semantic search engines over Pushkin's poems... the sky is the limit!

Well, here's the RusVectōrēs 4 news:

  1. Until recently, we lemmatized and PoS-tagged our training corpora using Mystem. We have now moved to UDPipe, which directly gives us tags that follow the Universal PoS Tags standard. It also means that our models now include words with the PROPN tag (proper noun), allowing more fine-grained queries.
  2. All models are now named according to a common convention, which makes it easy to read off a model's hyperparameters. For instance, ruwikiruscorpora_upos_skipgram_300_2_2018 means «a model trained on the concatenation of the Russian National Corpus and Russian Wikipedia, with Universal PoS tags, using the word2vec Continuous Skipgram algorithm with vector size 300 and window size 2, trained in 2018».
  3. For the first time, we present a model trained using the fastText algorithm, with character trigrams (on the Araneum Maximum corpus). This model uses subword information, so even for an out-of-vocabulary word it can infer an embedding from the character n-grams that make up the word. This model is trained on lemmas without PoS tags.
  4. For all other models (those with PoS tags), RusVectōrēs now saves the user's choice of part-of-speech filters for the search query.
  5. Calculating semantic similarity between word pairs has moved to the Miscellaneous tab. Additionally, this operation now features a query history: you can experiment with word similarities while still seeing the previous values.
  6. In all the tabs, word color now reflects word frequency in the corpora. Words in green are high-frequency, while words in red are low-frequency. In many cases, this can be important when looking for the nearest semantic neighbors of a word.
  7. Most of our models are trained on corpora in which proper names immediately following each other are merged (for instance, «Василий_PROPN Иванович_PROPN» becomes «Василий::Иванович_PROPN»). Two-word collocations with very high pointwise mutual information (PMI) values are also merged. However, we understand that for some applications these merges can be redundant. That's why we now additionally offer a model trained on the RNC and Wikipedia without bigram merging (ruwikiruscorpora-nobigrams_upos_skipgram_300_5_2018). Conversely, for those who want more bigrams, there is a «super-bigram» model, in which all possible word pairs of productive types were merged before training (ruwikiruscorpora-superbigrams_skipgram_300_2_2018).
  8. Our WebVectors framework, on which RusVectōrēs is based, has been substantially refactored:
    • It now fully supports both Python 2 and Python 3. In the future, we plan to remain compatible with both Python versions.
    • Data exchange between the frontend and the backend has been substantially rewritten and upgraded.
    • Installation and configuration of the framework are now much easier; all the settings are now in the configuration files.
    • The WebVectors manual has been rewritten to reflect all the recent developments.
  9. In the most recent version of the well-known Gensim distributional semantics library, you can now load and use one of our RNC models (ruscorpora_upos_skipgram_300_10_2017) with a single command. More on this here. In the near future, we will add support for loading our other models.
  10. Many small improvements and fixes.
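
The naming convention described in item 2 can be unpacked programmatically. The following is a minimal sketch, not part of RusVectōrēs or WebVectors: `ModelName` and `parse_model_name` are hypothetical names, and the code assumes that fields are separated by underscores and that the tagset field is optional (it is absent from the «super-bigram» model name in item 7).

```python
from typing import NamedTuple, Optional


class ModelName(NamedTuple):
    corpus: str            # e.g. "ruwikiruscorpora" or "ruwikiruscorpora-nobigrams"
    tagset: Optional[str]  # e.g. "upos"; None when the name has no tagset field
    algorithm: str         # e.g. "skipgram"
    vector_size: int
    window: int
    year: int


def parse_model_name(name: str) -> ModelName:
    """Split a model identifier into its fields (assumed layout, see lead-in)."""
    parts = name.split("_")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected number of fields in model name: {name!r}")
    corpus = parts[0]
    # The tagset is present only in six-field names.
    tagset = parts[1] if len(parts) == 6 else None
    # The last four fields are read from the right, so the optional
    # tagset does not shift them.
    algorithm, size, window, year = parts[-4:]
    return ModelName(corpus, tagset, algorithm, int(size), int(window), int(year))


info = parse_model_name("ruwikiruscorpora_upos_skipgram_300_2_2018")
print(info.corpus, info.tagset, info.algorithm, info.vector_size, info.window, info.year)
```

Reading the numeric fields from the right keeps the sketch working for names both with and without a tagset.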

We have many plans for improving RusVectōrēs. Subscribe to our RSS feed and stay tuned!

RusVectōrēs team:
Andrey Kutuzov (University of Oslo)
Elizaveta Kuzmenko (University of Groningen)