The RusVectōrēs service computes semantic relations between Russian words and provides pre-trained distributional semantic models (word embeddings), including contextualized ones. It is named after RusCorpora, the site of the Russian National Corpus: they provide access to corpora, and we provide access to semantic vectors (vectōrēs in Latin). These vectors reflect meaning based on the distribution of word co-occurrences in the training corpora (huge amounts of raw linguistic data).

In distributional semantics, words are usually represented as vectors in a multi-dimensional space of their contexts. Semantic similarity between two words is then calculated as the cosine similarity between their corresponding vectors; it takes values between -1 and 1 (usually only values above 0 are used in practical tasks). A value of 0 roughly means that the words lack similar contexts, and thus their meanings are unrelated to each other. A value of 1 means that the words' contexts are absolutely identical, and thus their meanings are very similar.
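As a minimal sketch of this measure, the snippet below computes cosine similarity in plain Python. The toy vectors and the function name `cosine_similarity` are illustrative assumptions; they are not taken from any RusVectōrēs model or from the service's own code.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors, made up for illustration only.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
```

Vectors pointing in the same direction score close to 1, orthogonal vectors score 0, and opposite vectors score close to -1, which matches the range described above.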

Distributional semantics is under the hood of almost all contemporary natural language understanding systems. As a rule, so-called predictive models are employed, which learn high-quality dense vectors representing word meaning (embeddings). These models are often trained using shallow artificial neural networks on the task of predicting the next word in a sequence, that is, language modeling. One of the first and arguably the most well-known tools in this field is word2vec, but new models and algorithms are published regularly.

Unfortunately, training word embedding models on large corpora can be computationally expensive. That's why it is important to provide access to pre-trained models to the Russian linguistic community. We feature ready-made models trained on several Russian corpora, and a convenient web interface to query them. You can also download the models to process them on your own. Moreover, our web service features a bunch of (hopefully) useful visualizations of semantic relations between words. In general, the reason behind RusVectōrēs is to lower the entry threshold for those who want to work in this new and exciting field.

What can RusVectōrēs do?

RusVectōrēs is basically a tool to explore relations between words in distributional models. You can think of it as a kind of 'semantic calculator'. A user can choose one or several carefully prepared models trained on different corpora.

After choosing a model, it is possible to:

  1. calculate semantic similarity between pairs of words;
  2. find words semantically closest to the query word (optionally with part-of-speech and frequency filters);
  3. perform analogical inference: find a word X which is related to the word Y in the same way as the word A is related to the word B;
  4. apply simple algebraic operations to word vectors (addition, subtraction, finding average vector for a group of words and distances to this average value);
  5. draw semantic maps of relations between input words (it is useful to explore clusters and oppositions, or to test your hypotheses about them);
  6. get the raw vectors (arrays of real values) and their visualizations for words in the chosen model: just click on any word anywhere, or use a direct URI to the word of interest, as described below;
  7. generate context-dependent lexical substitutes from contextualized embedding architectures like ELMo;
  8. download the model.
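The nearest-neighbour, analogy and averaging operations from the list above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the four-dimensional toy vectors, the English toy vocabulary and the helper names (`nearest`, `analogy`, `average_vector`) are all hypothetical, and the code does not reproduce how the service itself is implemented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest(vectors, query_vec, exclude=(), topn=3):
    """Rank vocabulary words by cosine similarity to query_vec."""
    scored = [(w, cosine(vec, query_vec))
              for w, vec in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


def analogy(vectors, a, b, y, topn=1):
    """Find X related to Y as A is related to B, using vec(A) - vec(B) + vec(Y)."""
    target = [va - vb + vy
              for va, vb, vy in zip(vectors[a], vectors[b], vectors[y])]
    return nearest(vectors, target, exclude={a, b, y}, topn=topn)


def average_vector(vectors, words):
    """Component-wise mean of the vectors for a group of words."""
    columns = zip(*(vectors[w] for w in words))
    return [sum(col) / len(words) for col in columns]


# Toy vocabulary (hypothetical); dimensions loosely stand for
# royalty, maleness, femaleness and fruit.
toy = {
    "king": [1.0, 1.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "man": [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}

neighbours = nearest(toy, toy["king"], exclude={"king"}, topn=2)
answer = analogy(toy, "king", "man", "woman")
centre = average_vector(toy, ["king", "queen"])
```

With these toy vectors, `answer` ranks "queen" first, because the offset between "king" and "man" added to "woman" lands exactly on the "queen" vector. Real models are noisier, so the query words are excluded from the candidates, as is commonly done for analogy queries.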

In the spirit of the Semantic Web, each word in each model has its own unique URI explicitly stating the lemma, model and part of speech (for example, алгоритм_NOUN/). Web pages at these URIs contain lists of the nearest semantic associates for the corresponding word, belonging to the same part of speech as the word itself. Other information about the word is also shown.

We also provide a simple API to get the list of semantic associates for a given word in a given model (one of those available via the web interface). There are two possible formats: json and csv. Perform GET requests to URLs following the pattern where MODEL is the identifier for the chosen model, WORD is the query word and FORMAT is "csv" or "json", depending on the output format you need. We will return a json file or a tab-separated text file with the first 10 associates.
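A client for such an API might look like the sketch below. Everything specific in it is an assumption: the URL template is passed in by the caller and must be taken from the service itself, the `example.org` template is a placeholder and not the real pattern, and the parser assumes each data row of the tab-separated output is a word followed by a similarity value.

```python
import csv
import io
import urllib.parse
import urllib.request


def build_url(template, model, word, fmt):
    """Fill a caller-supplied URL template with MODEL, WORD and FORMAT."""
    if fmt not in ("csv", "json"):
        raise ValueError("FORMAT must be 'csv' or 'json'")
    return template.format(
        model=urllib.parse.quote(model),
        word=urllib.parse.quote(word),  # percent-encodes Cyrillic as UTF-8
        fmt=fmt,
    )


def parse_tsv_associates(text):
    """Parse tab-separated rows, assuming 'word<TAB>similarity' per data row."""
    rows = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank lines and single-column header lines
        try:
            rows.append((row[0], float(row[1])))
        except ValueError:
            continue  # skip rows whose second column is not a number
    return rows


def fetch_associates(template, model, word):
    """Send the GET request and parse the tab-separated reply."""
    url = build_url(template, model, word, "csv")
    with urllib.request.urlopen(url, timeout=10) as response:
        return parse_tsv_associates(response.read().decode("utf-8"))


# Hypothetical placeholder template, NOT the real RusVectōrēs pattern.
HYPOTHETICAL_TEMPLATE = "https://example.org/{model}/{word}/{fmt}/"
example_url = build_url(HYPOTHETICAL_TEMPLATE, "model_x", "word_NOUN", "json")

# Made-up reply in the assumed two-column layout.
sample_reply = "# header line\nслово_NOUN\t0.75\nтекст_NOUN\t0.5\n"
parsed = parse_tsv_associates(sample_reply)
```

Keeping URL construction and parsing separate from the network call makes it possible to check both against a saved reply before querying the live service.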

Additionally, you can get semantic similarities for word pairs in any of the provided models via queries of the following format: (note the two underscore signs).

We recommend experimenting with algebraic operations on vectors, as they return interesting results. For example, the model trained on the Russian National Corpus returns быт ("everyday routine") if we subtract любовь ("love") from жизнь ("life").

Naturally, one can compare results from different models on one screen.

Apart from the web interface, our service also features a bot in the Telegram messenger. You can ask this bot questions while commuting to your office or university, and it will send a query to the API. This can be convenient when you want to check an idea but have no laptop at hand. You can also follow us via our Telegram channel.

We would like RusVectōrēs to become a hub of scholarly knowledge about word embedding models for Russian. That is why there is a section with published academic papers and links to other relevant resources. At the same time, we hope that RusVectōrēs will also popularize distributional semantics and computational linguistics, making them more understandable and attractive to the Russian-speaking public.



Selected papers on distributional semantics

  1. Turney, P. D., P. Pantel. From frequency to meaning: Vector space models of semantics. Journal of artificial intelligence research, 37(1), 141-188. (2010)
  2. Mikolov, T., et al. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
  3. Mikolov, Tomas, et al. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168 (2013).
  4. Baroni, Marco, et al. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Vol. 1. (2014)
  5. Pennington, J., et al. GloVe: Global Vectors for Word Representation. EMNLP. Vol. 14. (2014)
  6. Le, Quoc., and Mikolov, Tomas. Distributed representations of sentences and documents. Proceedings of the 31st International Conference on Machine Learning (2014).
  7. Kutuzov, Andrey and Kuzmenko, Elizaveta. Comparing Neural Lexical Models of a Classic National Corpus and a Web Corpus: The Case for Russian. A. Gelbukh (Ed.): CICLing 2015, Part I, Springer LNCS 9041, pp. 47–58, 2015. DOI: 10.1007/978-3-319-18111-0_4
  8. Bartunov, Sergey, et al. Breaking Sticks and Ambiguities with Adaptive Skip-gram. arXiv preprint arXiv:1502.07257 (2015)
  9. Levy, O., Goldberg, Y., and Dagan, I. Improving Distributional Similarity with Lessons Learned from Word Embeddings. TACL (2015)
  10. Rong, Xin. word2vec Parameter Learning Explained. arXiv preprint arXiv:1411.2738 (2015)
  11. Panchenko A., et al. RUSSE: The First Workshop on Russian Semantic Similarity. Proceedings of the Dialogue 2015 conference, Moscow, Russia (2015)
  12. Kutuzov, Andrey and Andreev, Igor. Texts in, meaning out: neural language models in semantic similarity task for Russian. Proceedings of the Dialogue 2015 conference, Moscow, Russia (2015)
  13. Arefyev N.V., et al. Evaluating three corpus-based semantic similarity systems for Russian. Proceedings of the Dialogue 2015 conference, Moscow, Russia (2015)
  14. Lopukhin K.A., et al. The impact of different vector space models and supplementary techniques in Russian semantic similarity task. Proceedings of the Dialogue 2015 conference, Moscow, Russia (2015)
  15. Hamilton, W. L., et al. Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (2016).
  16. Sahlgren, M., and Lenci, A. The Effects of Data Size and Frequency Range on Distributional Semantic Models. Proceedings of EMNLP. (2016)
  17. Bojanowski, P., et al. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, Volume 5, Issue 1 (2017).
  18. Peters, M., et al. Deep Contextualized Word Representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2018).

Andrey Kutuzov's talk "Distributional semantic models and their applications" (workshop at the Institute for Systems Analysis of the Russian Academy of Sciences, 3 March 2017), in Russian.

Selected publications citing RusVectōrēs

  1. Bolshina, A., Loukachevitch N. Automatic Labelling of Genre-Specific Collections for Word Sense Disambiguation in Russian. Proceedings of 18th Russian Conference on Artificial Intelligence (2020)
  2. Gudkov, V., et al. Russian Prepositional Phrase Semantic Labeling with Word Embedding-Based Classifier. Proceedings of the III International Conference on Language Engineering and Applied Linguistics (2019)
  3. Larionov, D., et al. Semantic Role Labeling with Pretrained Language Models for Known and Unknown Predicates. Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
  4. Bogolyubova, O., et al. The Language of Positive Mental Health: Findings from a Sample of Russian Facebook Users. SAGE Open 10, no. 2 (2020)
  5. Loukachevitch, N., and Rusnachenko, N. Distant Supervision for Sentiment Attitude Extraction. Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
  6. Zheng, X., et al. Semantic Role Labeling for Russian Language Based on Ensemble Model. IEEE 8th Joint International Information Technology and Artificial Intelligence Conference (2019)
  7. Mrkšić, N. et al. Semantic Specialisation of Distributional Word Vector Spaces using Monolingual and Cross-Lingual Constraints. Transactions of the Association for Computational Linguistics (TACL) (2018)
  8. Bogolyubova, O., et al. Dark Personalities on Facebook: Harmful Online Behaviors and Language. Computers in Human Behavior, Volume 78 (2018)
  9. Antropova, O., et al. Cleaning Up After a Party: Post-processing Thesaurus Crowdsourced Data. Conference on Artificial Intelligence and Natural Language. Springer, Cham (2018)
  10. Loukachevitch, N., Rusnachenko, N. Extracting Sentiment Attitudes from Analytical Texts. Dialogue conference (2018)
  11. Rusnachenko, N., Loukachevitch, N. Sentiment Attitudes and Their Extraction from Analytical Texts. International Workshop on Temporal, Spatial, and Spatio-Temporal Data Mining. Springer, Cham (2018)
  12. Enikeeva, E., Popov, A. Developing a Russian Database of Regular Semantic Relations Based on Word Embeddings. The XVIII EURALEX International Congress (2018).
  13. Pronoza, E., et al. Extraction of Typical Client Requests from Bank Chat Logs. Mexican International Conference on Artificial Intelligence. Springer, Cham (2018)
  14. Ermilov, A., et al. Stierlitz Meets SVM: Humor Detection in Russian. Conference on Artificial Intelligence and Natural Language. Springer, Cham (2018)
  15. Badryzlova, Y., Panicheva, P. A Multi-feature Classifier for Verbal Metaphor Identification in Russian Texts. Conference on Artificial Intelligence and Natural Language. Springer, Cham (2018)
  16. Karyaeva, M., et al. Extraction of Hypernyms from Dictionaries with a Little Help from Word Embeddings. International Conference on Analysis of Images, Social Networks and Texts. Springer, Cham (2018)
  17. Sboev, A., et al. Automatic gender identification of author of Russian text by machine learning and neural net algorithms in case of gender deception. Procedia Computer Science 123 (2018)
  18. Panicheva, P., Badryzlova, Yu. Distributional semantic features in Russian verbal metaphor identification. Komp'juternaja Lingvistika i Intellektual'nye Tehnologii, Volume 1, Issue 16 (2017)
  19. Trofimov, I., Suleymanova, E. A syntax-based distributional model for discriminating between semantic similarity and association. Komp'juternaja Lingvistika i Intellektual'nye Tehnologii, Volume 1, Issue 16 (2017)
  20. Shelmanov, A., Devyatkin, D. Semantic role labeling with neural networks for texts in Russian. Komp'juternaja Lingvistika i Intellektual'nye Tehnologii, Volume 1, Issue 16 (2017)
  21. Enikeeva I., Mitrofanova, O. Russian Collocation extraction based on word embeddings. Komp'juternaja Lingvistika i Intellektual'nye Tehnologii, Volume 1, Issue 16 (2017)
  22. Bolotova, V., et al. Which IR model has a better sense of humor? Search over a large collection of jokes. Komp'juternaja Lingvistika i Intellektual'nye Tehnologii, Volume 1, Issue 16 (2017)
  23. Anh, L., et al. Application of a Hybrid Bi-LSTM-CRF model to the task of Russian Named Entity Recognition. Conference on Artificial Intelligence and Natural Language. Springer, Cham (2017)
  24. Kuznetsov, I. Automatic semantic role labeling in Russian. Dissertation for the degree of Candidate of Philological Sciences (2016) (in Russian)
  25. Kirillov, A. N., Krizhanovsky, A. A. A model of the geometric structure of a synset. Series "Mathematical Modeling and Information Technologies", Issue 08, pp. 45-54, 2016 (in Russian)
  26. Kalimoldayev, M., et al. The application of the connectionist method of semantic similarity for Kazakh language. 2015 Twelfth International Conference on Electronics Computer and Computation (ICECCO), pp. 1-3. IEEE (2015)
  27. Kopotev, M., Pivovarova, L., & Kormacheva, D. Constructional generalization over Russian collocations. Memoires de la Societe neophilologique de Helsinki, 2016
  28. ...

Citing us

If you use RusVectōrēs, please cite this paper:

Kutuzov A., Kuzmenko E. (2017) WebVectors: A Toolkit for Building Web Interfaces for Vector Semantic Models. In: Ignatov D. et al. (eds) Analysis of Images, Social Networks and Texts. AIST 2016. Communications in Computer and Information Science, vol 661. Springer, Cham