RusVectōrēs: word embeddings for Russian online
‘You shall know a word by the company it keeps.’ (Firth 1957)
Enter a word to produce a list of its 10 nearest semantic associates.
The model trained on the Russian Wikipedia and the Russian National Corpus will be used; for other models, visit the Similar Words tab.
- 11/05/2018 — We have published a tutorial explaining how text preprocessing is done, how to perform basic operations on word embeddings, and how to use the RusVectōrēs API.
- 26/03/2018 — RusVectōrēs models took top rankings in the RUSSE'18 word sense induction shared task. Also check the updated fastText model trained on the Araneum corpus: it now uses not only 3-grams but also 4-grams and 5-grams.
- 05/01/2018 — fastText models and proper names in the PoS tags: learn about the new features we introduced in 2017.
- 30/06/2017 — We added a model trained on Araneum Russicum Maximum, one of the largest Russian web corpora (more than 10 billion words). In addition, all the models have been re-evaluated on the RuSimLex965 semantic similarity test set (it is more consistent than RuSimLex999).
- 30/06/2017 — Visualizations have been substantially upgraded. You can now use several word sets as input: they will be labeled with different colors in the visualizations. If there is only one set, the colors will correspond to the words’ parts of speech. Additionally, it is now possible to visualize your data in the TensorFlow Embedding Projector with one mouse click.
- 09/03/2017 — We present a separate models page where you can download both recent and archived models and compare them against each other. We also added links to the Russian test sets and to the conversion table from Mystem to the Universal PoS tags.
- 02/02/2017 — Learn about the new features we introduced in 2016 and watch a new screencast video on working with RusVectōrēs.
- 02/02/2017 — Major update of the models: the news corpus now covers events up to November 2016, and the Wikipedia dump is updated to the same date. Additionally, PoS tags have been converted to the Universal Parts of Speech standard, and the models’ vocabularies now feature multi-word entities (bigrams).
- 18/11/2016 — The API now allows getting the semantic similarity of word pairs. Query format: http://rusvectores.org/MODEL/WORD1__WORD2/api/similarity/
- 22/10/2016 — Hints now appear as you type a query. Note that some words are not covered by hints (though these are rare and unusual)!
- 01/07/2016 — For security reasons, the option to automatically train models on user-supplied corpora is now disabled. However, if you have an interesting corpus, contact us, and we will be glad to train a model for you.
- 07/04/2016 — RusVectōrēs source code is released on GitHub as the WebVectors framework.
- 04/04/2016 — The API now provides output in JSON format. Example query: http://rusvectores.org/news/удар/api/json/
- 15/03/2016 — A web service with distributional models for English and Norwegian has been launched, based on the RusVectōrēs engine.
- 03/02/2016 — We fixed a bug that broke the training of user models.
- 22/12/2015 — RusVectōrēs 2.0: Christmas Edition is officially released.
- 16/12/2015 — The news model has been updated; it is now trained on texts up to November 2015.
- 15/12/2015 — In Similar Words, one can now filter results by the part of speech of the query.
- 11/12/2015 — The API is implemented. It outputs the 10 nearest neighbors for a given word and model, in two possible formats: JSON and CSV. Examples: http://rusvectores.org/news/удар/api/json/ or http://rusvectores.org/news/удар/api/csv/
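The query formats documented in the news items above can be assembled programmatically. Below is a minimal sketch in Python that only builds the URLs (it does not perform the HTTP request); the helper names `neighbors_url` and `similarity_url` are our own, not part of the service.

```python
from urllib.parse import quote

BASE = "http://rusvectores.org"

def neighbors_url(model: str, word: str, fmt: str = "json") -> str:
    """Nearest-neighbors query, per the documented format:
    http://rusvectores.org/MODEL/WORD/api/json/ (or /api/csv/)."""
    return f"{BASE}/{model}/{quote(word)}/api/{fmt}/"

def similarity_url(model: str, word1: str, word2: str) -> str:
    """Word-pair similarity query, per the documented format:
    http://rusvectores.org/MODEL/WORD1__WORD2/api/similarity/."""
    return f"{BASE}/{model}/{quote(word1)}__{quote(word2)}/api/similarity/"

# The example query from the news item above; Cyrillic words are
# percent-encoded by quote(), which HTTP clients require anyway.
print(neighbors_url("news", "удар"))
```

The resulting URLs can then be fetched with any HTTP client; for the JSON endpoint, the response parses with a standard JSON library.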
Words in green are high-frequency (corpus ratio above 0.00001); words in red are low-frequency (corpus ratio below 0.0000005).
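The color legend above amounts to two thresholds on a word's corpus ratio. As a sketch, it could be expressed like this (the function name `frequency_tier` and the `"plain"` label for the middle band are our own illustration, not part of the service):

```python
def frequency_tier(corpus_ratio: float) -> str:
    """Map a word's corpus ratio to the legend's color tiers."""
    if corpus_ratio > 0.00001:      # high-frequency: shown in green
        return "green"
    if corpus_ratio < 0.0000005:    # low-frequency: shown in red
        return "red"
    return "plain"                  # in between: no special color

print(frequency_tier(0.0001))  # well above the upper threshold
```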