RusVectōrēs: a yearly report

RusVectōrēs, our web service demonstrating distributional semantic models for Russian, continues to develop! During the past year we added lots of new features, and we are happy to share the news with you.

As you may remember, RusVectōrēs is a tool to play with distributional semantic models (a.k.a. word embeddings) right in your browser. Frameworks and algorithms for training word embedding models (word2vec, GloVe, fastText and others) have revolutionized NLP. These approaches enable computers to approximate the meaning of words from their co-occurrence statistics calculated on large text corpora.
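To make the idea of "meaning from co-occurrence statistics" more concrete, here is a minimal sketch in Python. It is not how word2vec, GloVe or fastText work internally, and it is not the code behind RusVectōrēs: it simply counts neighbours in a small window and compares the resulting count vectors with cosine similarity. The toy corpus, the window size and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus; real models are trained on millions of sentences.
toy_corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases the dog".split(),
    "stocks fell on the market".split(),
]

vectors = cooccurrence_vectors(toy_corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_market = cosine(vectors["cat"], vectors["market"])
print(f"cat ~ dog:    {sim_cat_dog:.3f}")
print(f"cat ~ market: {sim_cat_market:.3f}")
```

Even on this tiny corpus, "cat" ends up closer to "dog" than to "market", because the first two words share more contexts. Trained embedding models follow the same intuition, but learn dense vectors instead of raw counts.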

With our service, you can try out models trained on various corpora for the Russian language. You can also download them to work locally on your machine. To walk you through our web service, we prepared a screencast with a brief explanation of the main features (in Russian).

RusVectōrēs can be used to explore the possibilities of distributional semantics, to promptly check linguistic hypotheses, or as a classroom tool.

In 2016, the service improved in the following ways:

Our new address is https://rusvectores.org/en. The old address, http://ling.go.mail.ru/dsm, still works, but we recommend using the new one.
The models were retrained on updated corpora: the news corpus now includes texts up to November 2016, the Wikipedia dump was updated to the same date, and the texts from the Russian National Corpus were extracted more completely.
All corpora have been processed with a language identifier, so stray sentences in Ukrainian, Belarusian or Kazakh have been filtered out.
Previously, our models used part-of-speech tags following the Mystem/RNC standard. To facilitate multilingual comparison of results, we now use Universal PoS tags. Thus, «модель_S» has become «модель_NOUN». You can still enter a query without any PoS tag at all: RusVectōrēs will detect it automatically.
Two-word entities with high collocation strength (measured by PMI) were glued together with «::» and now have their own representations (vectors). As a result, our models now contain some bigrams, for example «боб::дилан_NOUN».
The performance of all models was evaluated on the widely known SimLex-999 and Google Analogies test sets.
As you type a query, adaptive hints now suggest words actually present in the models. The absence of hints does not always mean that the word is missing from the models: it may simply not be frequent enough to appear as a hint.
It is now possible to get the similarity measure for a word pair through the API. You can also get the results of associates queries in two formats: JSON and CSV. Learn more on the «About» page!
Many small bugs and errors were fixed. We have possibly added some new errors, but we will eventually fix them too.
The framework under the hood of RusVectōrēs has been released on GitHub as an open-source project, WebVectors. This means that you can now create your own web service similar to RusVectōrēs for the models and languages you are interested in. You can look at an example of such a service for Norwegian and English. In April, we are going to present WebVectors at the demo session of EACL-2017. If you happen to be there, we would be happy to hear your opinion in person!
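For the curious, the bigram gluing mentioned in the list above can be sketched roughly as follows. This is a minimal illustration of PMI-based collocation detection, not our actual preprocessing pipeline: the function name, the `min_count` and `threshold` parameters and the toy corpus are all hypothetical, and PoS tags are left out for simplicity.

```python
import math
from collections import Counter


def glue_bigrams(sentences, min_count=2, threshold=3.0):
    """Join adjacent tokens with '::' when their PMI is at least `threshold`.

    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), estimated from raw counts.
    Both `min_count` and `threshold` are illustrative values for toy data.
    """
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    def pmi(pair):
        p_xy = bigrams[pair] / total_bigrams
        p_x = unigrams[pair[0]] / total_unigrams
        p_y = unigrams[pair[1]] / total_unigrams
        return math.log2(p_xy / (p_x * p_y))

    # Keep only pairs that are both frequent enough and strongly associated.
    strong = {
        pair
        for pair, count in bigrams.items()
        if count >= min_count and pmi(pair) >= threshold
    }

    glued = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and (sent[i], sent[i + 1]) in strong:
                out.append(sent[i] + "::" + sent[i + 1])
                i += 2  # skip the second half of the glued pair
            else:
                out.append(sent[i])
                i += 1
        glued.append(out)
    return glued


# Hypothetical toy corpus of lemmatized sentences.
toy_corpus = [
    ["bob", "dylan", "sings", "a", "song"],
    ["bob", "dylan", "writes", "a", "song"],
    ["a", "man", "sings", "a", "song"],
    ["a", "man", "writes", "a", "letter"],
]

glued_corpus = glue_bigrams(toy_corpus, min_count=2, threshold=3.0)
for sentence in glued_corpus:
    print(" ".join(sentence))
```

With these toy settings only "bob dylan" passes the threshold and becomes the single token "bob::dylan", while frequent but weakly associated pairs such as "a song" stay separate. Once glued, such a token gets its own vector during training, just like any other word.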

We have many plans for improving RusVectōrēs. Subscribe to our RSS feed and stay tuned!

RusVectōrēs team:
Andrey Kutuzov (University of Oslo, Higher School of Economics)
Elizaveta Kuzmenko (Higher School of Economics)
