Here you can download time-specific sub-corpora of the Russian National Corpus (RNC). Together, these three sub-corpora make up nearly the entire RNC.
The RNC is well balanced, contains Russian texts of diverse genres, and can be extremely useful for training NLP models.
For many years, the Russian National Corpus was available only for online browsing and searching.
Now the raw RNC texts are published online for the first time.
The sub-corpora are provided as gzipped plain-text files, one sentence per line.
Sentences are shuffled due to copyright restrictions.
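As an illustration of the file format described above (gzipped plain text, one sentence per line), here is a minimal Python sketch for streaming sentences without decompressing the whole file to disk. The function name `iter_sentences` is hypothetical, and the UTF-8 encoding is an assumption rather than something stated in the corpus documentation; adjust it if the files turn out to use a different encoding.

```python
import gzip
from typing import Iterator


def iter_sentences(path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield sentences from a gzipped text file that stores one sentence per line.

    The default encoding (UTF-8) is an assumption, not a documented property
    of the RNC files.
    """
    # Mode "rt" makes gzip decode the compressed stream as text, line by line,
    # so memory use stays small even for large files.
    with gzip.open(path, mode="rt", encoding=encoding) as handle:
        for line in handle:
            sentence = line.strip()
            if sentence:  # skip blank lines, if any
                yield sentence
```

A call such as `for sentence in iter_sentences("subcorpus.txt.gz"): ...` would then process one sentence at a time; the file name here is a placeholder, not an actual RNC file name. Because the sentences are shuffled, the stream suits sentence-level training, but the original document order cannot be recovered from it.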
Please visit this page to sign the license agreement and download the corpora.
This work was supported by the Ministry of Science and Higher Education of the Russian Federation under Agreement No. 075-15-2020-793.
This dataset is a post-processed fragment of the main subcorpus database; the main subcorpus is part of the Russian National Corpus.
The original database is registered with the Federal Institute of Industrial Property, Russia (certificate 2009620185 dated 21.04.2009).