A collection of Russian corpora

A query to Russian corpora

Russian National Corpus (modern, MSD) Parsed RNC (see dep)
Russian Standard Russian Newspapers Parsed ruWac (see dep)
Russian Internet Corpus Russian Livejournal VKontakte Wikipedia Russian Business Corpus
   CQP syntax only (Examples)   Getting help on the query interface
Centre for Translation Studies

Set parameters of your query

Miscellaneous

Cyrillic: Transliterate input   Transliterate output
Default attribute: word    lemma
The page lists four corpora:
  1. a pilot version of the Russian National Corpus (50 million words; a representative collection of various genres; see http://ruscorpora.ru; the mirror here is provided courtesy of the Moscow team); the complete set of sources is available from here;
  2. the corpus of Russian newspapers (70 million words, drawn from several major Russian newspapers, 2001-2004);
  3. the Russian Internet Corpus (160 million words; a snapshot of the modern Russian language as used on the Internet; this is work in progress, similar to the other Internet corpora for Chinese, English and German);
  4. a corpus of Russian fiction (1.5 million words; its morphosyntactic features have been manually disambiguated by a team led by Vladimir Plungyan); the complete set of sources is available from here.

Copyright

The copyright to the texts in the Russian National Corpus resides with their respective publishers and authors. The texts cannot be distributed. You are not allowed to retrieve these texts or portions of them beyond what is considered to be "fair use".

The interface to the corpora was developed by Serge Sharoff; contact me at s.sharoffleeds.ac.uk if you have further queries.

Part of Speech tagging

No definitive tagset for Russian exists. There are several approaches to a computationally viable set of morphological categories. The most recent one is described on the RNC page, but it does not constitute a tagset. Top-level categories from this description (e.g. S for nouns, V for verbs) are used for tagging the corpora listed above. There is an ongoing effort to create a Russian tagset following the MULTEXT-East guidelines. Please contact us if you want to take part in this development; see our TWiki page.
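
As a rough illustration of how the top-level categories might be combined with a lemma in a CQP query, the following Python sketch assembles CQP token expressions as strings. The attribute name `pos` and the helper names `cqp_token` and `cqp_sequence` are assumptions made for this example; the attribute names actually exposed by the corpora may differ, so check the help on the query interface before relying on them.

```python
def cqp_token(**constraints):
    """Build one CQP token expression, e.g. [lemma="наезд" & pos="S"].

    Attribute names are passed through unchanged; `pos` is only an
    assumed name for the part-of-speech attribute.
    Values are inserted verbatim, so they must not contain double quotes.
    """
    if not constraints:
        return "[]"  # in CQP, [] matches any single token
    parts = [f'{attr}="{value}"' for attr, value in constraints.items()]
    return "[" + " & ".join(parts) + "]"


def cqp_sequence(*tokens):
    """Join token expressions into a query for adjacent tokens."""
    return " ".join(tokens)


# A noun (top-level category S) with the lemma "претензия"
noun_query = cqp_token(lemma="претензия", pos="S")

# Any verb (top-level category V) immediately followed by that noun
verb_noun_query = cqp_sequence(cqp_token(pos="V"), noun_query)

print(noun_query)
print(verb_noun_query)
```

The resulting strings could be pasted into the query box with the "CQP syntax only" option; values containing regular-expression metacharacters would need escaping, which this sketch does not attempt.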

Case studies

The following links point to discussions on the Lingvo forums of various topics in English-Russian translation, illustrated with examples from the corpus.
Discussions | English examples in the corpus | Russian examples in the corpus
claims vs. претензии | take issue with, number of issues with, claim to fame | есть претензии к, s boljshimi pretenzijami
вести себя неадекватно | lose one's cool, out of touch, adequate behaviour | вести себя неадекватно, не вполне адекватный
frustration | frustratingly, frustration | отчаяние, неудача, до обидного, досадно, что ...
"Сухой остаток" and "the bottom line" | the bottom line | сухой остаток
historically | historically, typically | исторически, традиционно
Congratulations | congratulation, congratulate, self-congratulatory | поздравить, самодоволь*, хвастливый
Задел, наработка | pilot project, exploratory/preliminary work/study | задел, наработка
Naezd | harassment, shakedown | наезд
оперативный | day-to-day/ongoing basis, regular updates, prompt | оперативный, по оперативным данным, оперативность
preemptive vs. preventive | preemptive, preemption, preventive | упреждающий, превентивный