A collection of Russian corpora
The page lists four corpora:
a pilot version of the Russian National Corpus (50 million words (MW); a representative collection of various genres, see http://ruscorpora.ru; the mirror here is provided by courtesy of the Moscow team); the complete set of sources is available from here;
the corpus of Russian newspapers (70 MW, consisting of several major Russian newspapers, 2001-2004);
the Russian Internet Corpus (160 MW, a snapshot of modern Russian language as used on the Internet; this is work in progress, which is similar to other Internet Corpora for Chinese, English and German);
a corpus of Russian fiction (1.5 MW; its morphosyntactic features have been manually disambiguated by a team led by Vladimir Plungyan); the complete set of sources is available from here.
The copyright to the texts in the Russian National Corpus resides with their respective publishers and authors. The texts cannot be distributed. You are not allowed to retrieve these texts or portions of them beyond what is considered to be "fair use".
The interface to the corpora was developed by Serge Sharoff; contact me at s.sharoff@leeds.ac.uk if you have further queries.
Part of Speech tagging
No definitive tagset for Russian exists. There are several approaches to a computationally viable set of morphological categories; the most recent one is described on the RNC page, but it does not constitute a tagset. Top-level categories from this description (e.g. S for nouns, V for verbs) are used for tagging the corpora listed above. There is an ongoing effort to create a Russian tagset following the MULTEXT-East guidelines. If you want to take part in this development, please contact us; see our TWiki page.
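As an illustration of working with top-level categories only, the following Python sketch reduces a tag string to its leading category label. The tag layout it assumes (a category label followed by features separated by `,` or `=`) and the sample strings are hypothetical, not a description of the actual corpus output; only the labels S and V are taken from the text above.

```python
import re

# Only these two labels are given in the text above;
# any other label is reported as "unknown" rather than guessed.
TOP_LEVEL = {"S": "noun", "V": "verb"}


def top_level_category(tag: str) -> str:
    """Return the leading category label of a tag string.

    Assumption: a full tag starts with the category label, optionally
    followed by features separated by ',' or '='.
    """
    return re.split(r"[,=]", tag.strip(), maxsplit=1)[0]


def describe(tag: str) -> str:
    """Map a tag to a readable category name, or 'unknown'."""
    return TOP_LEVEL.get(top_level_category(tag), "unknown")
```

For example, `describe("S,m,anim=nom,sg")` looks up only the leading `S`, so queries or frequency counts can be run at the level of coarse categories without committing to any particular feature inventory.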
The following links point to discussions on various topics pertaining to English-Russian translation on the Lingvo forums and their illustrations with examples from the corpus.