This page lists resources that can be used for automatic tagging and morphosyntactic disambiguation of Russian.
Historically, research on morphological analysis and disambiguation of Russian can be traced back to the very beginning of computational linguistics. The first programs of this sort were developed in the 1950s in the context of machine translation. A milestone in this research was Zalizniak's Grammatical Dictionary of Russian (Zalizniak, 1977), which provided a formal model of the diverse Russian morphology and led to a large number of implemented programs for Russian analysis and synthesis (e.g., Dialing, Mystem, Starling). However, these studies have not resulted in any actual tagset (as opposed to a set of formal morphological categories).
Similar attempts to study the efficiency of HMM tagging have been made in:
However, they have not resulted in publicly available tagging resources (only a rule-based parser is available for the first experiment).
The tagging resources available from this page are based on a fairly large Russian tagset that follows the guidelines of the Multext East project (http://nl.ijs.si/ME/).
The basic idea is that each major category (Noun, Verb, Adjective, etc.) has a fixed set of attributes (case, number, gender, animacy, etc.). These can be encoded as attribute-value pairs or, more compactly, as single-word MorphoSyntactic Descriptions (MSDs), in which each attribute occupies a fixed position and is expressed by a one-letter code. For instance, the attribute-value specification
Category = Noun, Type = common, Gender = masculine, Number = singular, Case = accusative, Animate = no
corresponds to the MSD tag Ncmsan.
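As a rough illustration of the positional encoding, the following Python sketch converts the noun specification above into its MSD string and back. The attribute order and the one-letter codes listed here are assumptions that cover only this example; the draft specification of the tagset remains the authority on the full inventory.

```python
# Illustrative subset of the Noun category: attribute order and one-letter
# codes are assumptions covering the example in the text, not the full tagset.
NOUN_ATTRIBUTES = [
    ("Type", {"common": "c", "proper": "p"}),
    ("Gender", {"masculine": "m", "feminine": "f", "neuter": "n"}),
    ("Number", {"singular": "s", "plural": "p"}),
    ("Case", {"nominative": "n", "genitive": "g", "dative": "d",
              "accusative": "a", "instrumental": "i", "locative": "l"}),
    ("Animate", {"no": "n", "yes": "y"}),
]


def encode_noun(spec):
    """Build a positional MSD string from an attribute-value dictionary."""
    msd = "N"  # the first position holds the category code
    for name, codes in NOUN_ATTRIBUTES:
        msd += codes[spec[name]]
    return msd


def decode_noun(msd):
    """Expand a noun MSD string back into attribute-value pairs."""
    if not msd.startswith("N") or len(msd) != len(NOUN_ATTRIBUTES) + 1:
        raise ValueError(f"not a full noun MSD in this sketch: {msd!r}")
    spec = {"Category": "Noun"}
    for (name, codes), letter in zip(NOUN_ATTRIBUTES, msd[1:]):
        by_letter = {code: value for value, code in codes.items()}
        spec[name] = by_letter[letter]
    return spec


spec = {"Type": "common", "Gender": "masculine", "Number": "singular",
        "Case": "accusative", "Animate": "no"}
print(encode_noun(spec))      # Ncmsan
print(decode_noun("Ncmsan"))
```

Because every attribute has a fixed position, decoding needs no separators: the letter at each position is looked up in the table for that attribute.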
A short description of the tagset and evaluation of the resources is available in:
Serge Sharoff, Mikhail Kopotev, Tomaz Erjavec, Anna Feldman, Dagmar Divjak. Designing and evaluating Russian tagsets. In Proc. LREC 2008, Marrakech, May 2008. lrec2008-msd.pdf
The current draft specification of the tagset is available from msd-ru.html.
You can download the following resources:
All files use UTF-8 encoding. They have been trained on the disambiguated version of the Russian National Corpus (http://www.ruscorpora.ru), which cannot be made available for copyright reasons. However, the tagged files from the Internet sample can provide a basis for further training. You are welcome to investigate problems in each tagged file or combine them (e.g., by majority voting) to produce a better training corpus.
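A minimal sketch of the majority-voting idea in Python follows. It assumes that the outputs of the different taggers have already been read into tag sequences aligned token by token; the function name, the tie-breaking rule (the earliest tagger in the list wins) and the sample tags are illustrative assumptions, not part of the distributed resources.

```python
from collections import Counter


def majority_vote(taggings):
    """Combine aligned tag sequences, one per tagger, into a single sequence.

    taggings: list of lists of tags; all lists must have the same length.
    Ties are resolved in favour of the tagger listed first (an assumption
    of this sketch).
    """
    if len({len(tags) for tags in taggings}) != 1:
        raise ValueError("tag sequences must be aligned token by token")
    voted = []
    for tags in zip(*taggings):  # the tags proposed for one token
        counts = Counter(tags)
        top = max(counts.values())
        # pick the first proposal (in tagger order) that reaches the top count
        voted.append(next(tag for tag in tags if counts[tag] == top))
    return voted


# Hypothetical outputs of three taggers for the same two tokens
tagger_a = ["Ncmsan", "Vmip3s"]
tagger_b = ["Ncmsnn", "Vmip3s"]
tagger_c = ["Ncmsan", "Vmif3s"]
print(majority_vote([tagger_a, tagger_b, tagger_c]))
```

Listing the most trusted tagger first makes the tie-breaking rule favour it whenever the taggers disagree completely.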
The resources for TreeTagger can perform lemmatisation themselves if you pass the -lemma option when tagging (this is based on a form+pos-to-lemma lexicon). However, this mechanism does not help with the lemmatisation of unknown word forms. Bart Jongejan and his colleagues from the Danish Center for Sprogteknologi developed CSTlemma, a tool that learns morphosyntactic rules from form+pos+lemma triples.
My lemmatisation tool is a wrapper around CSTlemma (which has to be downloaded and installed separately). The tool takes the output of TnT or TreeTagger, assigns lemmas from a dictionary in the TreeTagger format (which can be gzipped to save space) and uses cstlemma to guess unknown inflected forms (adjectives, nouns and verbs). A remark for guessed forms is left in the fourth column.
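The lookup-then-guess logic described above can be sketched in Python as follows. This is not the tool's actual code: the lexicon is simplified to a dictionary keyed by (form, tag), `guess` is a hypothetical stand-in for a call to CSTlemma, the remark text "guessed" is invented for illustration, and returning the form itself for unknown words of other categories is an assumption.

```python
OPEN_CLASSES = ("N", "V", "A")  # first MSD letter: nouns, verbs, adjectives


def lemmatise(tagged, lexicon, guess):
    """Return (form, tag, lemma, remark) rows for a tagged token sequence.

    tagged:  iterable of (form, tag) pairs from a tagger
    lexicon: dict mapping (form, tag) to a lemma
    guess:   callable (form, tag) -> lemma, standing in for CSTlemma
    """
    rows = []
    for form, tag in tagged:
        lemma = lexicon.get((form, tag))
        if lemma is not None:
            rows.append((form, tag, lemma, ""))  # known form
        elif tag.startswith(OPEN_CLASSES):
            # unknown inflected form: guess and mark it in the fourth column
            rows.append((form, tag, guess(form, tag), "guessed"))
        else:
            rows.append((form, tag, form, ""))  # assumed fallback
    return rows


# Toy data: the tags and the guesser are placeholders for illustration only
toy_lexicon = {("столы", "Ncmpnn"): "стол"}
toy_guess = lambda form, tag: form[:-1]
tokens = [("столы", "Ncmpnn"), ("глокие", "Afpmpnf"), ("и", "C")]
for row in lemmatise(tokens, toy_lexicon, toy_guess):
    print("\t".join(row))
```

Keeping the remark in a separate column lets downstream tools filter or re-check guessed lemmas without re-running the lookup.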
The script for running the Russian Malt parser on a text encoded in UTF-8 is invoked as:
./russian-malt.sh <input >output
In a competition of Russian dependency parsers in 2012, this simple parser produced fairly reliable results, ranking third out of eight by F-measure.
The parser has been applied to parse ruWac, a 2-billion-word corpus of Russian (a representative snapshot of the Russian Web). The parsed file is available from here (warning: this downloads 9 GB of compressed text).
To refer to the parser, please use:
Sharoff, S., Nivre, J. (2011) The proper place of men and machines in language technology: Processing Russian without any linguistic knowledge. Proc. Dialogue 2011, Russian Conference on Computational Linguistics. PDF