Russian statistical taggers and parsers

by Serge Sharoff in cooperation with Tomaz Erjavec, Anna Feldman, Mikhail Kopotev, Dagmar Divjak, Joakim Nivre

Introduction to Russian tagging

This page lists resources that can be used for automatic tagging and morphosyntactic disambiguation of Russian.

Historically, research on morphological analysis and disambiguation of Russian can be traced back to the very beginning of computational linguistics. The first programs of this sort were developed in the 1950s in the context of machine translation. A milestone in this research was Zalizniak's Grammatical Dictionary of Russian (Zalizniak, 1977), which provided a formal model of the diverse Russian morphology and led to a large number of implemented programs for Russian analysis and synthesis (e.g., Dialing, Mystem, Starling). However, these studies have not resulted in any actual tagset (as opposed to a set of formal morphological categories).

Similar attempts to study the efficiency of HMM tagging have been made in:

However, they have not resulted in publicly available tagging resources (only a rule-based parser is available for the first experiment).

The tagging resources available from this page are based on a fairly large Russian tagset that follows the guidelines of the Multext East project (http://nl.ijs.si/ME/).

The basic idea is that for each major category (Noun, Verb, Adjective, etc.) we have a fixed set of attributes (case, number, gender, animacy, etc.), which can be encoded by attribute-value pairs or, in a more compact way, by single-word MorphoSyntactic Descriptions (MSDs), in which the position of each attribute is fixed and its value is expressed by a one-letter code. For instance, the attribute-value specification

Category=Noun, Type = common, Gender = masculine, Number = singular, Case = accusative, Animate = no

corresponds to the MSD tag Ncmsan.
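
The positional encoding can be illustrated with a minimal Python sketch. It decodes noun MSDs only, using the attribute order from the example above. Only the codes that occur in Ncmsan are taken from this page; the other one-letter codes, the use of '-' for a non-applicable attribute and the function name decode_noun_msd are assumptions for illustration and should be checked against the tagset specification.

```python
# Positional attributes for nouns, in the order used in the example above.
NOUN_ATTRIBUTES = ["Type", "Gender", "Number", "Case", "Animate"]

# One-letter codes per attribute. Only the codes occurring in "Ncmsan" come
# from the text; the remaining codes are assumptions for illustration.
NOUN_CODES = {
    "Type": {"c": "common", "p": "proper"},
    "Gender": {"m": "masculine", "f": "feminine", "n": "neuter"},
    "Number": {"s": "singular", "p": "plural"},
    "Case": {"n": "nominative", "g": "genitive", "d": "dative",
             "a": "accusative", "i": "instrumental", "l": "locative"},
    "Animate": {"n": "no", "y": "yes"},
}


def decode_noun_msd(msd):
    """Expand a noun MSD such as 'Ncmsan' into attribute-value pairs."""
    if not msd.startswith("N"):
        raise ValueError("not a noun MSD: %r" % msd)
    result = {"Category": "Noun"}
    for attribute, code in zip(NOUN_ATTRIBUTES, msd[1:]):
        if code == "-":
            continue  # assumed convention: '-' marks a non-applicable attribute
        result[attribute] = NOUN_CODES[attribute].get(code, "unknown code " + code)
    return result


print(decode_noun_msd("Ncmsan"))
```

Because the position of each attribute is fixed, the same table-driven approach extends to other categories by adding one attribute list and one code table per category.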

A short description of the tagset and evaluation of the resources is available in:

Serge Sharoff, Mikhail Kopotev, Tomaz Erjavec, Anna Feldman, Dagmar Divjak, Designing and evaluating Russian tagsets, In Proc. LREC 2008, Marrakech, May, 2008. lrec2008-msd.pdf

The current draft specification of the tagset is available from msd-ru.html.

Russian tagging resources

You can download the following resources:

  1. russian.par.gz - a parameter file to be used with TreeTagger; to use it you need a tokeniser and TreeTagger itself, available from http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
  2. russian-tnt.tgz - parameter files to be used with TnT, http://www.coli.uni-saarland.de/~thorsten/tnt/
  3. russian-svm.tgz - parameter files to be used with SVMTagger, http://www.lsi.upc.es/~nlp/SVMTool/
  4. ru-table.tab - a table mapping MSDs to a combination of POS categories (tab-separated file format);
  5. i-ru-sample.txt.gz - a Russian corpus for tagging experiments. It is a 5-million-word subset of the Russian Internet Corpus (Sharoff, 2006 in http://wackybook.sslmit.unibo.it/);
  6. several tagged texts produced from the Russian sample by TreeTagger (i-ru-sample-tt.out.gz), TnT (i-ru-sample-tnt.out.gz) and SVMTagger (Model 0, i-ru-sample-svm0.out.gz; Model 2, i-ru-sample-svm2.out.gz; Model 5, i-ru-sample-svm5.out.gz)
  7. russian-small.par.gz - a parameter file for TreeTagger that uses a small tagset (only the basic POS tags are distinguished according to the model of http://www.ruscorpora.ru/)

All files use UTF-8 encoding. They have been trained on the disambiguated version of the Russian National Corpus (http://www.ruscorpora.ru), which cannot be made available because of copyright reasons. However, the tagged files from the Internet sample can provide a basis for further training. You are welcome to investigate problems in each tagged file or combine them (e.g., by majority voting) to produce a better training corpus.
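
Combining the tagged files by majority voting could look roughly like the following Python sketch. It assumes that the outputs of the taggers have already been read into lists of tags aligned token by token; the file reading step is left out because the exact column layout of each output file is not described here. The function name majority_vote, the tie-breaking rule and the toy tags are illustrative assumptions.

```python
from collections import Counter


def majority_vote(tag_sequences):
    """Combine aligned tag sequences token by token.

    tag_sequences is a list of equally long lists of tags, one per tagger.
    Ties are resolved in favour of the tagger listed first (an assumption;
    any other tie-breaking rule could be used instead).
    """
    lengths = {len(sequence) for sequence in tag_sequences}
    if len(lengths) != 1:
        raise ValueError("tagger outputs are not aligned token by token")
    combined = []
    for tags in zip(*tag_sequences):
        counts = Counter(tags)
        best = max(counts.values())
        # Pick the first tag, in tagger order, that reaches the top count.
        combined.append(next(tag for tag in tags if counts[tag] == best))
    return combined


# Toy tag sequences for three tokens (not real tagger output).
treetagger = ["Ncmsnn", "Ncfsgn", "Ncmpnn"]
tnt = ["Ncmsan", "Ncfsgn", "Ncfpnn"]
svmtagger = ["Ncmsan", "Ncfpnn", "Ncmsgn"]

print(majority_vote([treetagger, tnt, svmtagger]))
```

In the toy data the first two tokens are decided by a two-to-one vote, while the third token has three different tags and falls back to the first tagger.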

Lemmatisation

The situation with lemmatisation in Russian is also complex. Many word forms can map to several lemmas. Fortunately, the combination of a word form with an appropriate POS tag often leads to only one lemma. If the tagset can discriminate between many syntactic classes, the mapping can be completely free of ambiguity. For instance, the word form стали can be either a noun (сталь, 'steel') or a verb (стать, 'to become'), so a basic POS tag is enough to select the lemma. However, a tagset distinguishing between only the basic parts of speech is not capable of mapping word forms like банки or физику to the right lemma (банк vs банка; физик vs физика). A more extensive tagset distinguishing nouns by their gender can do this task (provided that the tagger assigns the right tag).
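
The effect of tagset granularity on lemma ambiguity can be shown with a small Python sketch. The lexicon entries below are hypothetical and the tags are written in the positional style described above; they are not taken from the distributed lexicons. The helper build_index and its tag_length parameter are likewise assumptions made for this illustration.

```python
from collections import defaultdict

# Hypothetical lexicon entries: (word form, tag, lemma).
ENTRIES = [
    ("банки", "Ncmpnn", "банк"),   # masculine noun, plural nominative
    ("банки", "Ncfpnn", "банка"),  # feminine noun, plural nominative
    ("банки", "Ncfsgn", "банка"),  # feminine noun, singular genitive
]


def build_index(entries, tag_length=None):
    """Map (form, tag) to the set of possible lemmas.

    If tag_length is given, tags are truncated to that many characters,
    which simulates a coarser tagset.
    """
    index = defaultdict(set)
    for form, tag, lemma in entries:
        key_tag = tag if tag_length is None else tag[:tag_length]
        index[(form, key_tag)].add(lemma)
    return index


full = build_index(ENTRIES)                  # full tags, gender is visible
coarse = build_index(ENTRIES, tag_length=1)  # basic part of speech only

print(sorted(full[("банки", "Ncmpnn")]))
print(sorted(coarse[("банки", "N")]))
```

With the full tags each (form, tag) pair maps to a single lemma, whereas the coarse index returns both банк and банка for the same word form.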

The resources for TreeTagger can do lemmatisation by themselves, if you use -lemma for tagging (this is based on a form+pos-to-lemma lexicon). However, this mechanism does not help with lemmatisation of unknown word forms. Bart Jongejan and his colleagues from the Danish Center for Sprogteknologi developed CSTlemma, a tool that learns morphosyntactic rules from form+pos+lemma triples.

My lemmatisation tool is a wrapper around CSTlemma (which has to be downloaded and installed separately). The tool takes the output of TnT or TreeTagger, assigns lemmas from a dictionary in the TreeTagger format (which can be gzipped to save space) and uses cstlemma to guess unknown inflected forms (adjectives, nouns and verbs). A remark for guessed forms is left in the fourth column.

Russian dependency parsing

Statistical approaches can also be used to create parsers. In cooperation with Joakim Nivre I have created a Russian dependency parser for the Malt Parser (version 1.5). The training corpus for the parser was SynTagRus as developed by Igor Boguslavsky, Leonid Iomdin and their colleagues (http://cl.iitp.ru/). This means that the set of dependency relations is the same as in the output of the ETAP parser; for an overview of the labels see the Russian National Corpus (in Russian).

The script for running the Russian Malt parser on a text encoded in UTF-8 is invoked as:

./russian-malt.sh <input >output

In a competition of Russian dependency parsers in 2012, this simple parser produced fairly reliable results, ranking third out of eight by F-measure.

The parser has been applied to ruWac, a 2-billion-word corpus of Russian (a representative snapshot of the Russian Web). The parsed file is available from here (warning: this downloads 9GB of compressed text).

To refer to the parser, please use:

Sharoff, S., Nivre, J. (2011) The proper place of men and machines in language technology: Processing Russian without any linguistic knowledge. Proc. Dialogue 2011, Russian Conference on Computational Linguistics. PDF



Serge Sharoff 2012-05-25