Centre for Translation Studies

Use of corpora in translation studies

The Centre for Translation Studies, University of Leeds, develops and hosts a range of large representative corpora in a variety of languages (including English, Arabic, Chinese, French, German, Italian, Japanese, Spanish, Polish and Russian). Some corpora are available in-house only (because of copyright restrictions), while others can be accessed freely. The list of all corpora is available from a separate page.


Intellitext is a recent project funded by the AHRC. It produced a versatile and intuitive interface that offers a simple step-by-step approach to performing a corpus search. First-time and inexperienced corpus users can use the IntelliText Search Builder and Part-of-Speech Editor to build multi-word phrases and add grammatical information to their corpus queries without having to enter complex string codes. Users may choose from seven search options.

Click here for the Intellitext interface

The comparable corpus of English and Russian news texts

The English corpus is based on a subset of the Reuters news corpus, a collection of Reuters newswires covering one year, from 1996-08-20 to 1997-08-19. You can search through the subset of the corpus made up of texts annotated with general topic codes (prefixed with 'G' in the Reuters classification). This includes newswire texts concerning political events (GPOL), crime (GCRI), entertainment (GENT), etc., but excludes news from markets, unless the Reuters corpus developers explicitly annotated them with general topic codes. The corpus has been POS-tagged and lemmatised using Helmut Schmid's TreeTagger.

There is some level of redundancy in the Reuters corpus. Some articles (my rough estimate is about 10-15%) reuse much of their content from other articles. This results in identical or almost identical lines in the output concordance. Take this into account when analysing results.
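One way to allow for this redundancy is to collapse repeated lines before counting or reading a concordance. The following is only a minimal sketch, assuming the concordance has been saved as a list of plain-text lines; the function names `normalise` and `drop_repeated_lines` are hypothetical and are not part of the corpus interface. It removes lines that are identical after whitespace and case are normalised, so almost identical lines (for example, those differing by one word) will still remain.

```python
import re


def normalise(line: str) -> str:
    """Lower-case a line and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", line).strip().lower()


def drop_repeated_lines(lines):
    """Keep the first occurrence of each line, comparing normalised text.

    Hypothetical helper: only exact repeats (after normalisation) are dropped.
    """
    seen = set()
    unique = []
    for line in lines:
        key = normalise(line)
        if key not in seen:
            seen.add(key)
            unique.append(line)
    return unique


# Invented example lines, for illustration only (not taken from the corpus).
concordance = [
    "the minister said the talks would resume",
    "The minister said the  talks would resume",
    "police said the talks had failed",
]
for line in drop_repeated_lines(concordance):
    print(line)
```

Comparing the number of lines before and after this step gives a rough idea of how much of a given result set is repeated material.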

The Russian corpus is based on articles from Izvestia, a national broadsheet newspaper, and covers the period from 2000 to 2001. The corpus was POS-tagged and lemmatised using mystem.

The language of Russian newspapers can be compared against the first version of the Russian Reference Corpus, which consists of about 50 million words and represents a variety of genres in Russian. The Russian Reference Corpus was also used as the basis for developing the frequency dictionary of modern Russian; its description and download information are available from a separate page.

The sizes of the corpora are summarised in the following table:

Corpus                      Size (in words)
Reuters subset              83,491,119
Russian Reference Corpus    50,512,584

The interface allows you to compare word usage between English and Russian, as well as across two registers in Russian (the language of newspapers vs. the language of fiction). Because the corpora differ in size, the first line of the output shows the relative frequency of your search term in the corpus you have selected, expressed as the number of occurrences per million words.
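The per-million figure can be reproduced by hand, which is useful when comparing raw counts across corpora of different sizes. The sketch below is illustrative only: the corpus sizes come from the table above, while the function name `per_million` and the raw hit counts are assumptions made up for the example, not output of the interface.

```python
# Corpus sizes in words, as given in the table of corpus sizes.
CORPUS_SIZES = {
    "Reuters subset": 83_491_119,
    "Russian Reference Corpus": 50_512_584,
}


def per_million(hits: int, corpus_size: int) -> float:
    """Return the number of occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return hits * 1_000_000 / corpus_size


# Hypothetical raw counts for a search term, invented for illustration.
raw_hits = {
    "Reuters subset": 8_349,
    "Russian Reference Corpus": 5_051,
}

for name, hits in raw_hits.items():
    rate = per_million(hits, CORPUS_SIZES[name])
    print(f"{name}: {rate:.1f} per million words")
```

In this invented case the raw counts differ considerably, yet both work out to roughly the same rate per million words, which is why the relative figure is the one to compare across corpora.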

