The Centre for Translation Studies at the University of Leeds develops and hosts a range of large representative corpora in a variety of languages (including English, Arabic, Chinese, French, German, Italian, Japanese, Spanish, Polish and Russian). Some corpora are available in-house only (because of copyright restrictions), while others can be accessed freely. The list of all corpora is available from a separate page.
Intellitext
Intellitext is a recent project funded by AHRC. It produced a versatile and intuitive interface offering a simple step-by-step approach to performing a corpus search. First-time and inexperienced corpus users can use the IntelliText Search Builder and Part-of-Speech Editor to build multi-word phrases and add grammatical information to their corpus queries – without having to enter complex string codes. Users may choose from seven search options:
- Concordance search [all languages]
- Collocation search [all languages]
- Affix search [all languages]
- Comparison of the frequency of two or more competing words or phrases [all languages]
- Frequency lists [all languages]
- Genre classification [German, Japanese, Russian]
- Multivariate analysis [English only]
Click here for the Intellitext interface
The comparable corpus of English and Russian news texts
This website was originally created to provide a query interface to the comparable corpus of English and Russian news texts.
The description of the corpus content
The English corpus is based on a subset of the
corpus of Reuters news, a collection of newswires from Reuters for
one year from 1996-08-20 to 1997-08-19. You can search through the subset
of the corpus made up of texts annotated with general topic codes
(prefixed with 'G' in the Reuters classification). This includes
newswire texts concerning political events (GPOL), crime (GCRI),
entertainment (GENT), etc., but excludes news from markets, unless they
were explicitly annotated with general topic codes by the Reuters corpus
developers. The corpus has been POS tagged and lemmatised using Helmut Schmid's
TreeTagger.
There is some redundancy in the Reuters corpus. Some articles
(my rough estimate is about 10-15%) reuse much of their content from
other articles. This results in identical or almost identical lines in
the output concordance. Take this into account when analysing
results.
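One way to allow for this redundancy is to filter near-identical lines out of a concordance before counting or reading it. The following is a minimal sketch, not part of the interface itself: it assumes the concordance lines have already been saved as plain strings, and both the function name `drop_near_duplicates` and the similarity threshold of 0.9 are hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(lines, threshold=0.9):
    """Keep the first of each group of identical or almost identical lines.

    `threshold` is an assumed cut-off on difflib's similarity ratio (0..1);
    tune it against your own data.
    """
    kept = []  # pairs of (normalised text, original line)
    for line in lines:
        # Ignore differences in letter case and whitespace.
        normalised = " ".join(line.lower().split())
        is_new = all(
            SequenceMatcher(None, normalised, seen).ratio() < threshold
            for seen, _ in kept
        )
        if is_new:
            kept.append((normalised, line))
    return [original for _, original in kept]


# Invented example lines, not taken from the Reuters corpus.
sample = [
    "The minister said the talks would resume on Monday.",
    "The minister said the talks would  resume on Monday.",
    "Police arrested two suspects.",
]
for line in drop_near_duplicates(sample):
    print(line)
```

The comparison is pairwise, so it is only practical for the few hundred or few thousand lines of a typical concordance, not for the whole corpus.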
The Russian corpus is based on articles from
Izvestia, a national broadsheet newspaper, and covers the period
from 2000 to 2001. The POS tagging and lemmatisation of the corpus
has been done using mystem.
The language of Russian newspapers can be compared against the first
version of the Russian Reference Corpus, which consists of about 50
million words and represents a variety of genres in Russian. The Russian
Reference Corpus was also used as the basis for the development of the frequency
dictionary of modern Russian. Its description and download information
are available from a separate page.
The size of the corpora is summarised in the following table:
| Corpus | Size (in words) |
| Reuters subset | 83,491,119 |
| Izvestia | 14,564,884 |
| Russian Reference Corpus | 50,512,584 |
The interface allows you to compare word usage between English and Russian, as well as across two registers
in Russian (the language of newspapers vs. the language of fiction). Even though the corpora differ
in size, the first line of the output shows the relative frequency of
your search term in the corpus you have selected (expressed as the
number of occurrences of the term per million words).
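The per-million normalisation can be reproduced by hand when comparing raw counts across corpora of different sizes. The sketch below uses the corpus sizes from the table above; the function name `per_million` and the raw counts are hypothetical and serve only to illustrate the arithmetic, not to describe how the interface computes its figures.

```python
# Corpus sizes in words, taken from the table above.
CORPUS_SIZES = {
    "Reuters subset": 83_491_119,
    "Izvestia": 14_564_884,
    "Russian Reference Corpus": 50_512_584,
}


def per_million(raw_count: int, corpus: str) -> float:
    """Return the number of occurrences per million words of `corpus`."""
    return raw_count * 1_000_000 / CORPUS_SIZES[corpus]


# Hypothetical raw counts, invented for illustration only.
for corpus in ("Reuters subset", "Izvestia"):
    print(corpus, round(per_million(1_000, corpus), 2))
```

The same raw count of 1,000 hits corresponds to a much higher relative frequency in the smaller Izvestia corpus than in the Reuters subset, which is why the normalised figure is the one to compare.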
The use of the corpus is restricted to research purposes
only. Because of the nature of our agreement with Reuters, we have to
monitor the users of their subcorpus. Interested users are therefore
required to register (registration is free).
Click here to fill in the registration form. If you experience problems
filling in the form, contact Serge Sharoff, s.sharoff
leeds.ac.uk.
Click here to enter the corpus. Please send your comments, suggestions and criticisms to Serge Sharoff, s.sharoff
leeds.ac.uk.
Click here to see the list of other available corpora.