Leeds Uni

Large Corpora used in CTS

Centre for Translation Studies
The website http://corpus.leeds.ac.uk/ was originally designed to host comparable English and Russian corpora, but over time we have accumulated a variety of large corpora supported by a uniform search interface, "Leeds CQP", which is a CGI Perl frontend to the IMS Corpus Workbench. Tools developed to work with corpora are listed on a separate page.

Monolingual corpora

English

  1. I-EN, a corpus of about 160 million words. This corpus was compiled automatically from the Internet in 2005 along with other Internet corpora (for Chinese, French, German, Italian, Spanish, Polish and Russian).
  2. I-EN-CC, a corpus of about 160 million words consisting of pages labeled with a Creative Commons License. This means that the collection can be downloaded and reused in your research.
  3. The British National Corpus (BNC), a classic collection of samples of modern British English, 100 million words.
  4. The Reuters corpus, a collection of newswires from Reuters for one year from 1996-08-20 to 1997-08-19, 90 million words.
  5. A corpus of British News, a collection of news stories from 2004 from each of the four major British newspapers: Guardian/Observer, Independent, Telegraph and Times, 200 million words.
Since the BNC and Reuters corpora require an agreement to monitor their users, the interface is password-protected: http://corpus.leeds.ac.uk/protected/query.html

Russian

  1. The Russian National Corpus, a collection of texts comparable to the BNC in its design. Its pilot version has 100 million words (a more detailed description of the project is available in Russian from http://ruscorpora.ru).
  2. Russian Internet Corpus, a corpus of about 90 million words. This corpus was compiled automatically from the Internet in February-April 2005 along with other Internet corpora.
  3. a corpus of Russian newspapers, 78 million words (Izvestia, Trud and Strana.ru).
  4. the Russian Standard, a corpus of modern Russian fiction with manual disambiguation of morphological categories, 1.6 million words.
The interface to Russian corpora is available from http://corpus.leeds.ac.uk/ruscorpora.html

Chinese

  1. Chinese Internet Corpus, a corpus of about 90 million words. This corpus was compiled automatically from the Internet in February-April 2005 along with other Internet corpora.
  2. a fragment of the LDC Chinese Gigaword corpus, 35 million words, tokenised and lemmatised using the NEUCSP tool from NLP Lab, North-Eastern University, China; the selection includes newswires for one year (2001), which makes it comparable to the Reuters corpus.
  3. Guo Jin's Chinese PH corpus, which is based on XINHUA news from 1990; segmentation done by Chris Brew and Julia Hockenmaier, 2.5 million words.
  4. Lancaster Corpus of Mandarin Chinese, a corpus of about 1 million words, which is comparable in its design to Brown and LOB type corpora. Created by Tony McEnery and Richard Xiao, distributed by the European Language Resources Association (Cat. No ELRA-W0039) and the Oxford Text Archive (Cat. No 2474).
The interface to Chinese corpora is available from http://corpus.leeds.ac.uk/query-zh.html

Multilingual aligned corpora

  1. English-Russian, Russian-English fiction; a small parallel corpus of English and Russian fiction from the 19th century (aligned by A. Kretov, Voronezh);
  2. English-German corpus of European Parliament Proceedings; source texts were taken from Phil Köhn's page
  3. German-English Parallel Corpus "de-news"; also taken from Phil Köhn's page
  4. English-Japanese corpus of Yomiuri data (it is available in-house only)

Internet corpora

There are few large general corpora of the size of the BNC (100 million words) available. Within the Wacky (Web as Corpus) project we developed a set of procedures for collecting corpora from the Internet and collected large representative corpora for Arabic, Chinese, French, German, Italian, Spanish, Polish and Russian, with the search interface available from http://corpus.leeds.ac.uk/internet.html.

The query interface to all corpora is powered by the IMS Corpus Workbench, but it has been extended to simplify some frequent cases, in particular querying for lemmas and for exact word forms (all corpora have word, pos and lemma attributes, even if the latter is redundant for Chinese). Other possibilities include calculating the most significant collocations (using MI, T and log-likelihood scores) and searching for similar contexts in English, German and Russian corpora.
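The three collocation scores mentioned above can be illustrated with a minimal sketch. It uses the textbook definitions of MI, T-score and log-likelihood computed from a 2x2 contingency table; whether the Leeds interface uses exactly these variants (for example, the same window size or smoothing) is an assumption, and the function name and the counts in the usage note are hypothetical.

```python
import math


def association_scores(f_node, f_collocate, f_pair, n_tokens):
    """Return (MI, T-score, log-likelihood) for one word pair.

    f_node, f_collocate: corpus frequencies of the two words
    f_pair: how often they co-occur (must be greater than 0)
    n_tokens: corpus size in tokens
    These are standard textbook formulas, not necessarily the exact
    ones implemented in the Leeds interface.
    """
    # Co-occurrence frequency expected if the two words were independent
    expected = f_node * f_collocate / n_tokens

    mi = math.log2(f_pair / expected)
    t_score = (f_pair - expected) / math.sqrt(f_pair)

    # Observed 2x2 contingency table, row by row
    observed = [
        f_pair,                                    # node with collocate
        f_node - f_pair,                           # node without collocate
        f_collocate - f_pair,                      # collocate without node
        n_tokens - f_node - f_collocate + f_pair,  # neither word
    ]
    rows = [f_node, n_tokens - f_node]
    cols = [f_collocate, n_tokens - f_collocate]
    expected_cells = [r * c / n_tokens for r in rows for c in cols]

    # G2 = 2 * sum(O * ln(O / E)); empty cells contribute nothing
    log_likelihood = 2 * sum(
        o * math.log(o / e)
        for o, e in zip(observed, expected_cells)
        if o > 0
    )
    return mi, t_score, log_likelihood
```

For instance, `association_scores(1000, 2000, 100, 1_000_000)` scores a pair that co-occurs 100 times where only 2 co-occurrences would be expected by chance. MI tends to favour rare pairs, while the T-score and log-likelihood favour frequent ones, which is one reason to offer all three.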

The interface was developed by Serge Sharoff; contact me at s.sharoff@leeds.ac.uk if you have further queries.

Frequency lists

For some corpora I also computed the frequency lists (all lists use UTF-8 encoding):

There is also a frequency list of Georgian produced by Garold Shmaltsel and Givi Nozadze.

The structure of the lists follows the template of the lemmatised BNC lists produced by Adam Kilgarriff, namely:

[word rank] [normalised frequency] [lemma, word form or POS]
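A line in this format can be read with a few lines of Python. This is only a sketch: it assumes the three fields are separated by whitespace, and the sample line in the usage note is made up for illustration rather than copied from an actual list. The names `FrequencyEntry` and `parse_frequency_line` are hypothetical.

```python
from typing import NamedTuple


class FrequencyEntry(NamedTuple):
    rank: int    # position of the item in the list
    ipm: float   # normalised frequency (instances per million words)
    item: str    # lemma, word form or POS tag


def parse_frequency_line(line):
    """Parse one '[rank] [frequency] [item]' line.

    Assumes whitespace-separated fields; the item is taken to be
    everything after the second field.
    """
    rank, ipm, item = line.split(maxsplit=2)
    return FrequencyEntry(int(rank), float(ipm), item.strip())
```

For example, `parse_frequency_line("1234 47.17 browser")` gives an entry with `rank` 1234, `ipm` 47.17 and `item` "browser". Since the lists use UTF-8, files should be opened with `encoding="utf-8"`.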

Note that the frequency has been normalised to ipm: the number of instances of an individual word or POS tag per million words in the respective corpus. Normalisation makes it possible to compare frequencies in the BNC against the Internet corpus. If you want to know the actual number of occurrences of a word listed there, multiply the frequency by the corpus size in million words (the size of a corpus is shown at the top of its frequency list). For instance, "browser" is used about 8556 times in the English Internet Corpus (47.17 × 181.376 ≈ 8556).
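The conversion between ipm and raw counts is simple arithmetic; a short sketch using the figures from the example above is given below. The function names are hypothetical.

```python
def ipm_to_count(ipm, corpus_size_millions):
    """Approximate raw count from a normalised frequency (ipm)."""
    return ipm * corpus_size_millions


def count_to_ipm(count, corpus_size_tokens):
    """Normalise a raw count to instances per million words."""
    return count / corpus_size_tokens * 1_000_000


# "browser" in the English Internet Corpus: 47.17 ipm, 181.376 million words
print(round(ipm_to_count(47.17, 181.376)))  # prints 8556
```

The result is approximate because the published frequency is itself rounded to two decimal places.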

Finally, we have lists of distributionally similar words for English, German and Russian (words are said to be distributionally similar if they share a significant number of collocates in the corpus). The lists have been produced by Reinhard Rapp using Singular Value Decomposition (SVD).

The lists are distributed under the Creative Commons Attribution license.