A collection of Chinese corpora and frequency lists

Corpora: Chinese Internet, Lancaster Corpus of Mandarin Chinese, Corpus of business Chinese
Centre for Translation Studies


Frequency lists

For the Internet corpus and LCMC I have also computed frequency lists (both use the UTF-8 encoding). Note that the tokenisation of texts into words follows the rules used in each corpus. As a result, the tokenisation is sometimes not compatible between the two corpora, and some "words" in the frequency list of the Internet corpus can be parts of "real" Chinese words.

Learners of Chinese frequently ask about the frequency of individual characters (as this helps to order them in a reasonable sequence for learning). Numerous lists of common characters are available in various dictionaries (Oxford Dictionary, Wenlin or various online sources). They are often treated as absolute, although they obviously depend on the underlying corpus (the list in the Oxford Dictionary, for example, is skewed towards newspaper texts). The Chinese Internet corpus is a snapshot of the Chinese Web from 2005. The frequency list of characters derived from it might be more general (though still not ideal). The list of characters is available from here.

The first column is the rank; the second is the frequency, normalised per million characters. This means that if you read Internet texts, 的 will occur 38343 times per million characters and 汽 205 times (rank 877), while (on average) you have to read about 100 million characters on the Internet to come across 腙 (in modern Chinese it is used for naming chemical compounds, e.g., 安巴腙 Ambazone).
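The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the line layout assumed by `parse_line` (whitespace-separated rank, frequency, character) is a guess, since only the first two columns are described above.

```python
def per_million(raw_count, total_chars):
    """Normalise a raw character count to occurrences per million characters."""
    return raw_count / total_chars * 1_000_000


def expected_gap(freq_per_million):
    """Average number of characters read per occurrence of the character."""
    return 1_000_000 / freq_per_million


def parse_line(line):
    """Parse one line of the list.

    ASSUMPTION: columns are whitespace-separated and the character itself
    is in the third column; only rank and frequency are documented.
    """
    rank, freq, char = line.split()[:3]
    return int(rank), float(freq), char


if __name__ == "__main__":
    # Frequencies quoted in the text above.
    print(expected_gap(38343))  # 的: roughly one in every 26 characters
    print(expected_gap(205))    # 汽: roughly one in every 4878 characters
    # A character seen about once per 100 million characters has a
    # normalised frequency of about 0.01 per million.
    print(per_million(1, 100_000_000))
```

Read this way, a frequency of 205 per million means that 汽 turns up about once in every 4,900 characters of running text.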

The three corpora listed above are:

  1. Chinese Internet Corpus, 280 million words (tokens). This corpus was compiled by Serge Sharoff from the Internet in February 2005, along with other Internet corpora (for English, German and Russian).
  2. The Lancaster Corpus of Mandarin Chinese (LCMC), created by Richard Xiao and Tony McEnery.
  3. Chinese Business Corpus, 30 million words (tokens). This corpus was compiled by Serge Sharoff from the Internet in 2008, along with other business corpora (for English and Russian).
The interface to the corpora was developed by Serge Sharoff; contact me at s.sharoff@leeds.ac.uk if you have further queries.

If you use these corpora in your studies, please refer to:
Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In Marco Baroni and Silvia Bernardini, editors, WaCky! Working papers on the Web as Corpus. Gedit, Bologna.