The query interface is powered by the IMS Corpus Workbench (CWB), but the syntax of its Corpus Query Processor (CQP) has been extended to simplify processing of some frequent cases.

In its simplest mode, the web interface lets you enter a search term and find its uses. You can also search for a list of words (a disjunctive condition) by separating them with | signs, e.g. 怨|愤 searches for yuàn OR fèn.

Note that CWB works with words, not strings, so our Chinese corpora have been tokenised by the NEUCSP tool. When you search for a construction, insert a space between words: for example, 说汉语 will find nothing, whereas there are many instances of 说 汉语 (with a space). Full names have been tokenised as a single word, e.g. 邓小平 (corresponding to Deng Xiaoping). Negation is included in the verb: 他 不会 说 汉语, etc. Pay attention to how words are split in the concordance lines.
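As an illustration of why the space matters, here is a minimal Python sketch of how a space-separated query could be mapped onto a sequence of CQP token expressions. The helper name to_cqp is hypothetical, and the mapping is an assumption about what such a translation might look like, not a description of what the interface actually does internally; only the [word="..."] notation itself is standard CQP.

```python
def to_cqp(simple_query):
    """Turn a space-separated query into a sequence of CQP token expressions.

    Hypothetical helper: each space-separated item becomes one token
    expression, so the spaces decide where token boundaries fall.
    Alternatives joined with | stay inside a single regular expression.
    """
    tokens = simple_query.split()
    return " ".join('[word="{}"]'.format(tok) for tok in tokens)


# Two tokens: matches the tokenised corpus text 说 汉语
print(to_cqp("说 汉语"))
# One token: only matches if the tokeniser kept 说汉语 as a single word
print(to_cqp("说汉语"))
```

Without the space, the whole string is treated as one token, which is why 说汉语 finds nothing in a corpus where the two words were split.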

Because the corpus was tokenised fully automatically by NEUCSP, some words are split in a less than optimal way. For instance, in 台湾正从一个开发中国家迈向已开发国家的行列 the sequence 中国家 is split as 中国 家. Similarly, 女性 is split into 女 and 性 (thanks to Flemming Christiansen for pointing out the problem). If you can't find a word you're looking for, try shorter word forms or wildcards.

Remember that in CQP syntax the dot (.) is a metacharacter that matches any single character in the target string. If you really want to search for occurrences of the dot itself (it is quite frequently used in Internet texts instead of the ideographic full stop), you have to escape it with a backslash in your query.
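If you build queries programmatically, the escaping can be automated. The following Python sketch prefixes metacharacters with a backslash; the function name escape_literal and the exact set of characters treated as special are assumptions made for illustration (the CQP Manual is the authority on which characters need escaping).

```python
# Characters treated as regular-expression operators in this sketch
# (an assumed set; check the CQP Manual for the definitive list).
CQP_META = set(".?*+|()[]{}^$\\")


def escape_literal(text):
    """Prefix every metacharacter with a backslash so it is matched literally."""
    return "".join("\\" + ch if ch in CQP_META else ch for ch in text)


# The dot is escaped; Chinese characters pass through unchanged.
print(escape_literal("3.14"))
print(escape_literal("汉语"))
```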

These possibilities are provided on top of the Corpus Query Processor. Refer to the CQP Manual for the full description of the query language, for instance if you want to specify conditions on word order. The values of POS tags can be selected by clicking on 'Chinese POS tags'. The morphological codes will be added to the query string, but you must make sure that the resulting query conforms to the CQP syntax.

Warning: Because of limitations of the Corpus Workbench, if your query starts with a frequent word (such as a particle or a preposition), processing will take a lot of time and you will frequently get no result at all (because your browser's time-out will expire). If the first word of a complex query is frequent and the context can still be uniquely identified without it, you can safely remove it. Another warning: information on the encoding used in webpages (GB2312, Unicode, etc.) can be missing or wrong, which affects the presentation of some characters in the interface.

The option Collocation statistics allows you to calculate the most significant collocates (LL score, MI score, T score) for the left or right neighbour of your query. For the definitions, see Chapter 5 "Collocations" in Christopher Manning, Hinrich Schütze, Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press. 1999.
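To give a feel for what these scores measure, here is a minimal Python sketch using common textbook forms of the three statistics, computed from raw frequency counts. It is not the interface's own code: the function name, the parameter names and the choice of a 2x2 contingency table for the log-likelihood score are assumptions, and the exact formulas used by the interface may differ in detail.

```python
import math


def collocation_scores(f_node, f_coll, f_pair, n):
    """Score a (node, collocate) pair from frequency counts.

    f_node: corpus frequency of the query word
    f_coll: corpus frequency of the candidate collocate
    f_pair: how often the collocate occurs next to the query word
    n:      corpus size in tokens
    """
    if f_pair <= 0:
        raise ValueError("the pair must co-occur at least once")

    # Co-occurrence frequency expected if the two words were independent
    expected = f_node * f_coll / n

    # MI: log ratio of observed to expected co-occurrence
    mi = math.log2(f_pair / expected)

    # T score: observed minus expected, scaled by sqrt(observed)
    t = (f_pair - expected) / math.sqrt(f_pair)

    # LL: log-likelihood ratio over the 2x2 contingency table
    observed = [
        f_pair, f_node - f_pair,
        f_coll - f_pair, n - f_node - f_coll + f_pair,
    ]
    rows = [f_node, n - f_node]
    cols = [f_coll, n - f_coll]
    expected_cells = [rows[i] * cols[j] / n for i in range(2) for j in range(2)]
    ll = 2 * sum(
        o * math.log(o / e)
        for o, e in zip(observed, expected_cells)
        if o > 0
    )
    return {"MI": mi, "T": t, "LL": ll}


# Made-up counts, purely for illustration
print(collocation_scores(f_node=100, f_coll=100, f_pair=50, n=10000))
```

When the observed co-occurrence frequency equals the expected one, all three scores are zero; the more the pair co-occurs beyond chance, the higher they become.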

If you have further queries, contact Serge Sharoff.
