The query interface is powered by the IMS Corpus Workbench, but it has been extended to simplify processing of some frequent cases.
In its simplest mode, the web interface lets you enter a single search term (a lemma) and retrieve its uses. You can also search for a list of words (a disjunctive condition) by separating them with | signs, e.g. indignation|resentment searches for indignation OR resentment. You can search for words sharing a prefix by ending the search term with .*, e.g. indigna.* finds indignant and indignation. Take care when using the wildcard and word list options with more frequent words: the output may include many thousands of lines.
If you enter a sequence of words (separated by spaces), you will search for an exact phrase (without lemmatisation). To search for lemmas, add % to the end of each term. For instance, the query set% in will find set in, sets in, setting in, Set in, etc., while set in will find only set in (the search is also case-sensitive, so even Set in does not fit the pattern). Note that there is no point in searching for sets% in, because there is no lemma sets. Because of corpus tokenisation, all punctuation marks have to be separated from words by a space, e.g. finally , to search for finally followed by a comma. Remember that in the CQP syntax the dot (.) is a metacharacter that matches any single character in the target string. To search for a literal dot, you have to escape it with a backslash: finally \. finds finally at the end of a sentence.
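The escaping rule can be illustrated with a minimal Python sketch. The function names and the list of metacharacters are assumptions made for this example (the usual regular-expression set), not part of the interface; the sketch only shows how a phrase that should match literally could be prepared before it is typed into the query box.

```python
# Characters assumed to have a special meaning in CQP regular expressions.
# This list is an assumption for illustration, not taken from the CQP Manual.
METACHARACTERS = set(".*+?|()[]{}^$\\")


def escape_literal(token: str) -> str:
    """Put a backslash in front of every metacharacter in a single token."""
    return "".join("\\" + ch if ch in METACHARACTERS else ch for ch in token)


def escape_phrase(phrase: str) -> str:
    """Escape each space-separated token of a phrase meant to match literally."""
    return " ".join(escape_literal(tok) for tok in phrase.split())


print(escape_phrase("finally ."))
print(escape_phrase("finally ,"))
```

Do not apply such escaping to terms in which the metacharacters are intended, such as indigna.* or indignation|resentment.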
To specify a gap between words, you can use two dots (..). For example, make% .. up will also find examples like made it up. To find more examples like made it up, you can use a POS tag, e.g. make% .. /N.*|PP up
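To make the relation between the simplified notation and the full CQP syntax more concrete, here is a rough Python sketch of one possible translation. It is not the code used by the interface: the function name to_cqp, the attribute names word, lemma and pos, and in particular the maximum gap length max_gap are assumptions made for illustration only.

```python
def to_cqp(query: str, max_gap: int = 3) -> str:
    """Translate a simplified query into a CQP-style token sequence.

    Assumed rules (hypothetical, modelled on the description above):
      term%  -> lemma condition
      term   -> exact word form (in a multi-word query)
      ..     -> optional gap of up to max_gap tokens
      /TAG   -> part-of-speech condition
    A query consisting of a single term is treated as a lemma search.
    """
    tokens = query.split()
    if len(tokens) == 1 and tokens[0] != ".." and not tokens[0].startswith("/"):
        term = tokens[0][:-1] if tokens[0].endswith("%") else tokens[0]
        return f'[lemma="{term}"]'

    parts = []
    for tok in tokens:
        if tok == "..":
            parts.append(f"[]{{0,{max_gap}}}")      # any tokens, zero to max_gap
        elif tok.startswith("/") and len(tok) > 1:
            parts.append(f'[pos="{tok[1:]}"]')      # POS tag pattern
        elif tok.endswith("%") and len(tok) > 1:
            parts.append(f'[lemma="{tok[:-1]}"]')   # lemma search
        else:
            parts.append(f'[word="{tok}"]')         # exact word form
    return " ".join(parts)


print(to_cqp("set% in"))
print(to_cqp("make% .. /N.*|PP up"))
```

Under these assumptions, set% in becomes [lemma="set"] [word="in"]. The real gap length and attribute names used by the interface may differ.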
Lemmatisation in Russian also maps closely related aspectual pairs to a single lemma. The corpus has no lemma повернуть, because its lemma is поворачивать. Another by-product of lemmatisation is that all lemmas in the English and Russian corpora are in lower case (you will search for moscow and москва).
Warning In the German corpus, capital letters in lemmas are important, because they can help in disambiguation: Denken vs. denken. Therefore lemmatisation in German does little case folding.
You can also use the full CQP syntax option to deal with case-independent search:
[lemma="denken"%c]
Another feature of German lemmatisation (caused by peculiarities of the TreeTagger) is that forms combining prepositions with articles are not decomposed: zum is a lemma, while all forms of the definite article have the lemma d. For instance, if you are looking for examples of the expression Tod an den Hals wünschen, the lemma-based query should be Tod an d Hals wünschen. You can also search for "Tod an den Hals" wünschen (i.e. treat the quoted search terms as exact word forms).
These possibilities are provided on top of the Corpus Query Processor. Refer to the CQP Manual for the full description of the query language, for instance, if you want to specify conditions on word order or restrict morphological properties of words in your query. The values of morphological properties for English and Russian words can be selected by clicking on the respective links and choosing values from the popup windows. Morphological codes will be added to the query string, but you have to ensure that your query conforms to the CQP syntax. The interface simply adds strings like
[pos="NNS"] [pos="A.*,род.*"] to the query string, so if you need to combine POS and lemma conditions, you have to do it manually:
[lemma="prompt" & pos="JJ"] [lemma="стакан" & pos=".*род.*"]
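If you assemble such queries in a script, the combination of conditions can be sketched as follows. The helper token_condition is hypothetical and written for this example only; it joins attribute conditions with & inside one pair of square brackets, as in the manual queries above, and does not escape quotation marks inside values.

```python
def token_condition(**attrs: str) -> str:
    """Build one CQP token description from attribute/value pairs.

    Empty values are skipped; with no conditions the result is [],
    which matches any token.
    """
    conditions = [f'{name}="{value}"' for name, value in attrs.items() if value]
    return "[" + " & ".join(conditions) + "]"


# A plural noun, followed by the lemma стакан in the genitive.
query = " ".join([
    token_condition(pos="NNS"),
    token_condition(lemma="стакан", pos=".*род.*"),
])
print(query)
```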
Warning Because of limitations of the Corpus Workbench, if your query starts with a frequent word (such as a particle or a preposition), its processing will take A LOT OF time and you will frequently get no result at all (because your browser will time out). You can remove the first word of a complex query if it is frequent and the context can be uniquely identified without it. For instance, always use Tod an "den" Hals wünschen, but not den Tod an "den" Hals wünschen, because the Corpus Workbench will not return any result for the latter.
Morphological features in the Russian fiction corpus have been manually corrected, so morphological codes in it are reliable. The Russian news corpora and the Internet corpus have been processed automatically by mystem, so they contain many ambiguities; for instance, книги is always analysed as sing,gen; adverbs derived from adjectives (трудно) are always analysed as adjectival forms (трудный). Thus, morphological codes in the other Russian corpora cannot be relied on.
If you want to query the Russian corpus without a Cyrillic keyboard, you can use transliteration according to the following table:
'а' => 'a', 'б' => 'b', 'в' => 'v', 'г' => 'g', 'д' => 'd', 'е' => 'e', 'ж' => 'zh', 'з' => 'z', 'и' => 'i', 'й' => 'j', 'к' => 'k', 'л' => 'l', 'м' => 'm', 'н' => 'n', 'о' => 'o', 'п' => 'p', 'р' => 'r', 'с' => 's', 'т' => 't', 'у' => 'u', 'ф' => 'f', 'х' => 'x', 'ц' => 'c', 'ч' => 'ch', 'ш' => 'sh', 'щ' => 'w', 'ъ' => 'qh', 'ы' => 'y', 'ь' => 'q', 'э' => 'eh', 'ю' => 'ju', 'я' => 'ja',
(the same transliteration scheme as in the Tübingen interface to the Uppsala corpus)
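The table can be applied mechanically. The following Python sketch converts a transliterated query into Cyrillic by always taking the longest matching Latin sequence first (so that zh is read as ж and not as з followed by another letter). The function name to_cyrillic is an assumption for this example; the interface performs its own conversion, which may differ in detail.

```python
# Transliteration table from above (Cyrillic -> Latin).
TRANSLIT = {
    'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd', 'е': 'e', 'ж': 'zh',
    'з': 'z', 'и': 'i', 'й': 'j', 'к': 'k', 'л': 'l', 'м': 'm', 'н': 'n',
    'о': 'o', 'п': 'p', 'р': 'r', 'с': 's', 'т': 't', 'у': 'u', 'ф': 'f',
    'х': 'x', 'ц': 'c', 'ч': 'ch', 'ш': 'sh', 'щ': 'w', 'ъ': 'qh', 'ы': 'y',
    'ь': 'q', 'э': 'eh', 'ю': 'ju', 'я': 'ja',
}

# Inverted table (Latin -> Cyrillic) and the length of the longest Latin key.
LATIN_TO_CYRILLIC = {latin: cyr for cyr, latin in TRANSLIT.items()}
MAX_KEY_LEN = max(len(key) for key in LATIN_TO_CYRILLIC)


def to_cyrillic(text: str) -> str:
    """Convert transliterated text to Cyrillic using greedy longest match.

    Characters that are not in the table (spaces, |, ., * and so on)
    are copied unchanged.
    """
    out = []
    i = 0
    while i < len(text):
        for size in range(MAX_KEY_LEN, 0, -1):
            chunk = text[i:i + size]
            if chunk in LATIN_TO_CYRILLIC:
                out.append(LATIN_TO_CYRILLIC[chunk])
                i += len(chunk)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(to_cyrillic("moskva"))
print(to_cyrillic("povorachivatq"))
```

With greedy matching, the sequences ja and ju are always read as я and ю, so the rare combinations й + а and й + у cannot be expressed in this sketch.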
You can sort the output of your query by its left or right context. Note that the match itself and punctuation marks are treated as parts of the context, so the output of a query for go, sorted by the right context, looks as follows (first go, then going, then went):
. Help provided by ACET volunteers eventually | go | the flat ship-shape again and life became easier . Illnesses |
're worried that you 've recently take a risk , | go | to the special STD clinic at your local hospital . |
Peter , " and the lengths to which ACET staff | go | to try and meet the needs of clients . I |
. Over the next decade a global approach is | going | to be essential . " A00CA002 Superintendent Trobridge of Ealing |
call the office that you can report how the visit | went | . We also hold regular meetings of volunteers to discuss |
. After a short interview with the BBC , Cliff | went | to meet ACET client Tony Chapman at his home , |
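The ordering shown above can be reproduced with a small Python sketch in which the sort key is the match followed by the right context. The representation of a concordance line as a (left, match, right) tuple and the plain string comparison are assumptions made for this example; the interface may compare contexts differently.

```python
from typing import List, Tuple

# A concordance line: (left context, match, right context).
Line = Tuple[str, str, str]


def sort_by_right_context(lines: List[Line]) -> List[Line]:
    """Sort lines by the match plus the right context, compared as plain strings."""
    return sorted(lines, key=lambda line: f"{line[1]} {line[2]}")


# Shortened lines from the example above, in arbitrary order.
lines = [
    ("how the visit", "went", ". We also hold regular meetings"),
    ("a global approach is", "going", "to be essential ."),
    ("to which ACET staff", "go", "to try and meet the needs"),
    ("the BBC , Cliff", "went", "to meet ACET client Tony Chapman"),
    ("recently take a risk ,", "go", "to the special STD clinic"),
]

for left, match, right in sort_by_right_context(lines):
    print(f"{left} | {match} | {right}")
```

Because the match is part of the key, all lines with go precede those with going and went, and the full stop after went sorts before any word.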
The output ends with a set of basic translation equivalents for the words in the query. The facility is based on the English-Russian dictionary developed by Multitran; thanks are due to the Multitran development team, especially Andrei Pominov. Every word in the list of translation equivalents links to a new query.
The option Collocation statistics allows you to calculate the most significant collocates (using log-likelihood, mutual information or t-score) for the left or right neighbour of your query; for the definitions see Chapter 5, "Collocations", in Christopher Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.
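As a rough illustration of the three measures, the following Python sketch computes them from four counts: the frequency of the pair, the frequencies of the query word and of the collocate, and the corpus size. These are the common textbook formulations (log-likelihood in its 2x2 contingency-table form); the function names are invented for this example and the exact formulas used by the interface may differ. The counts in the usage lines are illustrative only.

```python
import math


def mutual_information(f_pair: int, f_node: int, f_coll: int, n: int) -> float:
    """Pointwise mutual information: log2 of observed over expected frequency."""
    return math.log2(f_pair * n / (f_node * f_coll))


def t_score(f_pair: int, f_node: int, f_coll: int, n: int) -> float:
    """t-score: (observed - expected) / sqrt(observed)."""
    expected = f_node * f_coll / n
    return (f_pair - expected) / math.sqrt(f_pair)


def log_likelihood(f_pair: int, f_node: int, f_coll: int, n: int) -> float:
    """Log-likelihood ratio over the 2x2 table of node and collocate counts."""
    observed = [
        f_pair,                        # node with collocate
        f_node - f_pair,               # node without collocate
        f_coll - f_pair,               # collocate without node
        n - f_node - f_coll + f_pair,  # neither
    ]
    rows = [f_node, n - f_node]
    cols = [f_coll, n - f_coll]
    expected = [rows[i] * cols[j] / n for i in range(2) for j in range(2)]
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


# Illustrative counts: pair seen 50 times, each word 100 times, 1000 tokens.
print(mutual_information(50, 100, 100, 1000))
print(t_score(50, 100, 100, 1000))
print(log_likelihood(50, 100, 100, 1000))
```

All three scores are close to zero when the pair occurs about as often as expected by chance and grow as the pair becomes more frequent than expected. Mutual information tends to favour rare pairs, while t-score and log-likelihood give more weight to frequent ones.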
The option Word similarity search allows you to find semantic classes for some lines in the output of your query. This is based on Reinhard Rapp's procedure using Singular Value Decomposition (SVD).