The Internet corpora used here were developed using the same methodology as outlined in
Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In Marco Baroni and Silvia Bernardini (eds.), WaCky! Working Papers on the Web as Corpus. Gedit, Bologna. http://wackybook.sslmit.unibo.it/
1. Select about 500 words from a list of the most frequent word forms in your language. The selected words should be sufficiently general: they should not belong to a specific domain, nor should they be function words. For instance, picture, extent, raised, events are good query words for English. For German I experimented with lowercase word forms only (i.e. adjectives, adverbs and verbs), which also produce good results.
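The word-selection step can be sketched as follows (a minimal Python stand-in, not the author's script; the frequency-list filename, its "word count" line format, and the tiny function-word set are assumptions for illustration):

```python
# Pick ~n general query words from a frequency list, skipping the very top
# ranks (mostly function words) and anything in an explicit stopword set.
FUNCTION_WORDS = {"the", "of", "and", "to", "in", "a", "is", "that"}  # stub list

def select_query_words(freq_file, n=500, skip_top=100):
    """Read a 'word count' per-line frequency list; return n candidate words."""
    words = []
    with open(freq_file, encoding="utf-8") as fh:
        for rank, line in enumerate(fh, start=1):
            word = line.split()[0]
            if rank <= skip_top:  # very top ranks are dominated by function words
                continue
            if word.lower() in FUNCTION_WORDS or not word.isalpha():
                continue
            words.append(word)
            if len(words) >= n:
                break
    return words
```

In practice you would replace the stub stopword set with a proper function-word list for your language and inspect the output by hand, since the quality of the seed words drives the quality of the whole corpus.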
2. Produce a list of 5,000-6,000 queries, each consisting of four words (you may need more queries to get more links), using build_random_tuples.pl. If Google does not yet list the language for which you want to collect a corpus, add a couple of very frequent function words that are not used in cognate languages, e.g. має or її for Ukrainian. If you collect a corpus for a language with relatively few Internet pages, you may decrease the number of words in a query (however, this will also decrease the amount of connected text in the pages returned, so you will get more price lists, forms, catalogues, etc.). Collect the top 10 URLs produced by Google for each query using collect_urls_from_google.pl.
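The query-building step (done by build_random_tuples.pl in the author's Perl toolchain) amounts to drawing distinct random word combinations from the seed list; a minimal Python sketch:

```python
import random

def build_random_tuples(seed_words, n_queries=5000, tuple_size=4, seed=None):
    """Draw n_queries distinct random combinations of tuple_size seed words."""
    rng = random.Random(seed)
    queries = set()
    # Note: this loops forever if n_queries exceeds the number of possible
    # combinations, so keep the seed list comfortably larger than tuple_size.
    while len(queries) < n_queries:
        # sample without replacement within a query; the set rules out
        # duplicate queries (sorting makes word order irrelevant)
        queries.add(tuple(sorted(rng.sample(seed_words, tuple_size))))
    return [" ".join(q) for q in queries]
```

With ~500 seed words, the space of 4-word combinations is vast, so 5,000-6,000 distinct queries are generated almost instantly.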
3. Download the URLs produced by Google using print_pages_from_url_list.pl. The list of successfully downloaded URLs constitutes an open-source corpus.
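The download step can be sketched like this (a hypothetical Python stand-in for print_pages_from_url_list.pl; the fetch parameter is an assumption added so the sketch can be exercised without network access):

```python
from urllib.request import urlopen
from urllib.error import URLError

def download_pages(urls, fetch=None, timeout=10):
    """Fetch each URL; return the pages plus the list of URLs that succeeded.

    The list of good URLs matters as much as the pages: it is that list
    which defines the resulting corpus.
    """
    fetch = fetch or (lambda u: urlopen(u, timeout=timeout).read())
    pages, good_urls = {}, []
    for url in urls:
        try:
            pages[url] = fetch(url)
            good_urls.append(url)
        except (URLError, OSError):
            continue  # dead link: skip it, so it stays out of the URL list
    return pages, good_urls
```

A real crawl would also want politeness delays, a robots.txt check and retries, which are omitted here for brevity.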
URL lists: English (42,133), German (49,505), Russian (33,811), Chinese (30,148)
4. The set of downloaded files requires further post-processing: correction of encodings and conversion of all texts to Unicode (I used GNU Recode for all languages except Chinese, for which a better tool exists: http://www.mandarintools.com/javaconverter.html), filtering out duplicate pages, removing navigation frames, etc., followed by lemmatisation and part-of-speech tagging. Some of these steps are covered by my filtercorpus.pl.
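One of these post-processing steps, normalising page encodings to Unicode, can be sketched as follows (the author used GNU Recode; this Python stand-in assumes the candidate-encoding list below, which you would adjust per language):

```python
# Try candidate encodings in order of strictness. latin-1 accepts any byte
# sequence, so it must come last; it acts as the catch-all before giving up.
CANDIDATES = ["utf-8", "cp1251", "latin-1"]  # e.g. for a Russian crawl

def to_unicode(raw_bytes):
    """Decode downloaded bytes into a Unicode string, best candidate first."""
    for enc in CANDIDATES:
        try:
            return raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw_bytes.decode("utf-8", errors="replace")
```

A more robust pipeline would first trust the HTTP headers or an HTML meta charset declaration and fall back to guessing only when those are absent or wrong.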
5. Composition assessment: take a sample of about 200 texts from the corpus and describe them according to a text typology, as discussed in Sharoff (2006), cited above. I selected a random sample from the final URL list using getrandom.pl and coded it using Mick O'Donnell's Systemic Coder.
If you have another corpus for the same language, you can compare their frequency lists using the log-likelihood score (see Paul Rayson's log-likelihood calculator).
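The log-likelihood (G2) score computed by that calculator follows the standard formula with expected counts taken from the combined corpora; a minimal Python sketch:

```python
import math

def log_likelihood(freq1, freq2, size1, size2):
    """G2 for one word: freq1/freq2 are its counts in the two corpora,
    size1/size2 the corpus sizes (total tokens)."""
    # expected counts if the word were equally likely in both corpora
    e1 = size1 * (freq1 + freq2) / (size1 + size2)
    e2 = size2 * (freq1 + freq2) / (size1 + size2)
    ll = 0.0
    if freq1 > 0:
        ll += freq1 * math.log(freq1 / e1)
    if freq2 > 0:
        ll += freq2 * math.log(freq2 / e2)
    return 2 * ll
```

A score of 0 means identical relative frequencies; the larger the score, the more the word's frequency differs between the two corpora, so sorting the shared vocabulary by G2 surfaces the words most characteristic of one corpus or the other.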
Steps 2 and 3 above use customised versions of tools from Marco Baroni's BootCaT, which also comes with a very extensive description of installation requirements and tool functions; have a look at it.
The English CC corpus has been compiled from webpages with permissive Creative Commons licences. This corpus is less balanced than the main I-EN corpus (less professional news, more blogs and fanzines), but it can be redistributed without limitations.
The Perl scripts are free software. You can redistribute them and/or modify them under the same terms as Perl itself. The same applies to URL lists and other resources: you can freely use them in your research provided that you supply a link to this website: http://corpus.leeds.ac.uk/.
The interface and corpora were developed by Serge Sharoff; contact me at s.sharoff@leeds.ac.uk if you have further queries.