Note: for the description of a Google Research Award project, see the project webpage
The jungle metaphor is quite common in genre studies. The subtitle of David Lee's seminal paper on genre classification is 'navigating a path through the BNC jungle'. According to Adam Kilgarriff, the BNC is a jungle only when compared to smaller Brown-type corpora, while it looks more like an English garden when compared to the Web. A corpus collected from the web can easily surpass the BNC in size: I-EN contains 160 million words, and ukWac (http://wacky.sslmit.unibo.it/) contains 2 billion words.
However, we know little about the domains and genres of texts in corpora collected in this way. This webpage lists resources that make I-EN, I-RU (a web corpus of Russian) and ukWac a bit more similar to the BNC. The procedure is described in the current draft of my paper:
Serge Sharoff, In the garden and in the jungle: comparing genres in the BNC and Internet. Submitted to Genres on the Web, to be published by Springer in 2008, PDF file
Texts in I-EN, I-RU and ukWac were automatically classified into the following classes:
The accuracy of this classification is about 73-84% (see the paper for argumentation), so there is, at worst, roughly one chance in four that a link is not of the correct type. Let me know if you have ideas on how to improve the accuracy.
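To make the accuracy range quoted above more concrete, here is a minimal sketch that converts it into the expected number of wrongly labelled links in a sample. The function name and the sample size of 100 links are assumptions chosen for illustration; only the 73% and 84% figures come from this page.

```python
def expected_wrong(n_links, accuracy):
    """Expected number of links with a wrong genre label (hypothetical helper)."""
    return round(n_links * (1.0 - accuracy), 1)

# Accuracy bounds quoted on this page: 73% (worst case) and 84% (best case).
best_case = expected_wrong(100, 0.84)
worst_case = expected_wrong(100, 0.73)

print(f"Out of 100 links, expect between {best_case} and {worst_case} wrong labels")
```

At the lower bound this is a little over one link in four, and at the upper bound about one link in six.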
The accuracy of this classification has not been validated. Presumably it is quite low (especially for the 70-genre classification from the BNC). As a quick check, the genre distribution was computed for 8310 pages from The Guardian website. The Guardian is a newspaper, so its pages should be classified as 'press' according to the Brown Corpus categories ('press' corresponds to 'reporting' in the classification used above). However, the genre of feature articles, biographies and reviews can differ from what the Brown Corpus assumes under 'press':
| 10.01% | fiction |
| 29.07% | misc |
| 16.68% | nonfiction |
| 44.24% | press |
The following is the distribution of genres assigned to the same set of 8310 pages by the BNC-trained classifier (only the 11 most frequent labels are listed):
| 3.14% | W_newsp_other_social |
| 3.21% | W_newsp_brdsht_nat_editorial |
| 3.29% | S_speech_unscripted |
| 3.35% | W_newsp_brdsht_nat_commerce |
| 3.61% | W_newsp_brdsht_nat_sports |
| 4.16% | W_fict_prose |
| 5.57% | W_pop_lore |
| 5.93% | W_newsp_brdsht_nat_arts |
| 6.45% | W_biography |
| 8.19% | W_newsp_brdsht_nat_misc |
| 11.01% | W_misc |
Not all pages are treated as coming from newspapers, but many of them are (brdsht_nat means 'national broadsheets'; newsp_other means either regional or tabloid newspapers). Pages assigned to any of the W_newsp labels account for 41% of The Guardian subcorpus.
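The share of newspaper labels can be cross-checked against the list above with a short sketch. The percentages are copied from the list of most frequent labels; the dictionary and function names are hypothetical and are not part of the released resources.

```python
# Shares (in percent) copied from the list of most frequent BNC labels above.
listed_shares = {
    "W_newsp_other_social": 3.14,
    "W_newsp_brdsht_nat_editorial": 3.21,
    "S_speech_unscripted": 3.29,
    "W_newsp_brdsht_nat_commerce": 3.35,
    "W_newsp_brdsht_nat_sports": 3.61,
    "W_fict_prose": 4.16,
    "W_pop_lore": 5.57,
    "W_newsp_brdsht_nat_arts": 5.93,
    "W_biography": 6.45,
    "W_newsp_brdsht_nat_misc": 8.19,
    "W_misc": 11.01,
}


def share_with_prefix(shares, prefix):
    """Sum the percentage shares of all labels starting with `prefix`."""
    total = sum(value for label, value in shares.items() if label.startswith(prefix))
    return round(total, 2)


newsp_listed = share_with_prefix(listed_shares, "W_newsp")
all_listed = share_with_prefix(listed_shares, "")

print(f"W_newsp labels in the list: {newsp_listed}%")
print(f"All labels in the list:     {all_listed}%")
```

The W_newsp labels shown in the list add up to about 27% of the pages, so the remainder of the 41% presumably comes from less frequent W_newsp labels that are not listed here.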
The resources have been developed by Serge Sharoff (Centre for Translation Studies, University of Leeds). Get in touch with me if you have any suggestions.