However, we know little about the domains and genres of texts in corpora collected in this way. This webpage lists resources that make I-EN, I-RU (a web corpus of Russian) and ukWac somewhat more comparable to the BNC in this respect.
Below I report two ways of approaching the question of genre classification. One involves a traditional typology of genre labels as applicable to the Web, for example, texts aimed at instructing, reporting or entertaining the reader. The other involves designing a topology that assesses how similar an individual text is to a prototypical webpage of each kind. For example, a typical news item is aimed at reporting, but some news items also aim at entertaining, so such texts are positioned between reporting and entertaining texts.
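The topological idea can be illustrated with a minimal sketch. It assumes that each text has a score between 0 and 1 on each dimension and that closeness to a prototype is measured by cosine similarity. The dimension names, the scores and the `similarities` helper are all hypothetical and serve only as an illustration; they are not the annotation scheme or the method of the paper cited below.

```python
import math

# Hypothetical prototypes: a "pure" reporting text and a "pure" entertaining text.
PROTOTYPES = {
    "reporting": {"reporting": 1.0, "entertaining": 0.0},
    "entertaining": {"reporting": 0.0, "entertaining": 1.0},
}


def cosine(a, b):
    """Cosine similarity between two sparse score vectors stored as dicts."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarities(text_scores):
    """Similarity of one text to every prototype."""
    return {name: cosine(text_scores, proto) for name, proto in PROTOTYPES.items()}


# Invented scores: a plain news item and a news item that also entertains.
plain_news = {"reporting": 1.0, "entertaining": 0.0}
light_news = {"reporting": 0.8, "entertaining": 0.6}

print(similarities(plain_news))
print(similarities(light_news))
```

In this toy setting the second text is similar to both prototypes at once, so it sits between them instead of being forced into a single class.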
Serge Sharoff (2018). Functional Text Dimensions for annotation of Web corpora. Forthcoming in Corpora, 31:2. PDF
The resources consist of multi-annotated webpages for Russian and English (along with translations of some pages into Chinese, French and German) as described in the following table:
|5g, part 2||en,fr,ru,zh||133||505468||5g-p2.tgz|
Serge Sharoff. In the garden and in the jungle: comparing genres in the BNC and Internet. In Genres on the Web, Mehler, A., Sharoff, S., Santini, M. (editors), Springer 2010. PDF
According to this approach, the texts in I-EN, I-RU and ukWac have been automatically classified using the following classes:
The accuracy of this classification is about 73-84% (see the paper above for details), so there is roughly a one-in-six to one-in-four chance that a link is not of the correct type. Let me know if you have ideas on how to improve the accuracy.
The accuracy of this classification has not been validated. Presumably it is quite low (especially for the 70-genre classification from the BNC). As a quick check, I looked at the genre distribution for 8310 pages from The Guardian website. The Guardian is a newspaper, so its pages should be classified as 'press' according to the Brown Corpus (this corresponds to 'reporting' in the classification used above), although the genre of feature articles, biographies and reviews can differ from what is assumed by 'press' in the Brown Corpus:
The following is the distribution of genres assigned to the same set of 8310 pages according to the BNC-trained classifier (only the 10 most frequent labels are listed):
Not all items are treated as coming from newspapers, but many of them are (in the BNC genre scheme, brdsht_nat means 'national broadsheets' and newsp_other means either regional or tabloid). Webpages automatically classified as any form of W_newsp account for 41% of The Guardian subcorpus in ukWac.
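A share of this kind can be computed from per-label page counts by summing every label that starts with a common prefix. The sketch below assumes the classifier output has already been reduced to a mapping from label to page count. The `share_with_prefix` helper is hypothetical and the counts are invented toy values, not the actual distribution for The Guardian pages.

```python
from collections import Counter


def share_with_prefix(label_counts, prefix):
    """Fraction of pages whose assigned label starts with the given prefix."""
    total = sum(label_counts.values())
    if total == 0:
        return 0.0
    matched = sum(n for label, n in label_counts.items() if label.startswith(prefix))
    return matched / total


# Invented toy counts, for illustration only (not the real figures).
toy_counts = Counter({
    "W_newsp_brdsht_nat_report": 5,
    "W_newsp_other_report": 3,
    "W_pop_lore": 8,
    "W_misc": 4,
})

print(f"{share_with_prefix(toy_counts, 'W_newsp'):.0%}")
```

Matching on the prefix groups national broadsheets together with regional and tabloid papers, which is how "all forms of W_newsp" is read here.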
The resources listed on this page have been developed by Serge Sharoff (Centre for Translation Studies, University of Leeds). Get in touch with me if you have any comments or suggestions.
Note: for files from the 'Genres on the Web' colloquium (2007), see the original colloquium page
Note: for the description of a Google Research Award project, see the project webpage