Reference Genre Benchmark
Serge Sharoff with help from Marina Santini
It is difficult to agree on a definitive set of genre labels. Some studies put the number of genres at 2,000 [Görlach2004] or even 4,500 [Adamzik1995], which is too many for any practical purpose. What is more, the web is evolving: new genres and hybrids of existing genres appear all the time.
There have been attempts to create genre-annotated collections of webpages as listed in the following table:
More information about each collection is available from a syndicated list at http://www.webgenrewiki.org/index.php5/Genre_Collection_Repository.
The studies listed here use mutually incompatible genre labels: their number ranges from 7 to 70, and their typologies are based on different principles. However, automatic genre identification needs a substantial reference benchmark with a shared set of labels for training machine learning algorithms.
Here I attempted to create such a benchmark by mapping the set of genre labels from each corpus to a single set based on labels from [Sharoff2009]:
- information - catalogues, glossaries, as well as purely informative texts like CVs, homepages, specifications or encyclopedic factsheets;
- instruction - how-tos, FAQs, tutorials;
- propaganda - adverts, political pamphlets;
- recreation - fiction and popular lore (this also includes narrative biographies and memoirs);
- regulations - laws, small print, contracts;
- reporting - factual texts reporting on a state of affairs, like newswires (including sport) and police reports;
- discussion - all texts expressing positions and discussing a state of affairs; the three main subtypes are public (corresponding to public debates, like blogs or opinionated journalistic texts), academic (research papers, books), and communication (spontaneous electronic communication, like discussion forums or chat rooms);
- unknown - this was reserved for webpages with little or no running text, like forms for queries, logins, download pages, flash animation, samples of source code, etc; one important subcategory here is index, i.e., portals, sitemaps, other lists of links (mostly containing incomplete or isolated sentences).
Each label in this set corresponds to a generalised aim of text production, e.g., instruction is for texts aimed at teaching how to achieve something, while recreation covers texts written for leisure-time reading. If a text cannot be comfortably classified under one of the first six labels (information to reporting), it can safely be considered discussion, unless it is not designed for reading as a normal text, in which case it is unknown. For more information on the annotation principles see guidelines.pdf.
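The fallback rule above can be sketched in a few lines. This is only an illustration: the function name `resolve_label` and the flag `is_running_text` are hypothetical and are not part of the benchmark or its guidelines.

```python
from typing import Optional

# The eight target labels, in the order they are listed above.
TARGET_LABELS = (
    "information", "instruction", "propaganda", "recreation",
    "regulations", "reporting", "discussion", "unknown",
)


def resolve_label(candidate: Optional[str], is_running_text: bool) -> str:
    """Apply the fallback rule for choosing a target label.

    `candidate` is the annotator's choice, or None when the text fits none
    of the first six labels comfortably. `is_running_text` is a hypothetical
    flag standing for "designed for reading as a normal text".
    """
    if not is_running_text:
        return "unknown"      # forms, link lists, download pages, ...
    if candidate in TARGET_LABELS[:6]:
        return candidate      # one of information ... reporting
    return "discussion"       # default for everything else


print(resolve_label(None, is_running_text=True))
```

The call at the end covers the case the text describes: a readable page that fits none of the first six labels falls back to discussion.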
The second problem with the unification of diverse genre collections concerns the difference in their storage methods. Some collections include webpages with their respective stylesheets, images and JavaScript files, while others include only the html pages proper. Some collections store files in a hierarchy of directories, while others contain flat lists. We unified the storage methods to the lowest common denominator: html pages only, in a flat list. For the PDF pages from KRYS-I we created text versions using `pdftotext'.
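As an illustration of what flattening to html pages only involves, here is a minimal sketch. It is not the procedure used to build the released archives; in particular, the function name `flatten_html` and the scheme of joining directory names with `__` to avoid name clashes are assumptions made for this example.

```python
import shutil
from pathlib import Path


def flatten_html(src_root: Path, dest: Path) -> int:
    """Copy every .html/.htm file under src_root into the single directory dest.

    Each copy is named after its path relative to src_root, with the path
    components joined by '__', so that pages with the same file name in
    different directories do not overwrite each other. Stylesheets, images
    and scripts are skipped. Returns the number of pages copied.
    """
    dest.mkdir(parents=True, exist_ok=True)
    copied = 0
    # sorted() collects the full file list before any copying starts.
    for path in sorted(src_root.rglob("*")):
        if path.is_file() and path.suffix.lower() in {".html", ".htm"}:
            flat_name = "__".join(path.relative_to(src_root).parts)
            shutil.copyfile(path, dest / flat_name)
            copied += 1
    return copied
```

Under this naming assumption, a page stored at `news/2007/index.html` would end up as `news__2007__index.html` in the flat directory.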
For each collection there is a separate tab-separated file mapping its native categories (source) to one of the categories listed above (target). In some cases, a single label in a source collection covers webpages of several different genres, so that its target label is not unique. For example, `adult' in MGC covers lists of links, advertising, forms for accessing websites, legal disclaimers, instructions, etc. Documents with such labels have been discarded from the final mapping.
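Applying such a mapping can be sketched as follows. The layout assumed here (two tab-separated columns, source then target, no header row, and ambiguity shown by the same source label appearing with different targets) is a guess about the mapping files, and the function names are hypothetical.

```python
import csv
from typing import Dict, Iterable, List, Optional, Tuple


def load_mapping(path: str) -> Dict[str, Optional[str]]:
    """Read a source -> target mapping; ambiguous source labels map to None.

    Assumes two tab-separated columns per row (source, target), no header.
    """
    mapping: Dict[str, Optional[str]] = {}
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed rows
            source, target = row[0].strip(), row[1].strip()
            if source in mapping and mapping[source] != target:
                mapping[source] = None  # several targets: not unique
            else:
                mapping[source] = target
    return mapping


def relabel(docs: Iterable[Tuple[str, str]],
            mapping: Dict[str, Optional[str]]) -> List[Tuple[str, str]]:
    """Return (doc_id, target) pairs, dropping ambiguous or unmapped labels."""
    return [(doc_id, mapping[source])
            for doc_id, source in docs
            if mapping.get(source) is not None]
```

With this sketch, a document carrying a source label such as `adult' would be dropped by `relabel` whenever the mapping lists more than one target for that label.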
The tables are stored in tab-separated plain text files. The collections have been compressed with bzip2 or a combination of tar+bzip2 (use tar -xjf followed by the file name to uncompress tbz files).
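For readers who prefer to unpack the files from a script, the following sketch uses Python's standard `tarfile` and `bz2` modules. The helper names are hypothetical, and the UTF-8 default is an assumption, since the encoding of the text files is not stated here.

```python
import bz2
import os
import tarfile
from typing import List


def extract_tbz(archive_path: str, dest_dir: str) -> List[str]:
    """Unpack a tar+bzip2 archive, much like `tar -xjf archive_path -C dest_dir`.

    Returns the member names. Only use this on archives from a trusted source.
    """
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names


def read_bz2_text(path: str, encoding: str = "utf-8") -> str:
    """Read a bzip2-compressed text file without unpacking it to disk.

    The default encoding is an assumption; pass another one if needed.
    """
    with bz2.open(path, mode="rt", encoding=encoding) as handle:
        return handle.read()
```

`extract_tbz` would apply to the .tbz archives listed below, and `read_bz2_text` to the .txt.bz2 files.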
Corpora and source labels:
- HGC: hgc.tbz; hgc.csv
- I-EN: i-en.txt.bz2; i-en.csv
- KI-04: ki-04.tbz; ki-04.csv
- MGC: mgc.tbz; mgc.csv
- SANTINIS: santinis.tbz; santinis.csv
- KRYS-I: krys-i.txt.bz2; krys-i.csv
Mappings from source labels to target labels:
- HGC: hgc-map.csv
- I-EN: it was originally labelled with the above scheme
- KI-04: ki-04-map.csv
- MGC: mgc-map.csv
- SANTINIS: santinis-map.csv
- KRYS-I: krys-i-map.csv
References:
- [Adamzik1995] Adamzik, K. (1995). Textsorten - Texttypologie. Eine kommentierte Bibliographie. Nodus, Münster.
- [Berninger et al.2008] Berninger, V., Kim, Y., and Ross, S. (2008). Building a document genre corpus: a profile of the KRYS I corpus. In Proceedings of the Corpus Profiling Workshop, London.
- [Görlach2004] Görlach, M. (2004). Text types and the history of English. Walter de Gruyter.
- [Mehler et al.2009] Mehler, A., Sharoff, S., Rehm, G., and Santini, M., editors (2009). Genres on the Web: Computational Models and Empirical Studies. Springer, Berlin/New York.
- [Meyer zu Eissen and Stein2004] Meyer zu Eissen, S. and Stein, B. (2004). Genre classification of web pages. In Proceedings of the 27th German Conference on Artificial Intelligence, Ulm, Germany.
- [Santini2009] Santini, M. (2009). Cross-testing a genre classification model for the web. In [Mehler et al.2009].
- [Sharoff2009] Sharoff, S. (2009). In the garden and in the jungle. Comparing genres in the BNC and Internet. In [Mehler et al.2009].
- [Stubbe and Ringlstetter2007] Stubbe, A. and Ringlstetter, C. (2007). Recognizing genres. In Abstract Proceedings of the Colloquium "Towards a Reference Corpus of Web Genres".
- [Vidulin et al.2007] Vidulin, V., Luštrek, M., and Gams, M. (2007). Using genres to improve search engines. In Proc. Towards Genre-Enabled Search Engines: The Impact of NLP. RANLP-07.
Serge Sharoff
2009-09-14