Reference Genre Benchmark

Serge Sharoff with help from Marina Santini

Genre labels

It is difficult to agree on a definitive set of genre labels. Some studies put the number of genres at 2,000 [Görlach2004] or even 4,500 [Adamzik1995], which is too many for any practical purpose. What is more, the web is evolving: new genres and hybrids of existing genres appear all the time.

There have been attempts to create genre-annotated collections of webpages as listed in the following table:

Collection                                # pages   # genres
HGC [Stubbe and Ringlstetter2007]            1280         32
I-EN [Sharoff2009]                            250          7
KI-04 [Meyer zu Eissen and Stein2004]        1205          8
MGC [Vidulin et al.2007]                     1239         20
SANTINIS [Santini2009]                       1400         11
KRYS I [Berninger et al.2008]                5305         70

More information about each collection is available from a syndicated list at http://www.webgenrewiki.org/index.php5/Genre_Collection_Repository.

The studies listed here use mutually incompatible genre labels: their number ranges from 7 to 70, and their typologies are based on different principles. However, for automatic genre identification we need a substantial reference benchmark that uses a shared set of labels to train machine learning algorithms.

Here I attempted to create such a benchmark by mapping the set of genre labels from each corpus to a single set based on labels from [Sharoff2009]:

  1. information - catalogues, glossaries, as well as purely informative texts like CVs, homepages, specifications or encyclopedic factsheets;
  2. instruction - how-tos, FAQs, tutorials;
  3. propaganda - adverts, political pamphlets;
  4. recreation - fiction and popular lore (this also includes narrative biographies and memoirs);
  5. regulations - laws, small print, contracts;
  6. reporting - factual texts reporting on a state of affairs, like newswires (including sport) and police reports;
  7. discussion - all texts expressing positions and discussing a state of affairs; the three main subtypes are public (corresponding to public debates, like blogs or opinionated journalistic texts), academic (research papers, books), and communication (spontaneous electronic communication, like discussion forums or chat rooms);
  8. unknown - this was reserved for webpages with little or no running text, like forms for queries, logins, download pages, flash animation, samples of source code, etc; one important subcategory here is index, i.e., portals, sitemaps, other lists of links (mostly containing incomplete or isolated sentences).

Each label in this set corresponds to a generalised aim of text production, e.g., instruction is for texts aimed at teaching how to achieve something, while recreation is for texts written for leisure-time reading. If a text cannot be comfortably classified under labels 1-6, it can safely be considered discussion, unless it is not designed to be read as normal text, in which case it is labelled unknown. For more information on the annotation principles, see guidelines.pdf.

The second problem with the unification of diverse genre collections concerns the difference in their storage methods. Some collections include webpages with their respective stylesheets, images and JavaScript files, while others include only the HTML pages proper. Some collections store files in a hierarchy of directories, while others contain flat lists. We unified the storage methods to the lowest common denominator: HTML pages only, in a flat list. For the PDF pages from KRYS-I we created their text versions using `pdftotext'.
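The flattening step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure actually used for the benchmark: the function name flatten_html, the choice of .html/.htm as the extensions to keep, and the scheme of joining path components with underscores to avoid name clashes are all hypothetical.

```python
import shutil
from pathlib import Path


def flatten_html(src_root, dest_dir):
    """Copy every HTML page under src_root into one flat directory.

    Assumptions (not from the benchmark itself): pages are recognised by
    a .html/.htm extension, and the flat file name is built by joining
    the components of the relative path with underscores.
    dest_dir should lie outside src_root.
    """
    src_root, dest_dir = Path(src_root), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src_root.rglob("*")):
        # Stylesheets, images and scripts are skipped: HTML pages only.
        if path.is_file() and path.suffix.lower() in {".html", ".htm"}:
            flat_name = "_".join(path.relative_to(src_root).parts)
            shutil.copyfile(path, dest_dir / flat_name)
            copied.append(flat_name)
    return copied
```

Encoding the original directory path in the file name keeps pages with the same base name apart; a collection that already uses unique file names would not need this.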

For each collection there is another tab-separated file mapping its native categories (source) to one of the categories listed above (target). In some cases, a single label in a source collection covers webpages of several different genres, so that its target label is not unique. For example, `adult' in MGC covers lists of links, advertising, forms for accessing websites, legal disclaimers, instructions, etc. Documents with such labels have been discarded from the final mapping.
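A minimal Python sketch of how such a mapping might be applied is given below. The column order (source label first, target label second), the way ambiguous labels are represented (either absent from the mapping file or listed with more than one target), and the function names load_mapping and relabel are assumptions made for illustration; the actual files may be organised differently.

```python
import csv


def load_mapping(handle):
    """Read tab-separated source/target rows from an open file.

    Only source labels with exactly one target are kept; a source label
    listed with several targets is treated as ambiguous and dropped.
    The column order is an assumption, not a documented format.
    """
    targets = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        source, target = row[0].strip(), row[1].strip()
        targets.setdefault(source, set()).add(target)
    return {s: next(iter(t)) for s, t in targets.items() if len(t) == 1}


def relabel(documents, mapping):
    """Yield (document, target label) pairs for an iterable of
    (document, source label) pairs, discarding any document whose
    source label has no unique target."""
    for doc_id, source in documents:
        if source in mapping:
            yield doc_id, mapping[source]
```

With a mapping file opened as `open("mgc-map.csv", newline="", encoding="utf-8")`, documents carrying a label such as `adult' would simply not appear in the output of relabel, which mirrors the decision to discard them.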

Genre mappings

The tables are stored in tab-separated plain text files. The collections have been compressed with bzip2 or a combination of tar and bzip2 (use `tar -xjf filename' to uncompress tbz files).
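For readers who prefer to unpack the files programmatically, the following Python sketch uses only the standard tarfile and bz2 modules. The function names are hypothetical, and the UTF-8 default is an assumption, since the encoding of the collections is not stated here.

```python
import bz2
import tarfile


def unpack_collection(path, dest_dir):
    """Extract a tar+bzip2 archive (.tbz) into dest_dir.

    Equivalent in effect to running tar -xjf on the file; returns the
    names of the archive members.
    """
    with tarfile.open(path, mode="r:bz2") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names


def read_text_collection(path, encoding="utf-8"):
    """Return the decompressed text of a .txt.bz2 file.

    The encoding is an assumption; undecodable bytes are replaced
    rather than raising an error.
    """
    with bz2.open(path, mode="rt", encoding=encoding, errors="replace") as handle:
        return handle.read()
```

For example, unpack_collection("hgc.tbz", "hgc") would place the HGC pages in a directory named hgc, and read_text_collection("i-en.txt.bz2") would return the I-EN texts as a single string.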

Unified corpora

Corpora and source labels:

  1. HGC: hgc.tbz; hgc.csv
  2. I-EN: i-en.txt.bz2; i-en.csv
  3. KI-04: ki-04.tbz; ki-04.csv
  4. MGC: mgc.tbz; mgc.csv
  5. SANTINIS: santinis.tbz; santinis.csv
  6. KRYS-I: krys-i.txt.bz2; krys-i.csv

Genre mapping

  1. HGC: hgc-map.csv
  2. I-EN: it was originally labelled with the above scheme
  3. KI-04: ki-04-map.csv
  4. MGC: mgc-map.csv
  5. SANTINIS: santinis-map.csv
  6. KRYS-I: krys-i-map.csv

Bibliography

Adamzik1995
Adamzik, K. (1995).
Textsorten - Texttypologie. Eine kommentierte Bibliographie.
Nodus, Münster.

Berninger et al.2008
Berninger, V., Kim, Y., and Ross, S. (2008).
Building a document genre corpus: a profile of the KRYS I corpus.
In Proceedings of the Corpus Profiling Workshop, London.

Görlach2004
Görlach, M. (2004).
Text types and the history of English.
Walter de Gruyter.

Mehler et al.2009
Mehler, A., Sharoff, S., Rehm, G., and Santini, M., editors (2009).
Genres on the Web: Computational Models and Empirical Studies.
Springer, Berlin/New York.

Meyer zu Eissen and Stein2004
Meyer zu Eissen, S. and Stein, B. (2004).
Genre classification of web pages.
In Proceedings of the 27th German Conference on Artificial Intelligence, Ulm, Germany.

Santini2009
Santini, M. (2009).
Cross-testing a genre classification model for the web.
In [Mehler et al.2009].

Sharoff2009
Sharoff, S. (2009).
In the garden and in the jungle. Comparing genres in the BNC and Internet.
In [Mehler et al.2009].

Stubbe and Ringlstetter2007
Stubbe, A. and Ringlstetter, C. (2007).
Recognizing genres.
In Abstract Proceedings of the Colloquium "Towards a Reference Corpus of Web Genres".

Vidulin et al.2007
Vidulin, V., Luštrek, M., and Gams, M. (2007).
Using genres to improve search engines.
In Proc. Towards Genre-Enabled Search Engines: The Impact of NLP. RANLP-07.



Serge Sharoff 2009-09-14