Towards a Reference Corpus of Web Genres
Colloquium held in conjunction with Corpus Linguistics 2007
Organizers: Marina Santini and Serge Sharoff
Birmingham, UK, July 27, 2007
Colloquium website: http://corpus.leeds.ac.uk/serge/webgenres/
Corpus Linguistics 2007 website: http://www.corpus.bham.ac.uk/conference2007
Genres of spoken and written texts are being intensively studied from various angles, e.g., communication studies, discourse analysis, and computational linguistics, without arriving at a generally accepted definition. Many corpora have been built to represent the language, but very few large corpora indicate genres, and when they do, the typology of genres varies widely. For instance, the Brown corpus famously uses 15 textual categories, from press reportage (a text genre) to religion or skills and hobbies (domains), while the British National Corpus (BNC) uses 70 classes, such as academic or non-academic scientific texts or biography. Interestingly, genre classes in the BNC are an add-on proposed by David Lee (Lee, 2001) after the corpus was constructed, rather than a basic criterion of its creation. The genre attribute was included in a few collections used in information retrieval (TREC HARD 2003 and 2004, or TREC-2006 Blog Track), but the set of genres proposed was either debatable (e.g. the ‘reaction’ genre in TREC HARD 2003), or limited to a single genre (e.g. the blog genre in TREC-2006 Blog Track).
The web is new, so it is even less clear how to apply traditional notions of genre to web documents. In corpus-based genre studies, the main tendency has been to build one's own genre collection according to subjective criteria for corpus composition, genre annotation, and genre granularity. Genre annotation has been based either on the common sense of a single rater, or on the agreement of a few annotators. In brief, as things stand, web genre analyses remain self-contained and corpus-dependent.
Building a reference corpus of web genres is certainly difficult because web documents are often characterised by a high level of genre hybridism, by a fragmentation of textuality across several documents, and by the impact of technical features such as hyperlinking, posting facilities and multi-authoring. Since the web is a huge reservoir of documents that can be easily mined for building all sorts of corpora, it is important to overcome the subjectivity that characterises genre-related issues, in order to create sharable resources. What should we consider when designing a reference corpus of web genres? Genres of web documents show some traits that are not accounted for in TREC collections or in the BNC and that are, instead, important on the web. For example:
The rationale for this colloquium is to draw up an initial list of characteristics and requirements for building, annotating and evaluating reference corpora of web genres.
Four presentations prepared for the colloquium report empirical results and offer hands-on answers to some of these questions. More precisely, Alexander Mehler and Rüdiger Gleim analyse web genres at website level and suggest a database-like form of storage. They offer an interesting angle on the notion of web genres using structural and linking information. Barbara H. Kwasnik, Kevin Crowston, Joseph Rubleske and You-Lee Chun tell us how they built a corpus of genre-tagged web pages to populate their genre collection. Serge Sharoff focuses on the similarities between web-derived corpora and classical corpora constructed from print media. Finally, Mark Rosso describes his experience in assembling a genre palette that could be useful for building a genre reference corpus to help web searches.
Three further presentations describe settings of ongoing or future research, and provide preliminary answers to some of the problems listed above. More precisely, Andrea Stubbe and Christoph Ringlstetter discuss two important aspects in web genre research: granularity of genre hierarchies and multi-genre classification. Andrea Stubbe, Christoph Ringlstetter, Tong Zheng, and Randy Goebel present an intriguing idea: a genre classifier that adapts to the information need of a specific user on the basis of user events. They report on how to assemble a genre-annotated corpus. Finally, Cornelius Puschmann proposes an XML-based storage schema for the compilation of computer-mediated discourse (CMD) corpora from mixed sources.
Building a genre-annotated reference corpus of web pages is arduous for a number of reasons, and several solutions appear to be viable. In this colloquium, we would like to make a first attempt to apply the concept of genre to the development of sharable criteria for building genre corpora.
The ambition of this colloquium, the first ever organized on this topic, is to bring together researchers from different communities such as corpus linguistics, genre analysis, digital genre community, computational linguistics, and information retrieval in order to promote the discussion and development of new ideas and methods to create new corpora for language studies and as evaluation resources.
Alexander Mehler and Rüdiger Gleim: A Corpus Model of Structure Formation in Hypertext Types
This paper describes a web genre corpus model. Its starting point is a graph model of the logical document structure of hypertext types and of the linkage of their constituents. We describe an XML-based serialization of this model and provide a database mapping which retains a wide range of web genre data. This will be exemplified by three web genres.
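The idea of a graph of pages and links serialized to XML can be illustrated with a minimal sketch. This is not the authors' model: the element and attribute names (`website`, `page`, `link`, `genre`), the function `serialize_site`, and the sample data are all hypothetical, chosen only to show how nodes and edges of a small site might be written out.

```python
import xml.etree.ElementTree as ET

def serialize_site(pages, links, genre):
    """Serialize a tiny site graph to XML (hypothetical element names).

    pages: dict mapping page id -> page title (graph nodes)
    links: list of (source_id, target_id) pairs (graph edges)
    genre: genre label attached to the whole site
    """
    root = ET.Element("website", {"genre": genre})
    nodes = ET.SubElement(root, "pages")
    for page_id, title in pages.items():
        ET.SubElement(nodes, "page", {"id": page_id}).text = title
    edges = ET.SubElement(root, "links")
    for source, target in links:
        ET.SubElement(edges, "link", {"from": source, "to": target})
    return ET.tostring(root, encoding="unicode")

# Illustrative data only, not taken from the paper.
pages = {"p1": "Home", "p2": "Publications"}
links = [("p1", "p2"), ("p2", "p1")]
xml_text = serialize_site(pages, links, genre="personal academic homepage")
print(xml_text)
```

Keeping pages and links in separate containers mirrors the node/edge split of a graph, which would also make a mapping to two database tables straightforward.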
Barbara H. Kwasnik, Kevin Crowston, Joseph Rubleske and You-Lee Chun:
Building a Corpus of Genre-Tagged Webpages for an Information-Access Experiment
This presentation reports on one phase of a larger study whose overarching aim is to determine how providing genre metadata can help in access to sources of information in a digital environment. We have built a corpus of genre-tagged web pages and structured this particular experimental corpus in such a way as to provide the maximum control for our experiments. We recognize, however, that much rich genre information was either too difficult to represent or had to be pared away.
Serge Sharoff:
In the garden and in the jungle: comparing genres in the BNC and Internet
According to Adam Kilgarriff, the BNC is a jungle when compared to smaller Brown-type corpora, but it looks more like an English garden when compared to the Internet. In this presentation I will compare English and Russian Internet corpora against their human-collected counterparts (BNC and RNC) using two methods: the first involves manual annotation of a subset of Internet corpora, and the second uses probabilistic classifiers. The study shows that the Internet is not radically different from the BNC: Internet corpora do contain a wide range of genres and approximate many genres that exist in printed form. The same is true for the audience level (texts for professionals or for laymen).
Mark Rosso:
Development of a Genre Palette
This presentation details the development of a genre palette used in the study of the effects of genre-annotated search results on the relevance judgement process in a web search environment. The palette was developed in several phases: (i) a survey of user terminology; (ii) user-based refinement of terminology into a tentative genre palette; and (iii) user validation of the genre palette.
Andrea Stubbe and Christoph Ringlstetter:
We introduce a two-level hierarchy of genres based on the definition of genre in terms of form and function (or purpose). Thereby we provide sufficient granularity with the possibility to return to a coarser scheme when preferable. As some texts may naturally fall into more than one genre, an assignment to multiple classes is possible. For those applications where a unique class is required, several techniques for the combination of classifiers were evaluated.
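A two-level scheme with multi-genre assignment can be sketched as follows. The genre names, the mapping `FINE_TO_COARSE`, and both functions are hypothetical. Summing classifier scores is used here only as a simple stand-in for classifier combination, and is not claimed to be one of the techniques the authors evaluated.

```python
# Hypothetical two-level hierarchy: fine-grained genre -> coarse genre.
FINE_TO_COARSE = {
    "news report": "journalism",
    "editorial": "journalism",
    "research article": "scholarly",
    "faq": "help",
    "manual": "help",
}

def coarsen(fine_labels):
    """Map a set of fine-grained labels to the coarser scheme."""
    return {FINE_TO_COARSE[label] for label in fine_labels}

def single_label(classifier_scores):
    """Pick one genre by summing per-genre scores over several classifiers.

    classifier_scores: list of dicts mapping genre -> score.
    """
    totals = {}
    for scores in classifier_scores:
        for genre, score in scores.items():
            totals[genre] = totals.get(genre, 0.0) + score
    return max(totals, key=totals.get)

# A text assigned to two fine-grained genres at once.
labels = {"editorial", "faq"}
coarse = coarsen(labels)

# Two illustrative classifiers that disagree on the best single genre.
chosen = single_label([
    {"editorial": 0.6, "faq": 0.4},
    {"editorial": 0.3, "faq": 0.7},
])
```

Representing a document's genres as a set keeps multi-genre assignment the default, and a unique class is derived only when an application asks for one.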
Andrea Stubbe, Christoph Ringlstetter, Tong Zheng, and Randy Goebel:
Incremental genre classification
In this presentation we will describe attempts to acquire data. These attempts have to consider the users explicitly and cooperatively. The user behaviour will be simulated using annotated corpus data. We will also formulate different scenarios for information gain representing different levels of uncertainty. Our goal is to integrate existing material of different sources into a realistic application.
Cornelius Puschmann:
SchemaCMD: An XML-based storage schema for the compilation of mixed-source CMD corpora
This presentation will outline an XML schema for the segmentation and storage of data from Internet sources, specifically those which utilize so-called web feeds (often associated with the RSS protocol). It is based on the faceted classification scheme recently proposed by Susan Herring and aims to make data from diverse sources accessible and comparable in a single format.
Information on registration and registration fees is provided at the CL2007 website: http://www.corpus.bham.ac.uk/conference2007
The colloquium is scheduled for Friday, 27 July 2007
Corpus Linguistics 2007 Venue: University of Birmingham, Birmingham, UK
Marco Baroni (University of Trento, Italy)
Stefan Gries (University of California, USA)
Adam Kilgarriff (Lexmasterclass, UK)
Alexander Mehler (Bielefeld University, Germany)
Sven Meyer zu Eissen (University of Weimar, Germany)
Paul Rayson (UCREL, Lancaster University, UK)
Georg Rehm (University of Tuebingen, Germany)
Marina Santini (University of Brighton, UK)
Serge Sharoff (University of Leeds, UK)
Benno Stein (University of Weimar, Germany)
Marina Santini (University of Brighton, UK)
Personal Home Page: http://www.nltg.brighton.ac.uk/home/Marina.Santini/
Serge Sharoff (University of Leeds, UK)
Personal Home Page: http://corpus.leeds.ac.uk/serge/
For questions or comments, please contact Marina Santini (MarinaSantini.MS@gmail.com) or Serge Sharoff (s.sharoff@leeds.ac.uk).