Web Document Classification (WebDoC)

Serge Sharoff

Centre for Translation Studies, School of Modern Languages

Katja Markert

School of Computing


Research abstract and goals

The project aims to classify web pages automatically. There are many different kinds of documents on the web, from games to shopping pages to journalism to blogs. Different sorts of page have quite different uses and characteristics. A query for 'Venice' returns pages of various types: recent news, information about history, guidebooks, hotel lists, opinions about hotels and restaurants, etc. For many applications (language teaching, machine translation, information retrieval and extraction) it is also important to be able to select a subcorpus according to specific parameters, such as encyclopedic knowledge vs. instructions, texts written for professionals vs. for the general public, or opinions vs. factual text.

Hand in hand with classifying pages, we need to identify the categories we shall classify them into. The web is new, and this is not an area that has been widely researched to date. We shall adopt an iterative approach by classifying samples of web pages to see which pages fit the existing classification scheme, and amending the scheme to allow for those that do not.

The project is supported by a Google Research Award for 2009-2010.

Expected outcomes and results

Our research questions in this project are:

  1. Which features of webpages are useful for their automatic classification?
  2. How language-specific are those features?
  3. How can we build efficient classifiers that operate with minimal human intervention and take into account the special nature of the web as a corpus?
  4. What is the accuracy of automatic classification?

The project will deliver:

  1. a text typology suitable for classifying the majority of existing web pages;
  2. automatic classifiers to assign arbitrary web pages to categories in this typology;
  3. a method for porting the classifiers to a new language;
  4. a set of corpora with documents classified according to the typology to be developed.

We shall work on different language families, so that the method can be shown to be portable to further languages. We will be testing the approach using webpages in English, Chinese, German and Russian.

Technical description


Typology of webpages and features

In this project we will study two categories which are relevant for web pages, are capable of sensible classification of any web page and can be identified reliably:

Some genres are associated with only some domains, but genres are generally shared between many different domains. For instance, a query for 'Venice' shares a set of genres with many other queries about products: news, encyclopedic information, product lists, opinions. The task of the project is to define a set of generalised genre categories that occur frequently across domains. The exact set is to be defined from a corpus of about 500 texts for each language in the project. We already have collections of about 50,000 webpages per language to create a representative sample [Sharoff, 2006]. More pages can be crawled if necessary.

Feature identification

We will identify features that are indicative of each category in the typology. We will consider three types of intratextual features: lexical (keywords, frequency bands, n-grams, lexical density, etc.), grammatical (using only generalised features which can be obtained from a POS tagger, such as POS frequencies and n-grams, closed-class words, punctuation), and text statistics (average document or sentence length, markup statistics, etc.). In addition, we will consider similarity between web pages, expressed both through their textual similarity and through the hypertext links between them.
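As an illustration only, a few of the intratextual features listed above can be sketched in plain Python. The function name extract_features, the small FUNCTION_WORDS list and the regular-expression tokeniser are assumptions made for this sketch; the project itself would rely on a POS tagger and proper tokenisation for each language.

```python
import re
from collections import Counter

# Hypothetical closed-class word list (an assumption for this sketch);
# a real study would derive closed-class words from a POS tagger.
FUNCTION_WORDS = {
    "the", "a", "an", "of", "in", "on", "to", "and", "or", "is",
    "are", "was", "for", "with", "it", "this", "that",
}


def extract_features(text):
    """Return a few lexical and text-statistics features for one document."""
    # Naive sentence and word segmentation, good enough for an illustration.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-z']+", text.lower())
    content_words = [t for t in tokens if t not in FUNCTION_WORDS]
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    return {
        "n_tokens": n,
        "avg_sentence_length": n / len(sentences) if sentences else 0.0,
        # Lexical density: share of tokens that are not closed-class words.
        "lexical_density": len(content_words) / n if n else 0.0,
        "punctuation_per_token": len(re.findall(r"[,;:!?.]", text)) / n if n else 0.0,
        "top_bigrams": bigrams.most_common(3),
    }
```

Grammatical features such as POS n-grams could be added in the same way by running the bigram counter over tag sequences instead of word forms.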

Our research [Sharoff, 2007] indicates that genres can be identified by POS n-grams, and audience types by lexical density. However, these preliminary findings will have to be tested in the project. The outcome of the feature analysis will also inform further changes necessary in the text typology, if some categories are found to be too difficult to detect automatically.

The resulting typology will maintain the trade-off between reliability (there is little sense in building classifiers for categories that cannot be detected reliably, e.g. the year of text production) and usefulness (some categories are of great interest to web users even if they are difficult to detect, e.g. the genre).


Feature generalisation and weighting

The output of the feature detection study will be a seed set of feature types. To generalise keywords and address data sparseness we will experiment with similarity-based bootstrapping, in which a seed set of keywords will be generalised using their first and second order co-occurrence features, following the techniques used in distributional similarity [McCarthy et al., 2004]. To reduce the number of features to study, we will experiment with three methods which can detect those features most predictive of a given outcome: singular value decomposition [Berry et al., 1999], rough sets [Jensen and Shen, 2004] and non-redundant clustering [Gondek and Hofmann, 2005].
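The similarity-based bootstrapping step can be sketched roughly as follows. This is a minimal illustration that uses first-order co-occurrence vectors and cosine similarity only; second-order features and the dimensionality-reduction methods named above are left out. The function names, the window size and the threshold are assumptions of the sketch, not settings from the project.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Build first-order co-occurrence vectors from tokenised sentences."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def expand_seeds(seeds, vectors, threshold=0.5):
    """Add every word whose vector is close enough to some seed keyword."""
    expanded = set(seeds)
    for word, vec in vectors.items():
        if word in expanded:
            continue
        # The threshold is a hypothetical setting chosen for illustration.
        if any(cosine(vec, vectors[s]) >= threshold for s in seeds if s in vectors):
            expanded.add(word)
    return expanded
```

With a seed keyword such as "hotel", words that occur in the same contexts (for instance "hostel") would be added to the keyword set, which is one way to reduce data sparseness.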


Algorithms

Regarding classifiers, we will experiment with several types of algorithms.

Supervised algorithms

First, we will use classical supervised machine learning methods such as Naive Bayes and SVM with a focus on intra-textual features to establish a performance baseline.
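A baseline of this kind could look like the following sketch of a multinomial Naive Bayes classifier with add-one smoothing over bag-of-words features. The class name and the toy interface are assumptions; in practice an established machine learning toolkit would be used, and the features would come from the feature study rather than raw tokens.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        # docs: list of token lists; labels: one category per document.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for t in tokens:
                if t in self.vocab:  # tokens unseen in training are ignored
                    count = self.word_counts[label][t]
                    score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Training on a handful of labeled pages and calling predict on a new token list returns the most probable category under the independence assumption.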

Graph-based semi-supervised learning

There are two problems with the above standard supervised approach. First, sufficient labeled data is expensive to produce. Secondly, the approach does not take into account the rich hypertext and metadata structure of the Web. To give an example, we would expect that linked web pages are more likely to address the same audience type than web pages that are not linked. We therefore propose graph-based semi-supervised learning techniques, such as Minimum Cut [Blum and Chawla, 2001] or label propagation algorithms, which aim at a globally optimal classification of interlinked items, in contrast to the isolated classification of individual texts in traditional machine learning techniques. These will treat the documents of the Web as vertices, with the edges between them weighted according to link structure as well as textual similarity. We also hypothesize that these graph-based techniques support porting to other languages, as (i) a semi-supervised approach limits human annotation for a new language and (ii) the link structure (in contrast to the textual content) is mostly language-independent. [Yang et al., 2002] explore the use of hypertext links for Web classification, but they concentrate on topic classification only and do not use global optimisation. [Liu et al., 2006] use a graph-based semi-supervised learning algorithm successfully, lending support to our approach, but they concentrate on a small subcorpus of computer science pages with a clearly established typology and limit themselves to English.
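To make the graph-based idea concrete, the following is a minimal sketch of one simple label propagation scheme with clamped seed labels. It is not the Minimum Cut algorithm and not necessarily the variant the project will adopt. The function name, the graph representation and the way edge weights are obtained are all assumptions; here the weights simply stand in for some combination of link structure and textual similarity.

```python
def label_propagation(edges, seeds, labels, iterations=50):
    """Propagate seed labels over a weighted, undirected document graph.

    edges:  dict mapping each node to a dict {neighbour: weight}; every
            neighbour must itself be a key of `edges` (assumed symmetric).
    seeds:  dict mapping the hand-labeled nodes to their label.
    labels: list of all possible labels.
    """
    nodes = list(edges)
    scores = {n: {l: 0.0 for l in labels} for n in nodes}
    for node, label in seeds.items():
        scores[node][label] = 1.0

    for _ in range(iterations):
        new_scores = {}
        for n in nodes:
            if n in seeds:
                # Seed nodes are clamped to their annotated label.
                new_scores[n] = scores[n]
                continue
            total = sum(edges[n].values())
            # Each unlabeled node takes the weighted average of its neighbours.
            new_scores[n] = {
                l: (sum(w * scores[m][l] for m, w in edges[n].items()) / total
                    if total else 0.0)
                for l in labels
            }
        scores = new_scores

    return {n: max(labels, key=lambda l: scores[n][l]) for n in nodes}
```

In such a setup only the seed pages need manual annotation; all other pages receive the label that dominates among their strongly connected neighbours.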

Joint classification tasks

The methods proposed so far classify a text according to each axis separately, i.e. the classification with regard to audience type is independent from the domain or genre classification. We will investigate joint classification along different axes, looking at joint objective functions.
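One possible shape of a joint objective, given purely as an assumption for illustration, is to add a compatibility term between the labels on two axes to the per-axis classifier scores and to pick the best-scoring pair. The function name joint_decode, the score dictionaries and the compatibility table are hypothetical.

```python
from itertools import product


def joint_decode(genre_scores, audience_scores, compatibility):
    """Pick the (genre, audience) pair maximising a joint objective.

    genre_scores / audience_scores: dicts of per-axis log scores.
    compatibility: dict {(genre, audience): bonus}; missing pairs count as 0.
    All three inputs are hypothetical stand-ins for real model outputs.
    """
    pairs = product(sorted(genre_scores), sorted(audience_scores))
    return max(
        pairs,
        key=lambda p: genre_scores[p[0]]
        + audience_scores[p[1]]
        + compatibility.get(p, 0.0),
    )
```

With an empty compatibility table this reduces to classifying each axis independently, so the table is what lets evidence on one axis change the decision on the other.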

Supporting different languages

Our framework will support web page classification in different languages with minimal new annotation via several different strategies: (i) semi-supervised learning (see above for a longer discussion); (ii) inclusion of features that are more language-independent than lexical features (such as density and link structure; see above); and (iii) use of cross-language information on the Web. The last strategy can include links between sites in different languages as well as the use of web pages which exist in several languages. For example, such a web page can be used to collect cheap training data for a new language, or as a classification constraint, so that the English and the non-English versions have to be classified with the same label.
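The same-label constraint on parallel pages could, for instance, be enforced by summing the scores that two language-specific classifiers give to each label and choosing a single label for both versions. This is only a sketch under that assumption; the function name and the score dictionaries are hypothetical.

```python
def joint_label(scores_en, scores_other):
    """Choose one label for a page that exists in two languages.

    scores_en / scores_other: dicts {label: log score} from two
    (hypothetical) language-specific classifiers for the two versions.
    """
    shared = sorted(scores_en.keys() & scores_other.keys())
    # Both versions must receive the same label, so maximise the sum.
    return max(shared, key=lambda label: scores_en[label] + scores_other[label])
```

A confident decision in one language can thus override a weak decision in the other, which is how an English classifier might supply cheap training labels for a new language.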

Existing research

Automatic text classification by topic has a long history. Text classification according to genre (or text type) is a much newer field, with far fewer well-established methods and results. The same holds for the discovery and analysis of new web genres. To date there has been only the most preliminary work at this intersection, mostly limited to one language, namely English.

Until recently there was very little work, under either the 'document classification' or the 'web as corpus' headings, which used linguistically-informed notions of genre. [Kessler et al., 1997] explore classification according to genre, but not on the web, and [Dillon and Gushrowski, 2000] map out new web genres and how they may be used to improve searching. More recent work in genre classification includes [Finn and Kushmerick, 2006; Rehm, 2002; Meyer zu Eissen and Stein, 2004]. In a similar way, there has been some research on automatic authorship and audience detection [Baroni and Bernardini, 2006; Heilman et al., 2008]. However, all of this work is at an early stage, with fluid taxonomies of genres, limited coverage of the complete set of webpages (e.g. homepages only) and no common criteria for genre annotation [Santini, 2007].

Bibliography

Baroni and Bernardini, 2006
Baroni, M. and Bernardini, S. (2006).
A new approach to the study of translationese: Machine-learning the difference between original and translated text.
Literary and Linguistic Computing, 21(3):259-274.

Berry et al., 1999
Berry, M., Drmac, Z., and Jessup, E. (1999).
Matrices, vector spaces, and information retrieval.
SIAM Review, 41(2):335-362.

Biemann, 2007
Biemann, C. (2007).
Unsupervised and Knowledge-Free Natural Language Processing in the Structure Discovery Paradigm.
PhD thesis, University of Leipzig.

Blum and Chawla, 2001
Blum, A. and Chawla, S. (2001).
Learning from labeled and unlabeled data using graph mincuts.
In Proceedings of the Eighteenth International Conference on Machine Learning, pages 19-26.

Dillon and Gushrowski, 2000
Dillon, A. and Gushrowski, B. A. (2000).
Genres and the web: is the personal home page the first uniquely digital genre?
Journal of the American Society for Information Science, 51(2):202-205.

Finn and Kushmerick, 2006
Finn, A. and Kushmerick, N. (2006).
Learning to classify documents according to genre.
Journal of the American Society for Information Science and Technology, 57(11).

Gondek and Hofmann, 2005
Gondek, D. and Hofmann, T. (2005).
Non-redundant clustering with conditional ensembles.
In Proc. of ACM SIGKDD conference on Knowledge discovery in data mining, pages 70-77.

Heilman et al., 2008
Heilman, M., Collins-Thompson, K., and Eskenazi, M. (2008).
An analysis of statistical models and features for reading difficulty prediction.
In Proc. of the Third Workshop on Innovative Use of NLP for Building Educational Applications, pages 71-79, Columbus, Ohio. ACL.

Jensen and Shen, 2004
Jensen, R. and Shen, Q. (2004).
Semantics-preserving dimensionality reduction: Rough and fuzzy-rough-based approaches.
IEEE Transactions on Knowledge and Data Engineering, 16(12):1457-1471.

Kessler et al., 1997
Kessler, B., Nunberg, G., and Schütze, H. (1997).
Automatic detection of text genre.
In Proceedings of the 35th ACL/8th EACL, pages 32-38.

Liu et al., 2006
Liu, R., Zhou, J., and Liu, M. (2006).
Graph-based semi-supervised learning algorithms for web page classification.
In Proceedings of the Sixth International Conference on Intelligent Systems Design and Applications (ISDA'06), pages 856-860.

Markert and Nissim, 2005
Markert, K. and Nissim, M. (2005).
Comparing knowledge sources for nominal anaphora resolution.
Computational Linguistics, 31(3):367-401.

McCarthy et al., 2004
McCarthy, D., Koeling, R., Weeds, J., and Carroll, J. (2004).
Finding predominant word senses in untagged text.
In Proc. 42nd Meeting of the Association for Computational Linguistics (ACL'04), pages 279-286, Barcelona.

Meyer zu Eissen and Stein, 2004
Meyer zu Eissen, S. and Stein, B. (2004).
Genre classification of web pages.
In Proceedings of the 27th German Conference on Artificial Intelligence, Ulm, Germany.

Rehm, 2002
Rehm, G. (2002).
Towards automatic web genre identification - a corpus-based approach in the domain of academia by example of the academic's personal homepage.
In Proc. of the Hawaii Internat. Conf. on System Sciences.

Santini, 2007
Santini, M. (2007).
Automatic Identification of Genre in Web Pages.
PhD thesis, University of Brighton.

Sharoff, 2006
Sharoff, S. (2006).
Creating general-purpose corpora using automated search engine queries.
In Baroni, M. and Bernardini, S., editors, WaCky! Working papers on the Web as Corpus. Gedit, Bologna.
http://wackybook.sslmit.unibo.it.

Sharoff, 2007
Sharoff, S. (2007).
Classifying web corpora into domain and genre using automatic feature identification.
In Proc. of Web as Corpus Workshop, Louvain-la-Neuve.

Yang et al., 2002
Yang, Y., Slattery, S., and Ghani, R. (2002).
A study of approaches to hypertext categorization.
Journal of Intelligent Information Systems, 18(2-3):219-241.



Serge Sharoff 2009-05-08