Web Document Classification (WebDoC)

Serge Sharoff

Lecturer, Centre for Translation Studies, School of Modern Languages, University of Leeds, +44-113-3437287, s.sharoff@leeds.ac.uk.

Katja Markert

Lecturer, School of Computing, University of Leeds, +44-113-3435777, markert@comp.leeds.ac.uk.


information retrieval, natural language processing

Google contact

Andy Weissl, Software Engineer, Search Quality, Google Switzerland GmbH, Brandschenkestrasse 110 8002 Zurich, weissl@google.com

Research abstract and goals

The project aims to classify web pages automatically. There are many different kinds of documents on the web, from games to shopping pages to journalism to blogs. Different sorts of page have quite different uses and characteristics. A query for 'Venice' returns pages of various types: recent news, information about history, guidebooks, hotel lists, opinions about hotels and restaurants, etc. For many applications (language teaching, machine translation, information retrieval and extraction) it is also important to be able to select a subcorpus according to specific parameters, such as encyclopedic knowledge vs. instructions, texts written for professionals vs. for the general public, or opinions vs. factual text.

Hand in hand with classifying pages, we need to identify the categories we shall classify them into. The web is new, and this is not an area that has been widely researched to date. We shall adopt an iterative approach by classifying samples of web pages to see which pages fit the existing classification scheme, and amending the scheme to allow for those that do not.

Expected outcomes and results

Our research questions in this project are:

  1. Which features of webpages are useful for their automatic classification?
  2. How language-specific are those features?
  3. How can we build efficient classifiers that operate with minimal human intervention and take into account the special nature of the web as a corpus?
  4. What is the accuracy of automatic classification?

The project will deliver:

  1. a text typology suitable for classifying the majority of existing web pages;
  2. automatic classifiers to assign arbitrary web pages to categories in this typology;
  3. a method for porting the classifiers to a new language;
  4. a set of corpora with documents classified according to the typology to be developed.

We shall work on different language families, so that the method can be shown to be portable to further languages. We will be testing the approach using webpages in English, Chinese, German and Russian.

Technical description

Typology of webpages and features

In this project we will study two categories which are relevant to web pages, allow any web page to be classified sensibly and can be identified reliably:

Some genres can be associated with only some domains, but genres are generally shared between many different domains. For instance, the results of a query for 'Venice' share a set of genres with the results of many other queries about products: news, encyclopedic information, product lists, opinions. The task of the project is to define a set of generalised genre categories that occur frequently across domains. The exact set is to be defined from a corpus of about 500 texts for each language in the project. We already have collections of about 50,000 webpages per language from which to create a representative sample [Sharoff, 2006]. More pages can be crawled if necessary.

Feature identification

We will identify features that are indicative of each category in the typology. We will consider three types of intratextual features: lexical (keywords, frequency bands, n-grams, lexical density, etc.), grammatical (using only generalised features which can be obtained from a POS tagger, such as POS frequencies and n-grams, closed-class words, punctuation), and text statistics (average document or sentence length, markup statistics, etc.). In addition, we will consider similarity between web pages, expressed both as the textual similarity between two pages and as the hypertext links between them.
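
As a rough illustration of the lexical and text-statistics features listed above, the following sketch computes lexical density, average sentence length and a punctuation rate for a plain-text page. The function name text_features, the small closed-class word list and the English-only tokenisation are assumptions made for this example; the proposal does not fix an implementation, and a real system would take closed-class words from a POS tagger.

```python
import re

# Illustrative (hypothetical) closed-class word list; a real system would
# derive function words from a POS tagger rather than a hand-written set.
CLOSED_CLASS = {
    "a", "an", "the", "and", "or", "but", "of", "in", "on", "to", "for",
    "with", "is", "are", "was", "were", "it", "this", "that", "you", "we",
}


def text_features(text):
    """Return a few simple intratextual features for one document."""
    # Crude sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # English-only tokenisation (an assumption of this sketch).
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    punctuation = re.findall(r"[.,;:!?]", text)
    content = [t for t in tokens if t not in CLOSED_CLASS]
    n_tokens = len(tokens) or 1  # avoid division by zero on empty input
    return {
        # Share of tokens that are not closed-class words.
        "lexical_density": len(content) / n_tokens,
        # Mean number of word tokens per sentence.
        "avg_sentence_length": len(tokens) / (len(sentences) or 1),
        # Punctuation marks per word token.
        "punctuation_rate": len(punctuation) / n_tokens,
    }


features = text_features("The hotel is in Venice. Book a room today!")
```

Feature dictionaries of this kind could then be fed to any of the classifiers discussed later; grammatical features such as POS n-grams would need a tagger and are not covered by this sketch.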

Our research [Sharoff, 2007] indicates that genres can be identified by POS n-grams, and audience types by lexical density. However, these preliminary findings will have to be tested in the project. The outcome of the feature analysis will also inform further changes necessary in the text typology, if some categories are found to be too difficult to detect automatically.

The resulting typology will balance reliability (there is little sense in building classifiers for categories that cannot be detected reliably, e.g., the year of text production) against usefulness (some categories are of great interest to web users, even if they are difficult to detect, e.g., the genre).

Feature generalisation and weighting

The output of the feature detection study will be a seed set of feature types. To generalise keywords and address data sparseness we will experiment with similarity-based bootstrapping, in which a seed set of keywords will be generalised using their first and second order co-occurrence features, following the techniques used in distributional similarity [McCarthy et al., 2004]. To reduce the number of features to study, we will experiment with three methods which can detect those features most predictive of a given outcome: singular value decomposition [Berry et al., 1999], rough sets [Jensen and Shen, 2004] and non-redundant clustering [Gondek and Hofmann, 2005].
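
One possible reading of the keyword generalisation step is sketched below, under simplifying assumptions: each word is represented by its first-order co-occurrence counts within a sentence, and a word is added to the seed set when the cosine similarity of its vector to that of some seed exceeds a threshold. Plain cosine is used here as a stand-in and is not necessarily the measure of the cited work; second-order co-occurrence is not covered. The function names, the toy corpus and the threshold of 0.5 are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences):
    """Map each word to counts of the words it shares a sentence with."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    vectors[word][other] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def expand_seeds(seeds, vectors, threshold=0.5):
    """Add words whose context vector is close to that of any seed keyword."""
    known_seeds = [s for s in seeds if s in vectors]
    expanded = set(seeds)
    for word, vec in vectors.items():
        if any(cosine(vec, vectors[s]) >= threshold for s in known_seeds):
            expanded.add(word)
    return expanded


# Toy corpus (hypothetical) standing in for a collection of web pages.
corpus = [
    "cheap hotel rooms in venice",
    "cheap hostel rooms in venice",
    "history of the venetian republic",
]
vectors = cooccurrence_vectors(corpus)
keywords = expand_seeds({"hotel"}, vectors)
```

On this toy corpus "hostel" shares all of its contexts with the seed "hotel" and is added, while words from the unrelated third sentence are not. With so little data, frequent context words can also pass the threshold, which is one reason the expanded set would need weighting or pruning in practice.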


Regarding classifiers, we will experiment with several types of algorithms.

Supervised algorithms

First, we will use classical supervised machine learning methods such as Naive Bayes and SVM with a focus on intra-textual features to establish a performance baseline.
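
To make the baseline concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing over bag-of-words features. It is an illustration only: the class name, the whitespace tokenisation, the toy training texts and the labels "news" and "opinion" are assumptions made for the example, not part of the proposed typology or system.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words features, add-one smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.vocabulary.update(words)
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        words = [w for w in text.lower().split() if w in self.vocabulary]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Log prior plus smoothed log likelihood of each known word.
            score = math.log(count / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for word in words:
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (hypothetical labels and texts).
train_docs = [
    "breaking news report from the city",
    "minister says report due today",
    "i love this hotel great stay",
    "i think the food was great",
]
train_labels = ["news", "news", "opinion", "opinion"]
model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict("great hotel i love it")
```

The same interface could take the intratextual features described earlier instead of raw words; the point of the baseline is only to give a reference accuracy against which the graph-based methods can be compared.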

Graph-based semi-supervised learning

The standard supervised approach above has two problems. First, sufficient labeled data is expensive to produce. Second, the approach does not take into account the rich hypertext and metadata structure of the Web. For example, we would expect linked web pages to be more likely to address the same audience type than web pages that are not linked. We therefore propose graph-based semi-supervised learning techniques such as Minimum Cut [Blum and Chawla, 2001] or label propagation algorithms, which aim at a globally optimal classification of interlinked items, in contrast to the isolated classification of individual texts in traditional machine learning techniques. These techniques treat the documents of the Web as vertices, with the edges between them weighted according to link structure as well as textual similarity. We also hypothesize that graph-based techniques support porting to other languages, as (i) a semi-supervised approach limits the human annotation needed for a new language and (ii) the link structure (in contrast to the textual content) is mostly language-independent.

[Yang et al., 2002] explore the use of hypertext links for Web classification, but they concentrate on topic classification only and do not use global optimisation. [Liu et al., 2006] use a graph-based semi-supervised learning algorithm successfully, lending support to our approach, but they concentrate on a small subcorpus of computer science pages with a clearly established typology and limit themselves to English.
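
The sketch below illustrates the label propagation variant of this idea (not Minimum Cut) on a tiny graph: a few seed pages keep their labels, and every other page repeatedly takes the weighted average of its neighbours' label scores. The function name, the page identifiers, the edge weights and the labels "professional" and "general" are hypothetical; in the project the weights would combine link structure and textual similarity in a way that is still to be determined.

```python
def propagate_labels(edges, seeds, labels, iterations=50):
    """Iterative label propagation on a weighted undirected graph.

    edges:  dict mapping (node_a, node_b) -> positive weight
    seeds:  dict mapping labelled node -> label (kept fixed throughout)
    labels: list of all possible labels
    """
    neighbours = {}
    for (a, b), weight in edges.items():
        neighbours.setdefault(a, {})[b] = weight
        neighbours.setdefault(b, {})[a] = weight

    # Start from a uniform distribution, except for the clamped seed nodes.
    uniform = {lab: 1.0 / len(labels) for lab in labels}
    scores = {node: dict(uniform) for node in neighbours}
    for node, label in seeds.items():
        scores[node] = {lab: float(lab == label) for lab in labels}

    for _ in range(iterations):
        updated = {}
        for node, adjacent in neighbours.items():
            if node in seeds:
                updated[node] = scores[node]
                continue
            total = sum(adjacent.values())
            # New score: weighted average of the neighbours' current scores.
            updated[node] = {
                lab: sum(w * scores[n][lab] for n, w in adjacent.items()) / total
                for lab in labels
            }
        scores = updated
    return {node: max(dist, key=dist.get) for node, dist in scores.items()}


# Hypothetical chain of four pages; "a" and "d" are hand-labelled seeds.
edges = {("a", "b"): 1.0, ("b", "c"): 0.2, ("c", "d"): 1.0}
seeds = {"a": "professional", "d": "general"}
result = propagate_labels(edges, seeds, ["professional", "general"])
```

Here page "b" is strongly connected to the "professional" seed and page "c" to the "general" seed, so each unlabelled page ends up with the label of its closer seed even though neither was annotated by hand.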

Joint classification tasks

The methods proposed so far classify a text according to each axis separately, i.e. the classification with regard to audience type is independent from the domain or genre classification. We will investigate joint classification along different axes, looking at joint objective functions.
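
One simple form such a joint objective could take is sketched here, purely as an assumption for illustration: the score of a (genre, audience) pair is the sum of the per-axis log probabilities plus a compatibility term for the pair, and the best-scoring pair is chosen. The function name, the probabilities and the compatibility weight are hypothetical, and the proposal does not commit to this formulation.

```python
import math
from itertools import product


def joint_classify(genre_probs, audience_probs, compatibility):
    """Pick the (genre, audience) pair maximising a joint score.

    The score is the sum of the two per-axis log probabilities plus a
    compatibility bonus for the pair (0.0 when the pair is not listed).
    All probabilities must be strictly positive.
    """
    def score(pair):
        genre, audience = pair
        return (
            math.log(genre_probs[genre])
            + math.log(audience_probs[audience])
            + compatibility.get(pair, 0.0)
        )

    return max(product(genre_probs, audience_probs), key=score)


# Hypothetical outputs of two independent per-axis classifiers.
genre_probs = {"instructions": 0.55, "encyclopedic": 0.45}
audience_probs = {"general": 0.6, "professional": 0.4}
# Hypothetical weight rewarding a pair that often occurs together.
compatibility = {("encyclopedic", "general"): 0.5}
best = joint_classify(genre_probs, audience_probs, compatibility)
```

Classified independently, this page would be labelled ("instructions", "general"); with the compatibility term the joint decision changes to ("encyclopedic", "general"), which is the kind of interaction between axes that separate classifiers cannot capture.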

Supporting different languages

Our framework will support web page classification in different languages with minimal new annotation via several different strategies: (i) semi-supervised learning (see above for a longer discussion); (ii) inclusion of features that are more language-independent than lexical features (such as density and link structure; see above); and (iii) use of cross-language information on the Web. The latter can include links between sites in different languages as well as web pages which exist in several languages. For example, such a web page can be used to collect cheap training data for a new language, or as a classification constraint requiring the English and the non-English versions to be classified with the same label.
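
The idea of collecting cheap training data from pages that exist in several languages can be sketched as a simple label projection, assuming that pairs of parallel pages have already been identified. The function name and the example.org URLs are hypothetical, and how parallel pages would be detected is not addressed here.

```python
def project_labels(english_labels, parallel_pages):
    """Copy labels from classified English pages to their translations.

    english_labels: dict mapping English URL -> label
    parallel_pages: iterable of (english_url, foreign_url) pairs
    """
    return {
        foreign: english_labels[english]
        for english, foreign in parallel_pages
        if english in english_labels  # skip pairs without an English label
    }


# Hypothetical labelled English pages and their German counterparts.
english_labels = {"http://example.org/en/guide": "instructions"}
parallel_pages = [
    ("http://example.org/en/guide", "http://example.org/de/guide"),
    ("http://example.org/en/news", "http://example.org/de/news"),
]
german_training = project_labels(english_labels, parallel_pages)
```

The projected labels could serve as seed annotations for the semi-supervised classifiers in the new language, in place of some of the manual annotation.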

Special thanks to Adam Kilgarriff (Lexical Computing) for his support in development of this proposal.


Berry et al., 1999
Berry, M., Drmac, Z., and Jessup, E. (1999).
Matrices, vector spaces, and information retrieval.
SIAM Review, 41(2):335-362.

Blum and Chawla, 2001
Blum, A. and Chawla, S. (2001).
Learning from labeled and unlabeled data using graph mincuts.
In Proceedings of the Eighteenth International Conference on Machine Learning, pages 19-26.

Gondek and Hofmann, 2005
Gondek, D. and Hofmann, T. (2005).
Non-redundant clustering with conditional ensembles.
In Proc. of ACM SIGKDD conference on Knowledge discovery in data mining, pages 70-77.

Jensen and Shen, 2004
Jensen, R. and Shen, Q. (2004).
Semantics-preserving dimensionality reduction: Rough and fuzzy-rough-based approaches.
IEEE Transactions on Knowledge and Data Engineering, 16(12):1457-1471.

Liu et al., 2006
Liu, R., Zhou, J., and Liu, M. (2006).
Graph-based semi-supervised learning algorithms for web page classification.
In Proceedings of the Sixth International Conference on Intelligent Systems Design and Applications (ISDA'06), pages 856-860.

McCarthy et al., 2004
McCarthy, D., Koeling, R., Weeds, J., and Carroll, J. (2004).
Finding predominant word senses in untagged text.
In Proc. 42nd Meeting of the Association for Computational Linguistics (ACL'04), pages 279-286, Barcelona.

Sharoff, 2006
Sharoff, S. (2006).
Creating general-purpose corpora using automated search engine queries.
In Baroni, M. and Bernardini, S., editors, WaCky! Working papers on the Web as Corpus. Gedit, Bologna.

Sharoff, 2007
Sharoff, S. (2007).
Classifying web corpora into domain and genre using automatic feature identification.
In Proc. of Web as Corpus Workshop, Louvain-la-Neuve.

Yang et al., 2002
Yang, Y., Slattery, S., and Ghani, R. (2002).
A study of approaches to hypertext categorization.
Journal of Intelligent Information Systems, 18(2-3):219-241.

Serge Sharoff 2014-10-22