Web Document Classification (WebDoC)
Centre for Translation Studies, School of Modern Languages
School of Computing,
Research abstract and goals
The project aims to classify web pages automatically. There are many
different kinds of documents on the web, from games to shopping pages
to journalism to blogs. Different sorts of page have quite different
uses and characteristics. A query for 'Venice' results in pages of
various types, referring to recent news, information about history,
guidebooks, hotel lists, opinions about hotels and restaurants, etc.
For many applications (language teaching, machine translation,
information retrieval and extraction) it is also important to be able
to select a subcorpus according to specific parameters, such as
encyclopedic knowledge vs. instructions, texts written for
professionals vs. for the general public, or opinions vs. factual text.
Hand in hand with classifying pages, we need to identify the
categories we shall classify them into. The web is new, and this is
not an area that has been widely researched to date. We shall adopt
an iterative approach by classifying samples of web pages to see which
pages fit the existing classification scheme, and amending the scheme
to allow for those that do not.
The project is supported by a Google Research Award for 2009-2010.
Our research questions in this project are:
- Which features of webpages are useful for their automatic classification?
- How language-specific are those features?
- How can we build efficient classifiers that operate with minimal human intervention and take into account the special nature of the web as a corpus?
- What is the accuracy of automatic classification?
The project will deliver:
- a text typology suitable for classifying the majority of existing web pages;
- automatic classifiers to assign arbitrary web pages to categories in this typology;
- a method for porting the classifiers to a new language;
- a set of corpora with documents classified according to the typology to be developed.
We shall work on different language families, so that the method can
be shown to be portable to further languages. We will be testing the
approach using webpages in English, Chinese, German and Russian.
Typology of webpages and features
In this project we will study two categories which are relevant for
web pages, can be applied sensibly to any web page and can be
identified reliably:
- generalised genres (information, instruction, opinions, etc.);
- audience (general, informed, professional).
Some genres can be associated with only some domains, but genres are
generally shared between many different domains. For instance, the
example of a query for 'Venice' shares a set of genres with
many other queries about products: news, encyclopedic information,
product lists, opinions. The task of the project is to define a set
of generalised genre categories that occur frequently across domains.
The exact set is to be defined from a corpus of about 500 texts for
each language in the project. We already have collections of about 50,000
webpages per language from which to draw a representative sample
[Sharoff, 2006]. More pages can be crawled if necessary.
We will identify features that are indicative of each category
in the typology. We will consider three types of intratextual features:
lexical (keywords, frequency bands, n-grams, lexical density, etc.),
grammatical (using only generalised features which can be obtained
from a POS tagger, such as POS frequencies and n-grams, closed-class
words, punctuation), and text statistics (average document or sentence
length, markup statistics, etc.). In addition, we will consider
similarity between web pages, expressed both through their textual
similarity and through the hypertext links between them.
Our research [Sharoff, 2007] indicates that genres can
be identified by POS n-grams, and audience types by lexical density.
However, these preliminary findings will have to be tested in the
project. The outcome of the feature analysis will also inform further
changes necessary in the text typology, if some categories are found
to be too difficult to detect automatically.
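As a rough illustration of the intratextual features listed above, the following sketch computes a few of them for a single English text. It is only a sketch under stated assumptions: the function-word list, the tokeniser, the sentence splitter and all function names are hypothetical stand-ins, and lexical density is approximated here as the share of tokens outside a small closed-class list rather than derived from a POS tagger.

```python
import re
from collections import Counter

# Hypothetical closed-class (function) word list. A real study would take
# closed-class status from a POS tagger rather than a hand-written set.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "in", "on", "at", "to",
    "for", "with", "by", "is", "are", "was", "were", "be", "it", "this",
    "that", "not", "as", "from",
}


def tokenize(text):
    """Return lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def split_sentences(text):
    """Naive sentence splitter on '.', '!' and '?'."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def extract_features(text):
    """Compute a small dictionary of lexical and statistical features."""
    tokens = tokenize(text)
    sentences = split_sentences(text)
    n_tokens = len(tokens)
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    # Bigrams are counted across sentence boundaries for simplicity.
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_punct = len(re.findall(r"[,;:!?]", text))
    return {
        "n_tokens": n_tokens,
        "lexical_density": len(content) / n_tokens if n_tokens else 0.0,
        "avg_sentence_length": n_tokens / len(sentences) if sentences else 0.0,
        "punctuation_per_token": n_punct / n_tokens if n_tokens else 0.0,
        "top_bigrams": bigrams.most_common(3),
    }


features = extract_features("The cat sat on the mat. The dog barked.")
print(features["lexical_density"], features["avg_sentence_length"])
```

Grammatical features such as POS n-grams would be added in the same way once a tagger is available for the language in question.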
The resulting typology will balance reliability
(there is little sense in building classifiers for categories that
cannot be detected reliably, e.g. the year of text production) against
usefulness (some categories are of great interest to web users,
even if they are difficult to detect, e.g. the genre).
Feature generalisation and weighting
The output of the feature detection study will be a seed set of
feature types. To generalise keywords and address data sparseness we
will experiment with similarity-based bootstrapping, in which a seed
set of keywords will be generalised using their first and second order
co-occurrence features, following the techniques used in
distributional similarity [McCarthy et al., 2004]. To reduce the number of
features to study, we will experiment with three methods which can
detect those features most predictive of a given outcome: singular
value decomposition [Berry et al., 1999], rough sets [Jensen and Shen, 2004] and
non-redundant clustering [Gondek and Hofmann, 2005].
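The idea of generalising a seed set of keywords through co-occurrence can be sketched as follows. This is one simple reading of distributional similarity, not the procedure of [McCarthy et al., 2004]: each word is represented by the words it co-occurs with, and words whose co-occurrence vectors are close (by cosine) to a seed's vector are added. The window size, the threshold and all names are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def expand_seeds(seeds, vectors, threshold=0.5):
    """Return non-seed words whose vector is similar to some seed's vector."""
    known_seeds = [s for s in seeds if s in vectors]
    expanded = set()
    for word, vec in vectors.items():
        if word in seeds:
            continue
        if any(cosine(vec, vectors[s]) >= threshold for s in known_seeds):
            expanded.add(word)
    return expanded


# Toy corpus, already tokenised (hypothetical data).
corpus = [
    ["buy", "cheap", "hotel", "now"],
    ["buy", "cheap", "hostel", "now"],
    ["read", "long", "history", "book"],
]
vectors = cooccurrence_vectors(corpus)
print(expand_seeds({"hotel"}, vectors, threshold=0.9))
```

In the toy corpus "hostel" shares all of its contexts with the seed "hotel" and is therefore added, while "history" shares none. Repeating the expansion with the enlarged seed set gives the bootstrapping loop.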
Algorithms
Regarding classifiers, we will experiment with several types of
algorithms.
First, we will use classical supervised machine learning
methods such as Naive Bayes and SVM with a focus
on intra-textual features to establish a performance baseline.
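A minimal sketch of such a baseline is a multinomial Naive Bayes classifier over token features, written here with the standard library only. The class name, the add-one smoothing and the toy genre labels are illustrative assumptions; they are not the project's actual configuration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()
        for tokens, label in zip(documents, labels):
            self.feature_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def log_scores(self, tokens):
        """Return the unnormalised log posterior for every class."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + vocab_size
            score = math.log(n_docs / total_docs)
            for token in tokens:
                if token in self.vocabulary:  # features unseen in training are ignored
                    score += math.log((counts[token] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Hypothetical training data: tokenised texts with toy genre labels.
train_docs = [
    ["you", "should", "click", "install"],
    ["i", "think", "it", "is", "great"],
    ["click", "next", "then", "install"],
    ["i", "love", "it"],
]
train_labels = ["instruction", "opinion", "instruction", "opinion"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["click", "install", "now"]))
```

The token lists could equally hold POS n-grams or binned text statistics, since the classifier only sees feature counts.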
There are two problems with the standard supervised approach above.
First, sufficient labeled data is expensive to produce. Second, the
approach does not take into account the rich hypertext and metadata
structure of the Web. For example, we would expect linked web pages
to be more likely to address the same audience type than web pages
that are not linked. We therefore propose graph-based semi-supervised
learning techniques such as Minimum Cut [Blum and Chawla, 2001] or
label propagation algorithms, which aim at a globally optimal
classification of interlinked items, in contrast to the isolated
classification of individual texts in traditional machine learning
techniques. These techniques treat Web documents as vertices, with
edges between them weighted according to link structure as well as
textual similarity. We also hypothesize that graph-based techniques
support porting to other languages, as (i) a semi-supervised approach
limits human annotation for a new language and (ii) the link structure
(in contrast to the textual content) is mostly language-independent.
[Yang et al., 2002] explore the
use of hypertext links for Web classification but they concentrate on
topic classification only and also do not use global optimisation. [Liu et al., 2006] use a graph-based semi-supervised learning algorithm successfully, lending support to our approach, but concentrate on a small
subcorpus of computer science pages with a clearly established typology only and limit themselves to English.
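To make the graph-based idea concrete, the sketch below implements a simple iterative label propagation scheme (not Minimum Cut) on a small weighted graph. Pages are vertices, edge weights stand in for a combination of link structure and textual similarity, and a few seed pages carry known audience labels. The graph, the weights and the function name are hypothetical; how links and similarity would actually be combined into one weight is left open here.

```python
def propagate_labels(edges, seeds, labels, iterations=50):
    """Propagate seed labels over a weighted, undirected graph.

    edges:  {node: {neighbour: weight}}, with every neighbour also a key
    seeds:  {node: label} for the pages that were annotated by hand
    labels: list of all possible labels
    """
    uniform = {lab: 1.0 / len(labels) for lab in labels}
    dist = {}
    for node in edges:
        if node in seeds:
            # Seeds start (and stay) with all mass on their known label.
            dist[node] = {lab: float(lab == seeds[node]) for lab in labels}
        else:
            dist[node] = dict(uniform)

    for _ in range(iterations):
        new_dist = {}
        for node, neighbours in edges.items():
            total = sum(neighbours.values())
            if node in seeds or total == 0:
                new_dist[node] = dist[node]  # clamp seeds, skip isolated nodes
                continue
            # Each unlabeled node takes the weighted average of its neighbours.
            new_dist[node] = {
                lab: sum(w * dist[nb][lab] for nb, w in neighbours.items()) / total
                for lab in labels
            }
        dist = new_dist

    return {node: max(d, key=d.get) for node, d in dist.items()}


# Hypothetical chain of pages a - b - c - d - e with one weak edge (c - d).
graph = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0, "d": 0.2},
    "d": {"c": 0.2, "e": 1.0},
    "e": {"d": 1.0},
}
seeds = {"a": "general", "e": "professional"}
print(propagate_labels(graph, seeds, ["general", "professional"]))
```

Only two of the five pages are labeled, yet every page receives a label, and the weak edge between c and d is where the two audience types separate. This is the sense in which the classification is decided jointly over interlinked items rather than page by page.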
The methods proposed so far classify a text according to each axis
separately, i.e. the classification with regard to audience type is
independent of the domain or genre classification. We will
investigate joint classification along different axes, looking at joint objective functions.
Our framework will support web page classification in different
languages with minimal new annotation via several different
strategies: (i) semi-supervised learning (see above for a
longer discussion); (ii) inclusion of features that are more
language-independent than lexical features (such as density and
link structure; see above); (iii) use of cross-language information on
the Web. The latter can include links between sites in different
languages as well as web pages which exist in several languages.
For example, such a web page can be used to collect cheap training
data for a new language, or as a classification constraint, so that
the English and the non-English versions have to be classified with
the same label.
Automatic text classification by topic has a long history. Text
classification according to genre (or text type) is a much
newer field, with far fewer well-established methods and results,
as is the discovery and analysis of new web genres.
To date there has been
only the most preliminary work at this intersection, mostly limited to
one language, namely English.
Until recently there was very little work, under either the 'document
classification' or the 'web as corpus' headings, which uses
linguistically-informed notions of genre. [Kessler et al., 1997] explore
classification according to genre, but not on the web, and
[Dillon and Gushrowski, 2000] map out new web genres and how they may be used to
improve searching. More recent work in genre classification includes
[Finn and Kushmerick, 2006; Rehm, 2002; Meyer zu Eissen and Stein, 2004].
In a similar way, there has been
some research on automatic authorship and audience detection
[Baroni and Bernardini, 2006; Heilman et al., 2008]. However, all of this work
is at an early stage, with fluid taxonomies of genres, limited
coverage of the complete set of webpages (e.g. homepages only) and no
common criteria for genre annotation [Santini, 2007].
- Baroni and Bernardini, 2006
Baroni, M. and Bernardini, S. (2006).
A new approach to the study of translationese: Machine-learning the
difference between original and translated text.
Literary and Linguistic Computing, 21(3):259-274.
- Berry et al., 1999
Berry, M., Drmac, Z., and Jessup, E. (1999).
Matrices, vector spaces, and information retrieval.
SIAM Review, 41(2):335-362.
- Biemann, 2007
Biemann, C. (2007).
Unsupervised and Knowledge-Free Natural Language Processing in
the Structure Discovery Paradigm.
PhD thesis, University of Leipzig.
- Blum and Chawla, 2001
Blum, A. and Chawla, S. (2001).
Learning from labeled and unlabeled data using graph mincuts.
In Proceedings of the Eighteenth International Conference on
Machine Learning, pages 19-26.
- Dillon and Gushrowski, 2000
Dillon, A. and Gushrowski, B. A. (2000).
Genres and the web: is the personal home page the first uniquely
digital genre?
J. Am. Soc. Inf. Sci., 51(2):202-205.
- Finn and Kushmerick, 2006
Finn, A. and Kushmerick, N. (2006).
Learning to classify documents according to genre.
Journal of the American Society for Information Science and
Technology, 57(11).
- Gondek and Hofmann, 2005
Gondek, D. and Hofmann, T. (2005).
Non-redundant clustering with conditional ensembles.
In Proc. of ACM SIGKDD conference on Knowledge discovery in data
mining, pages 70-77.
- Heilman et al., 2008
Heilman, M., Collins-Thompson, K., and Eskenazi, M. (2008).
An analysis of statistical models and features for reading difficulty
prediction.
In Proc. of the Third Workshop on Innovative Use of NLP for
Building Educational Applications, pages 71-79, Columbus, Ohio. ACL.
- Jensen and Shen, 2004
Jensen, R. and Shen, Q. (2004).
Semantics-preserving dimensionality reduction: Rough and
fuzzy-rough-based approaches.
IEEE Transactions on Knowledge and Data Engineering,
16(12):1457-1471.
- Kessler et al., 1997
Kessler, B., Nunberg, G., and Schütze, H. (1997).
Automatic detection of text genre.
In Proceedings of the 35th ACL/8th EACL, pages 32-38.
- Liu et al., 2006
Liu, R., Zhou, J., and Liu, M. (2006).
Graph-based semi-supervised learning algorithms for web page
classification.
In Proceedings of the Sixth International Conference on
Intelligent Systems Design and Applications (ISDA'06), pages 856-860.
- Markert and Nissim, 2005
Markert, K. and Nissim, M. (2005).
Comparing knowledge sources for nominal anaphora resolution.
Computational Linguistics, 31(3):367-401.
- McCarthy et al., 2004
McCarthy, D., Koeling, R., Weeds, J., and Carroll, J. (2004).
Finding predominant word senses in untagged text.
In Proc. 42nd Meeting of the Association for Computational
Linguistics (ACL'04), pages 279-286, Barcelona.
- Meyer zu Eissen and Stein, 2004
Meyer zu Eissen, S. and Stein, B. (2004).
Genre classification of web pages.
In Proceedings of the 27th German Conference on Artificial
Intelligence, Ulm, Germany.
- Rehm, 2002
Rehm, G. (2002).
Towards automatic web genre identification - a corpus-based approach
in the domain of academia by example of the academic's personal homepage.
In Proc. of the Hawaii Internat. Conf. on System Sciences.
- Santini, 2007
Santini, M. (2007).
Automatic Identification of Genre in Web Pages.
PhD thesis, University of Brighton.
- Sharoff, 2006
Sharoff, S. (2006).
Creating general-purpose corpora using automated search engine
queries.
In Baroni, M. and Bernardini, S., editors, WaCky! Working papers
on the Web as Corpus. Gedit, Bologna.
http://wackybook.sslmit.unibo.it.
- Sharoff, 2007
Sharoff, S. (2007).
Classifying web corpora into domain and genre using automatic feature
identification.
In Proc. of Web as Corpus Workshop, Louvain-la-Neuve.
- Yang et al., 2002
Yang, Y., Slattery, S., and Ghani, R. (2002).
A study of approaches to hypertext categorization.
Journal of Intelligent Information Systems, 18(2-3):219-241.
Serge Sharoff
2009-05-08