Learning structure from sequences, with applications in a digital library
(invited lecture for ALT 2002 and DS 2002)
Author: Ian H. Witten
Affiliation: Department of Computer Science, University of Waikato, Hamilton, New Zealand
Abstract.
The services that digital libraries provide to users can be greatly enhanced by
automatically gleaning certain kinds of information from the full text of the
documents they contain. This talk will review recent work that applies novel
techniques of machine learning (broadly interpreted) to extract information
from plain text. We describe three areas of research: hierarchical phrase
browsing, including efficient methods for inferring a phrase hierarchy from a
large corpus of text; text mining using adaptive compression techniques, giving
a new approach to word segmentation, generic entity extraction, and acronym
extraction; and keyphrase extraction and its application in a digital library.
Digital libraries are focused collections of digital objects (text, audio,
video, and so on, though here we focus on text) along with methods for access and
retrieval of the information in the collection, and methods for selecting,
organizing, and maintaining it. The gateway to the contents of a
bricks-and-mortar library is {\em metadata}---the bibliographic information in
the library catalog. Manual cataloging takes one to two hours of expert time
per document, well beyond the resources of most digital libraries. All the
techniques we discuss here are ways of automatically extracting metadata,
broadly defined, from the full text of a document collection.
A hierarchy of the phrases that recur in the text allows readers to browse
comfortably through a large digital library. A plausible, easily understood
hierarchical structure can be created automatically from full text on a purely
lexical basis. We describe a simple algorithm for inferring a structural
hierarchy of phrases that operates in time linear in the size of the
input---an important consideration for multi-gigabyte collections. A slight
variant of the algorithm is ``universal'' in the sense that it eventually
discovers all structure in a finite-state ergodic source. We demonstrate its
performance on different kinds of information---letters, words, syntactic tags,
graphical primitives. Not only is this technique practically useful for
browsing large information collections, but different formulations lead to
alternative methods that raise interesting open questions, both theoretical and
practical.
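To make the inference step concrete, the following Python sketch builds a
phrase hierarchy by repeatedly replacing the most frequent recurring digram
with a fresh nonterminal. This is a simplification: a linear-time algorithm in
the style of SEQUITUR maintains digram uniqueness incrementally, whereas this
sketch recomputes digram counts on each pass and so is quadratic; it
nonetheless produces the same kind of hierarchical grammar. Function and rule
names are illustrative only.
\begin{verbatim}
from collections import Counter

def infer_hierarchy(tokens):
    # Repeatedly replace the most frequent recurring digram with a
    # fresh nonterminal; each rule is one level of the hierarchy.
    seq, rules = list(tokens), {}
    while True:
        digrams = Counter(zip(seq, seq[1:]))
        if not digrams:
            break
        pair, count = digrams.most_common(1)[0]
        if count < 2:            # no digram recurs: grammar is final
            break
        nt = "R%d" % len(rules)
        rules[nt] = pair
        out, i = [], 0
        while i < len(seq):      # left-to-right, non-overlapping rewrite
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nt)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

def expand(sym, rules):
    # Recover the phrase that a nonterminal stands for.
    if sym not in rules:
        return [sym]
    left, right = rules[sym]
    return expand(left, rules) + expand(right, rules)
\end{verbatim}
For example, infer_hierarchy(list("abcabdabcabd")) reduces the input to a
single root symbol R4 whose expansion via expand("R4", rules) reconstructs the
original sequence, with the rule R0 = (a, b) forming the lowest level of the
hierarchy.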
Adaptive compression is a powerful tool for eliciting structure from sequences.
We describe a novel learning paradigm that creates a model from marked-up
training text and applies it to insert markup into plain text. Viterbi-style
search is used to ``correct'' the text by inserting markup in a way that
maximizes compression. This strategy yields excellent results on word
segmentation---an important practical problem in digital libraries of
Chinese text. It has also proved successful for generic entity extraction.
Used in a simpler way, a compression-based evaluation metric works well for
recognizing acronyms, and users of our digital collections can now browse
automatically extracted lists of acronyms and their definitions.
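As a concrete illustration of markup-by-compression, here is a minimal Python
sketch of the word segmentation case. The models in this line of work are
adaptive PPM character models; the sketch substitutes a fixed order-2 character
model with add-one smoothing, trained on space-segmented text, together with a
Viterbi-style dynamic program that inserts spaces exactly where they reduce
the total code length. All names are illustrative, and the model is a stand-in
rather than the method itself.
\begin{verbatim}
import math
from collections import defaultdict, Counter

class CharModel:
    # Order-2 character model with add-one smoothing, trained on
    # space-segmented text; a stand-in for the adaptive PPM models.
    def __init__(self, training_text):
        self.counts = defaultdict(Counter)
        self.alphabet = set(training_text) | {" "}
        padded = "  " + training_text
        for i in range(2, len(padded)):
            self.counts[padded[i - 2:i]][padded[i]] += 1

    def bits(self, context, ch):
        ctx = self.counts[context]
        p = (ctx[ch] + 1.0) / (sum(ctx.values()) + len(self.alphabet))
        return -math.log2(p)

def segment(text, model):
    # Viterbi-style search over space insertions: the state is the
    # last two characters emitted, the score is total code length.
    states = {"  ": (0.0, "")}   # context -> (bits so far, output)
    for ch in text:
        nxt = {}
        for ctx, (bits, out) in states.items():
            for space in (False, True):
                b, o, c = bits, out, ctx
                if space:        # try emitting a space before ch
                    b += model.bits(c, " ")
                    o, c = o + " ", (c + " ")[-2:]
                b += model.bits(c, ch)
                o, c = o + ch, (c + ch)[-2:]
                if c not in nxt or b < nxt[c][0]:
                    nxt[c] = (b, o)
        states = nxt
    return min(states.values())[1].strip()
\end{verbatim}
On a toy scale, one would train CharModel on a little segmented text and call
segment("tobeornottobe", model); with enough training data, the space
insertions that compress best tend to coincide with the true word boundaries.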
The third area is keyphrase extraction. Elementary machine learning techniques
(supervised discretization, naive Bayes), with a simple two- or three-attribute
set, perform well for both domain-independent and domain-dependent extraction
of keyphrases. The results, evaluated in an experiment with human judges, are
excellent. Interfaces that use keyphrases for browsing provide a new and
absorbing environment for reading and writing within a digital library.
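The attribute set is not spelled out above, but KEA, the extractor developed
by the same group, uses TF$\times$IDF and the relative position of a phrase's
first occurrence, so the sketch below adopts those two attributes. Here
doc_freq, num_docs, cuts, p_key, and cond are hypothetical inputs standing in
for corpus statistics and a previously trained naive Bayes model over
discretized attributes.
\begin{verbatim}
import math
import re
from bisect import bisect
from collections import Counter

def phrase_features(text, doc_freq, num_docs, max_len=3):
    # Two attributes per candidate phrase (all 1- to 3-word
    # sequences): TF x IDF, and relative first-occurrence position.
    words = re.findall(r"[a-z]+", text.lower())
    total = max(1, len(words))
    tf, first = Counter(), {}
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            p = " ".join(words[i:i + n])
            tf[p] += 1
            first.setdefault(p, i / total)
    return {p: (c / total * math.log2(num_docs / (1.0 + doc_freq.get(p, 0))),
                first[p])
            for p, c in tf.items()}

def rank_keyphrases(feats, cuts, p_key, cond, top=5):
    # Naive Bayes over discretized attributes: cuts[a] holds the bin
    # boundaries for attribute a, and cond[a][cls][b] is P(bin b | cls)
    # for cls in {"key", "not"}; both come from supervised
    # discretization on training data, and all probabilities are
    # assumed non-zero (e.g. Laplace-smoothed).
    def log_odds(values):
        s = math.log(p_key / (1.0 - p_key))
        for a, v in enumerate(values):
            b = bisect(cuts[a], v)
            s += math.log(cond[a]["key"][b] / cond[a]["not"][b])
        return s
    return sorted(feats, key=lambda p: log_odds(feats[p]), reverse=True)[:top]
\end{verbatim}
In use, phrase_features supplies the attribute values for every candidate
phrase in a document, and rank_keyphrases returns the candidates whose naive
Bayes log-odds of being a keyphrase are highest.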