### Learning structure from sequences, with applications in a digital library (invited lecture for ALT 2002 and DS 2002)

Author: Ian H. Witten

Affiliation: Department of Computer Science, University of Waikato, Hamilton, New Zealand

Abstract. The services that digital libraries provide to users can be greatly enhanced by automatically gleaning certain kinds of information from the full text of the documents they contain. This talk will review recent work that applies novel techniques of machine learning (broadly interpreted) to extract information from plain text. We describe three areas of research: hierarchical phrase browsing, including efficient methods for inferring a phrase hierarchy from a large corpus of text; text mining using adaptive compression techniques, giving a new approach to word segmentation, generic entity extraction, and acronym extraction; and keyphrase extraction and its application in a digital library.

Digital libraries are focused collections of digital objects (text, audio, video etc.---though we focus here on text) along with methods for access and retrieval of the information in the collection, and methods for selecting, organizing, and maintaining it. The gateway to the contents of a bricks-and-mortar library is {\em metadata}---the bibliographic information in the library catalog. Manual cataloging takes one to two hours of expert time per document, well beyond the resources of most digital libraries. All the techniques we discuss here are ways of automatically extracting metadata, broadly defined, from the full text of a document collection.

A hierarchy of the phrases that recur in the text allows readers to browse comfortably through a large digital library. A plausible, easily-understood, hierarchical structure can be created automatically from full text on a purely lexical basis. We describe a simple algorithm for inferring a structural hierarchy of phrases, that operates in time linear in the size of the input---an important consideration for multi-gigabyte collections. A slight variant of the algorithm is universal'' in the sense that it eventually discovers all structure in a finite-state ergodic source. We demonstrate its performance on different kinds of information---letters, words, syntactic tags, graphical primitives. Not only is this technique practically useful for browsing large information collections, but different formulations lead to alternative methods that raise interesting open questions, both theoretical and practical.

Adaptive compression is a powerful tool for eliciting structure from sequences. We describe a novel learning paradigm that creates a model from marked-up training text and applies it to insert markup into plain text. Viterbi-style search is used to correct'' the text by inserting markup in a way that maximizes compression. This strategy yields excellent results on the word segmentation problem---an important practical problem in digital libraries of Chinese text. It has also proved successful for generic entity extraction. Used in a simpler way, a compression-based evaluation metric works well for recognizing acronyms, and users of our digital collections can now browse automatically-extracted lists of acronyms and their definitions.

The third area is keyphrase extraction. Elementary machine learning techniques (supervised discretization, naive Bayes), with a simple two- or three-attribute set, perform well for both domain-independent and domain-dependent extraction of keyphrases. The results, evaluated in an experiment with human judges, are excellent. Interfaces that use keyphrases for browsing provide a new and absorbing environment for reading and writing within a digital library.