Editors' Introduction



Learning theory is a research area devoted to the study, design, and analysis of computer programs that are able to learn from past experience. In recent years the idea of a learning system has independently emerged in a variety of fields: pattern recognition, theory of repeated games, machine learning, universal prediction and data compression, inductive inference, adaptive optimal control, computational psychology, and others. This is not surprising: on the one hand, the notion of adaptivity is an extremely attractive and powerful tool; on the other hand, the abstract phenomenon of learning is so complex that it is unlikely that a single theory will ever be able to explain it in a fully satisfactory way. Despite their apparent diversity, these learning models exhibit deep connections whose study is an active research area in learning theory.

Another important aspect is the increasingly tight interaction between the theoretical analysis and the practical application of learning. The sharpening of the theoretical tools for designing and analyzing learning techniques, the large amounts of real-world data freely available in digital form, and the growing demand for adaptive techniques in e-commerce, computational biology, and other applications have stimulated the development of new learning algorithms, some of which are successfully used in commercial and scientific applications.

The papers collected in this volume offer a broad view of current research in the field, including studies on the most important learning models (inductive inference, statistical learning, large-margin learning techniques, learning with queries, reinforcement learning). The scope of this research is not limited to theory. It also includes algorithm design and analysis, experimental studies of learning heuristics, and applications of learning to engineering problems. Moreover, there is an increasing need to learn from the semi-structured data typically available on the World Wide Web.


The invited lecture for ALT 2002 and DS 2002 by Ian H. Witten presents a solution to the problem of automatically extracting metadata for text documents stored in digital libraries. Given that such libraries are usually huge and rapidly expanding, assigning meaningful metadata to the stored documents is crucial for the usability of digital libraries. In his paper, Witten describes a learning paradigm that builds a model from marked-up training text; this model is then applied to insert markup into plain text. As a result, a plausible hierarchical structure of metadata can be created automatically. Remarkably, his algorithm runs in time linear in the size of the input.

In his invited lecture, Susumu Hayashi develops a mathematics that is based on learning. Traditionally, learning theory is mainly concerned with the learnability of mathematical and logical objects from a finite amount of data; the main goal is thus a better understanding of learning. In contrast, Susumu Hayashi proposes a mathematics that is founded on learning-theoretic notions. This research originated from, and is ultimately aimed at, the testing and debugging of formal proofs. Interestingly, when these investigations started, learning theory was not involved at all, whereas at the current stage it has become indispensable.

The next invited lecture, given by John Shawe-Taylor and co-authored by Chris Williams, Nello Cristianini, and Jaz Kandola, studies the eigenspectrum of the Gram matrix and its relationship to the eigenspectrum of the underlying operator. It is shown that the difference between the two spectra can be bounded. Moreover, a performance bound on kernel principal component analysis is provided. These results are fundamental for a number of learning algorithms that project data into a low-dimensional subspace intended to capture most of the variance in the data. Experiments on a real-world data set confirm the theoretical predictions for small projection dimensions.
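
As a purely illustrative aside, kernel principal component analysis can be computed directly from an n x n Gram matrix K; the Python/numpy sketch below shows this standard computation under the usual assumptions and is not taken from the paper itself.

    import numpy as np

    def kernel_pca(K, k):
        # Minimal sketch: project the training points onto the top-k kernel
        # principal components, given only the n x n Gram matrix K.
        n = K.shape[0]
        one = np.full((n, n), 1.0 / n)
        Kc = K - one @ K - K @ one + one @ K @ one   # center the data in feature space
        vals, vecs = np.linalg.eigh(Kc)              # eigenvalues in ascending order
        idx = np.argsort(vals)[::-1][:k]             # keep the k largest
        vals, vecs = vals[idx], vecs[:, idx]
        alphas = vecs / np.sqrt(np.maximum(vals, 1e-12))  # normalize the components
        return Kc @ alphas                           # coordinates of the training points

The results discussed above relate the spectrum of this empirical Gram matrix to that of the underlying operator, and thereby bound the performance of such projections.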

Data mining with graphical models is presented in the invited lecture for DS 2002 by Rudolf Kruse (co-authored by Christian Borgelt). Graphical models are an important technique for classifier construction and dependency analysis. The paper discusses the main ideas of learning graphical models from datasets of sample cases with an emphasis on probabilistic networks.

The invited lecture for DS 2002 by Gerhard Widmer deals with the complex phenomenon of expressive music performance. It is a big challenge to capture this phenomenon by using machine learning techniques and automated discovery methods. The lecture reports the results achieved and identifies a number of important research problems.


Next, we turn our attention to the regular contributions contained in this volume. The first group of papers is concerned with inductive inference. This line of research was pioneered by Gold, who introduced his model of learning in the limit [3]. In this model the learner receives a stream of examples of a certain target function and is required to output a sequence of hypotheses converging to this target. The paper by Jain, Menzel, and Stephan investigates whether target classes that are learnable in the limit contain non-trivial subclasses that can be learned in a harder learning model. A related important question, also investigated in this paper, is whether for such target classes there are learners for the full class that are able to learn the targets in the subclass under the harder model as well.
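
For readers unfamiliar with the model, Gold's criterion can be stated as follows (this is the standard textbook formulation, recalled here only for orientation): a learner M identifies a recursive function f in the limit if, when fed the graph of f, its sequence of conjectures eventually stabilizes on a correct program,

    \exists n_0 \ \forall n \ge n_0 : \quad M\bigl(f(0), \ldots, f(n)\bigr) = M\bigl(f(0), \ldots, f(n_0)\bigr), \quad \text{and this hypothesis computes } f .

A class is learnable in the limit if a single learner achieves this for every function in the class.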

The paper by Grieser develops the notion of reflective inductive inference. Reflective learners are able to detect that the sequence of examples contains inconsistent information, although occasionally they may make detection mistakes. The results determine how learners obeying different notions of reflection behave in different learning models. It turns out that for some learning models changes in the notion of reflection do not affect the learning power, whereas in others the learning power is greatly affected by these changes.

The work by Martin, Sharma, and Stephan takes a step towards a unified view of logic, learning, and topology. The outcome is a notion of parameterized logic, based on topology, that makes use of learning-theoretic notions for its interpretation and application. The framework also provides a nice interpolation between applications based on induction and those based on deduction. Finally, the paper by Harizanov and Stephan establishes a number of fundamental results relating the learnability properties and the algebraic properties of recursive vector spaces. The main result states that for identification in the limit from positive examples only, learnability can be characterized by algebraic properties, whereas for learning from positive and negative examples this is no longer true.


The second group of papers is concerned with the computationally efficient statistical learning of Boolean functions, a field initiated in theoretical computer science by Valiant, who introduced the PAC learning model [4]. In this model examples of the target Boolean function are drawn from an unknown probability distribution, and the learner is required to approximate the target function to within a given accuracy. The paper by Amano and Maruoka deals with the important class of monotone Boolean functions. They propose an algorithm for weak PAC learning under the uniform distribution and obtain a lower bound on the accuracy of the weak hypotheses it generates.
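
In its standard form (recalled here only for orientation), PAC learnability of a class C requires an algorithm that, for every target f in C, every distribution D, and all accuracy and confidence parameters \epsilon, \delta > 0, outputs with probability at least 1 - \delta a hypothesis h satisfying

    \Pr_{x \sim D}\bigl[\, h(x) \ne f(x) \,\bigr] \le \epsilon ,

using a number of random examples and an amount of time polynomial in 1/\epsilon, 1/\delta, and the size of the target. Weak learning, as studied by Amano and Maruoka, only asks for accuracy slightly better than random guessing.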

The paper by Servedio studies the PAC learnability of another important class of Boolean functions: those representable by constant-depth, polynomial-size circuits of unbounded fan-in with modular counting gates (the class ACC). A first result relates the efficient learnability of ACC to that of a different class: the embedded midbit functions. Several positive and negative results on the efficient learnability of this latter class are then shown for the PAC model and some of its restrictions, providing important insights into the learnability of ACC.

The last paper in this group is by Bshouty and Burroughs. Its main theme is the proof of impossibility results for the efficient learning of several subclasses of Boolean functions in an extension of PAC learning called co-agnostic learning. The key proof technique is the construction of reductions between learning problems and well-studied approximability problems for which hardness results are known.


The third group of papers applies learning to the problem of information extraction. Information extraction is a rapidly growing applied research area whose goal is the design of systems able to extract meaningful information from structured, semi-structured, and unstructured data sources. Important subareas of information extraction are database mining, web mining, and text analysis.

The paper by Grieser, Jantke, and Lange applies theoretical results from inductive inference to wrapper induction, i.e., the selection of interesting fragments of web pages (originating from a given source) on the basis of a few learning examples. The paper describes the theoretical framework together with a working system called LExIKON.

Suzuki, Shoudai, Uchida, and Miyahara consider algorithms for the efficient learning of ordered tree structures. This rich class of structures is well suited for modeling the parse trees of semi-structured data such as HTML or XML documents.


A further set of papers is concerned with a model where the learning algorithm is allowed to ask a teacher certain queries. The learning performance is measured by the number and type of queries needed by the learner to identify the target concept. This model of learning with queries was introduced by Angluin [1].
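
As a toy illustration of the model (not an algorithm from the papers discussed below), the following Python sketch learns a monotone conjunction over n Boolean variables using equivalence queries only; the teacher is assumed to answer an equivalence query either with None (the hypothesis is correct) or with a counterexample.

    import itertools

    def learn_monotone_conjunction(n, equivalence_query):
        # Start from the most specific hypothesis: the conjunction of all variables.
        hypothesis = set(range(n))
        while True:
            x = equivalence_query(hypothesis)   # None, or a counterexample (tuple of n bits)
            if x is None:
                return hypothesis
            # The hypothesis is always at least as specific as the target, so every
            # counterexample is a positive example; variables that are 0 in it
            # cannot occur in the target conjunction and are removed.
            hypothesis = {i for i in hypothesis if x[i] == 1}

    # Simulated teacher for the target conjunction x0 AND x2 over n = 4 variables.
    target = {0, 2}
    def eq(h):
        for x in itertools.product((0, 1), repeat=4):
            if all(x[i] for i in h) != all(x[i] for i in target):
                return x
        return None

    print(learn_monotone_conjunction(4, eq))    # prints {0, 2}

Each counterexample removes at least one variable from the hypothesis, so the target is identified exactly after at most n equivalence queries.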

Two papers by Köbler and Lindner present results that characterize the computational complexity of query learning for concept classes satisfying certain structural requirements based on the notion of general dimension. The main focus of the first paper is to understand when learning with few queries (of a given type) implies learning in a reasonable amount of time. The second paper uses the general dimension to characterize the complexity of learning with queries whose responses are random variables, and of learning when the goal is relaxed to approximate identification of the target concept.

The last paper in this group, by Kalnishkan and Vyugin, investigates the notion of predictive complexity. This quantity extends the Kolmogorov complexity of individual sequences and may be viewed as a measure of the learnability of a given data sequence, abstracted from any specific learning strategy. Predictive complexity is parameterized by the loss function used to measure the success of learning. The main result of the paper characterizes the set of loss functions for which predictive complexity exists.


The next group contains five papers belonging to the area of statistical learning. This is a very active research field that started with the seminal work of Vapnik and others on the theory of pattern recognition in the seventies [5], [6]. Statistical learning may be viewed as an extension of PAC learning in which, however, computational complexity issues are disregarded.

The work by Schmitt applies ideas from Descartes' rule of signs to derive upper and lower bounds on the Vapnik-Chervonenkis dimension and the pseudodimension of radial basis function networks (a popular family of neural networks) with one input variable.

In the next paper, Vovk studies region predictors. These are on-line learning algorithms for pattern classification parameterized by a confidence value. Whereas usual pattern classification algorithms output a single label as their prediction, region predictors output a subset of the possible labels, and the size of this subset may increase as the confidence parameter approaches 1.

The paper by Dasgupta, Pavlov, and Singer investigates provably correct and efficient algorithms for reconstructing images. They define a noisy sampling model for images consisting only of lines in the plane and present an algorithm that reconstructs these lines with high probability.

A method for transforming any multi-class, multi-label, or ranking learning problem into a binary classification problem with linear hypotheses is proposed in the paper by Har-Peled, Roth, and Zimak.

The last paper in this group, by Braess, Forster, Sauer, and Simon, has an information-theoretic flavor: it analyzes the problem of learning a probability distribution over a finite domain of fixed size using the Kullback-Leibler divergence as cost function.
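
For reference, the cost function used there is the Kullback-Leibler divergence between the true distribution p and the learned distribution q over a domain of N elements,

    D(p \,\|\, q) \;=\; \sum_{i=1}^{N} p_i \log \frac{p_i}{q_i} ,

which is nonnegative and equals zero exactly when p = q.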


The papers in the next group are concerned with inductive logic programming. In this framework, the goal is to learn concepts using logical formulae (propositional or first-order) as hypotheses. In many cases this is a difficult problem, but the payoff is that logical formulae provide solutions that are generally more understandable than those provided by other hypothesis classes (such as neural networks).

In the first paper, Fürnkranz considers the problem of rule learning using a covering approach; i.e., finding a small set of simple rules that explains (covers) all the positive examples and none of the negative examples. In particular, his work points out a problem in covering strategies that proceed bottom-up (i.e., by generalizing a rule that initially covers a single example) and proposes some techniques to alleviate this problem.
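
The covering scheme itself is easy to state; the following Python sketch shows the generic separate-and-conquer loop, where the subroutine learn_one_rule (whether it grows rules top-down or generalizes them bottom-up, which is where the issues analyzed by Fürnkranz arise) is deliberately left abstract and is our own illustrative placeholder.

    def covering(positives, negatives, learn_one_rule):
        # Generic covering loop: repeatedly learn a rule that covers some of the
        # remaining positive examples and none of the negatives, then remove the
        # positives it covers.
        rules = []
        remaining = list(positives)
        while remaining:
            rule = learn_one_rule(remaining, negatives)   # assumed: returns a predicate
            rules.append(rule)
            remaining = [x for x in remaining if not rule(x)]
        return rules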

In the second and last paper of this group, Fronhöfer and Yamamoto study the notion of relevant logic. A relevant logic is a logic where a formula may be derived as a conclusion only if it is deductively related to all the premises. The main idea of the paper is to use relevance as a notion of appropriateness for hypotheses in inductive logic.


Large-margin learning algorithms, such as support vector machines, boosting, and their variants, build on results from statistical learning theory to generate linear hypotheses with good generalization abilities. Algorithms of this kind are studied in the next group of papers.

The first paper is by Gavinsky, who proposes a new boosting method. Boosting [2] is a general technique for obtaining a highly accurate hypothesis by combining hypotheses generated on different reweightings of the training set. Good boosting algorithms are adaptive, in that they work without needing to know an a priori lower bound on the accuracy of the individual hypotheses they combine. The price of adaptivity is a lack of robustness: traditional boosting tends to perform poorly on noisy data. The new boosting method is the first able to provide adaptivity together with a certain degree of noise robustness.
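
To make the reweighting idea concrete, here is a minimal sketch of AdaBoost, the adaptive boosting algorithm of [2], using decision stumps as weak hypotheses; the stump weak learner is our own illustrative choice, and the sketch is not Gavinsky's new method.

    import numpy as np

    def adaboost(X, y, rounds=50):
        # X: (n, d) array of real features; y: array of labels in {-1, +1}.
        n, d = X.shape
        w = np.full(n, 1.0 / n)                  # example weights, initially uniform
        stumps, alphas = [], []
        for _ in range(rounds):
            best = None                          # weak learner: best threshold stump
            for j in range(d):
                for thr in np.unique(X[:, j]):
                    for sign in (1, -1):
                        pred = sign * np.where(X[:, j] >= thr, 1, -1)
                        err = w[pred != y].sum()
                        if best is None or err < best[0]:
                            best = (err, j, thr, sign)
            err, j, thr, sign = best
            if err >= 0.5:                       # no better than random guessing: stop
                break
            alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
            pred = sign * np.where(X[:, j] >= thr, 1, -1)
            w *= np.exp(-alpha * y * pred)       # increase the weights of misclassified examples
            w /= w.sum()
            stumps.append((j, thr, sign))
            alphas.append(alpha)

        def predict(Xnew):
            votes = sum(a * s * np.where(Xnew[:, j] >= t, 1, -1)
                        for (j, t, s), a in zip(stumps, alphas))
            return np.sign(votes)
        return predict

The final classifier is a weighted vote of the weak hypotheses; adaptivity comes from the fact that each weight alpha depends only on the observed error of the corresponding round, with no a priori accuracy bound required.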

The second paper in this group, by Kivinen, Smola, and Williamson, investigates two incremental variants of the support vector machine algorithm. The main result shows that these variants are robust with respect to drift of the target. The performance of both algorithms degrades in proportion to the amount of drift in the considered time interval and in proportion to the amount of nonlinearity in the data sequence.

The third paper, by Forster and Simon, is concerned with some fundamental questions about the use of linear-threshold hypotheses. Statistical learning theory ensures that a good generalization error is achieved whenever the sample can be embedded in a space where some linear hypothesis separates positive examples from negative ones with a large margin. The paper proves bounds relating the spectrum of a matrix derived from the sample to the margin that can be obtained via the embedding technique.


The last group of papers is devoted to heuristics and to applications of learning to concrete problems. Lindgren and Boström propose an alternative approach for rule-based learning techniques in situations where many contradictory rules may be applicable to a single example. In such cases, one may weight each rule by the number of examples it covers in the sample or, alternatively, assume rule independence in a naive Bayes fashion. The solution proposed in this paper improves on naive Bayes by taking into account the examples in the intersecting regions of the overlapping rules. The paper also provides an experimental validation of the proposed approach.

The last paper in this group is by Coulom and applies neural networks to the solution of a motor control problem. The control problem is modeled using reinforcement learning, a very general framework for formalizing the learning of plans and complex strategies. The proposed approach is based on the temporal difference learning algorithm coupled with feedforward neural networks. Empirical convergence of the resulting algorithm is shown via the simulation of a mechanical system consisting of five articulated segments.
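
For concreteness, the core temporal difference update, TD(0), adjusts the estimated value V of the current state towards the observed reward plus the discounted value estimate of the successor state,

    V(s_t) \;\leftarrow\; V(s_t) + \alpha \bigl[ r_{t+1} + \gamma\, V(s_{t+1}) - V(s_t) \bigr] ,

where \alpha is a learning rate and \gamma a discount factor; in the paper this value function is represented by a feedforward neural network.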


References

  1. D. Angluin.
    Queries and concept learning.
    Machine Learning, 2(4):319--342, 1988.
  2. Y. Freund and R.E. Schapire.
    A decision-theoretic generalization of on-line learning and an application to boosting.
    Journal of Computer and System Sciences, 55(1):119--139, 1997.
  3. E.M. Gold.
    Language identification in the limit.
    Information and Control, 10:447--474, 1967.
  4. L. Valiant.
    A theory of the learnable.
    Communications of the ACM, 27(11):1134--1142, 1984.
  5. V.N. Vapnik.
    Estimation of Dependences Based on Empirical Data.
    Springer, 1982.
  6. V.N. Vapnik.
    The Nature of Statistical Learning Theory.
    2nd edition, Springer, 1999.

September 2002 Nicolò Cesa-Bianchi
Masayuki Numao
Rüdiger Reischuk

