“Learning” is a complex phenomenon that is studied in different
scientific disciplines. A computer program with the ability to
“learn” contains mechanisms for gathering and evaluating
information and, consequently, for improving its performance.
Algorithmic Learning Theory provides a mathematical foundation
for the study of learning programs. It is concerned with the design
and analysis of learning algorithms. The analysis proceeds in a formal
model so as to provide measures for the performance of a learning
algorithm or for the inherent hardness of a given learning problem.
The variety of applications for algorithms that learn is reflected
in the variety of formal learning models. For instance, we can
distinguish between a passive mode of “learning from examples”
and active modes of learning where the algorithm has more control
over the information that is gathered. As for learning from examples,
a further decision is whether or not we impose statistical assumptions
on the sequence of examples. Furthermore, we find different success
criteria in different models (like “approximate learning” versus
“exact learning”).
The papers collected in this volume offer a broad view on the
current research in the field including studies on several learning
models (such as kernel-based learning, pac-learning, query learning,
inductive inference, learning from expert advice, online learning
and teaching). It goes without saying that these models are getting
more and more refined as a response to new challenges in computer
science (like taking advantage of the large amount of real-world data
that become available in digital form, or like serving the growing
need for adaptive techniques in the management and control of
complex systems).
The volume is structured as follows. It first presents the invited
lectures and then the regular contributions. The latter are grouped
according to thematic categories. In the remainder of the introduction,
we provide the reader with a rough “road map”.
The invited lecture for ALT 2005 and DS 2005 by
Gary Bradshaw compares
Invention and Artificial Intelligence and points out some analogies.
Starting from two case studies, the invention of the airplane (1799–1909)
and the invention of a model rocket by a group of high school students
in rural West Virginia in the late 1950s, it is argued that general
principles of invention may be applied to expedite the development of
new technologies.
The invited lecture for DS 2005 by
Neil R. Smalheiser
reports on the current status of the Arrowsmith project. The roots of this
project trace back to the 1980s, when Don Swanson proposed the concept of
“undiscovered public knowledge” and published several examples in
which two disparate literatures held complementary pieces of knowledge
that, when brought together, made compelling and testable predictions
about potential therapies for human disorders. In continuation of this
work in the 1990s, Smalheiser and Swanson created a computer-assisted
search strategy (“Arrowsmith”). The lecture reviews the development
of this project.
Ross D. King,
presenting the second invited lecture for DS 2005, pursues
the question whether it is possible to automate the scientific process.
This question is of increasing importance because, in many scientific
areas, data are being generated much faster than they can be effectively
analyzed by humans. In his lecture, King describes a physically implemented
robotic system that applies techniques from artificial intelligence to
carry out cycles of scientific experimentation. This is fleshed out by
describing the application of the system to complex research problems.
The invited lecture for ALT 2005 by
Vasant Honavar (co-authored by
Doina Caragea, Jun Zhang, Jie Bao, and Jyotishman Pathak) is concerned
with the problem of data-driven knowledge acquisition and decision-making.
Honavar describes the hurdles posed by the massive size, semantic
heterogeneity, autonomy, and distributed nature of the data repositories.
He introduces some of the algorithmic and statistical problems that
arise in such a setting and presents algorithms for learning classifiers
from distributed data that offer rigorous performance guarantees relative
to their centralized counterparts. The lecture furthermore presents
techniques that help to cope with the problem of semantic heterogeneity.
The second invited lecture for ALT 2005 by
Chih-Jen Lin (co-authored by
Pai-Hsuen Chen and Rong-En Fan) concerns the convex quadratic optimization
problem that has to be solved by Support Vector Machines (SVMs).
Since the underlying cost matrix is high-dimensional and dense, this
cannot be done by classical methods (such as interior-point methods).
Instead, a decomposition method is used. It proceeds iteratively and has,
in each iteration, to solve a small-dimensional subproblem. Lin focuses
on decomposition methods of the SMO-type and elaborates how the
implementation of the procedure for “Working Set Selection” affects
the speed of convergence to the optimal solution. In a sense, Lin's
lecture offers the opportunity of looking deeply inside a widely used
learning machine.
We now turn our attention to the regular contributions contained in this
volume.
During the last decade there has been a lot of interest in learning systems
that express the “similarity” between two “instances” as an inner product
of vectors in an appropriate feature space. The inner product operation is
often not carried out explicitly, but reduced to the evaluation of a so-called
kernel function which operates on instances of the original space. This offers
the opportunity to handle high-dimensional feature spaces in an efficient manner.
This strategy, introduced by Vapnik and co-workers in connection with the
so-called Support Vector Machine, is a theoretically well founded and very
powerful method that, in the years since its introduction, has already
outperformed many other learning systems in a wide variety of applications.
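The kernel idea sketched above can be made concrete with a toy example. This is our own minimal illustration (not taken from any paper in the volume), using the homogeneous polynomial kernel of degree 2 on 2-dimensional inputs: the kernel value equals an inner product in a 3-dimensional feature space, yet the feature vectors never have to be constructed explicitly.

```python
import numpy as np

def phi(x):
    # explicit feature map: phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly_kernel(x, z):
    # kernel evaluated directly on the original space: k(x, z) = (x . z)^2
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

# Both computations yield the same inner product in feature space.
assert np.isclose(np.dot(phi(x), phi(z)), poly_kernel(x, z))   # both equal 16
```

For a degree-d kernel on n-dimensional inputs the explicit feature space has dimension of order n^d, which is why evaluating the kernel directly is the efficient route.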
Gretton, Bousquet, Smola and Schölkopf continue a line of research where the
kernel-based approach is used as a tool for the detection of statistical
dependencies. More precisely, they provide a measure of statistical dependence
between random variables based on the Hilbert-Schmidt norm of a cross
covariance operator. Their method has some theoretically appealing
features while being experimentally competitive to existing methods.
Kowalczyk and Chapelle study the surprising phenomenon of “anti-learning”,
where data sets are represented in such a way that some well-known standard
learning techniques lead to hypotheses performing much worse than random
guessing. While there seem to exist “real-life” data sets which are of
that type, the authors consider artificial data whose analysis sheds some
light on this (at first glance) counterintuitive phenomenon. They identify
some abstract properties of a given positive definite kernel which inevitably
lead to anti-learning (for some standard learning techniques including
“large-margin classification” or “nearest neighbour”). They furthermore
explain which kernel-transformations convert “anti-learning” into “learning”
and which do not.
Bayesian and Statistical Models:
Causal networks are graphical representations for “causality” between
random variables. Like Bayesian networks and other graphical models, they
represent knowledge that is not as strict as formal logical statements but
is quite helpful for what is called “reasoning under uncertainty”.
Bayesian Inference leads to decisions that minimize a risk function.
Bayesian learning refers to the problem of inferring the unknown parameters
of a distribution (chosen from a known parameterized class of distributions).
Typically, the a priori distribution for the unknown parameters
gives support to a wide range of parameters, whereas the a posteriori
distribution is peaked around the true parameter values.
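The contrast between a broad prior and a peaked posterior can be seen in the simplest conjugate setting. The following is our own illustration (not from any paper in this volume): Bayesian learning of a coin's bias, where the uniform prior Beta(1, 1) turns into the posterior Beta(1 + heads, 1 + tails), which concentrates around the true parameter as the sample grows.

```python
import random

random.seed(0)
true_bias = 0.7
flips = [1 if random.random() < true_bias else 0 for _ in range(1000)]
heads = sum(flips)
tails = len(flips) - heads

a, b = 1 + heads, 1 + tails                       # posterior is Beta(a, b)
posterior_mean = a / (a + b)
posterior_var = a * b / ((a + b) ** 2 * (a + b + 1))

prior_var = 1 / 12                                # variance of the uniform prior
assert abs(posterior_mean - true_bias) < 0.1      # mean is near the true bias
assert posterior_var < prior_var                  # posterior is far more concentrated
```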
By means of statistical information we can determine causal networks
only up to Markov-equivalence. An equivalence class can be represented
by a chordal chain graph that contains directed and undirected edges.
Each undirected edge represents two mutually correlated variables where
we cannot know whether causality (if any) goes in one or the other
direction. He, Geng and Liang describe a hierarchical partitioning of
a class of Markov-equivalent causal networks (given by a chordal chain
graph) into finer and finer subclasses, up to the point where the
partition consists of singletons, i.e., of single causal networks. They prove that,
at each stage of the refinement process, an equivalence class can be
again represented as a chordal chain graph.
Variational Bayesian Learning results from Bayesian Learning by
introducing a simplifying assumption (in case there are hidden variables)
that makes the approach computationally more tractable. Empirically it
is known to have good generalization performance in many applications.
Watanabe and Watanabe provide some additional theoretical support by
proving lower and upper bounds on the stochastic complexity in the
Variational Bayesian learning of the mixture of exponential families.
As for classification tasks, the Bayes classifier chooses the label that
has the maximum a posteriori probability. In order to apply the formula
of Bayes, we have to know the a priori probability and (an approximation
of) the class-conditional densities. Thonangi and Pudi suggest a new
method for the approximation of the class-conditional densities: they use
a class-conditional probability model of the data (based on certain
conjunctions of binary features and the corresponding relative
frequencies) and determine the density function of maximum entropy
that satisfies this model.
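The maximum-a-posteriori rule described above can be sketched in a few lines. The numbers and the naive independence assumption below are our own illustration, not the maximum-entropy model of Thonangi and Pudi:

```python
# a priori class probabilities and class-conditional probabilities
# P(feature_i = 1 | class) for two binary features (hypothetical values)
priors = {"spam": 0.4, "ham": 0.6}
cond = {"spam": [0.8, 0.6], "ham": [0.1, 0.3]}

def bayes_classify(x):
    # choose the label with maximum a posteriori probability; the evidence
    # P(x) is a common factor and can be dropped from the argmax
    best, best_score = None, -1.0
    for label, prior in priors.items():
        likelihood = 1.0
        for p, bit in zip(cond[label], x):
            likelihood *= p if bit else 1 - p
        if prior * likelihood > best_score:
            best, best_score = label, prior * likelihood
    return best

print(bayes_classify([1, 1]))   # → spam
print(bayes_classify([0, 0]))   # → ham
```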
In the PAC-learning model, the learner receives as input training examples,
drawn at random according to an unknown distribution and labeled according
to an unknown target function ƒ, and returns a “hypothesis” that
(with high probability of success) is a close approximation of ƒ. While
the first papers on pac-learning focussed on binary classification problems
with the probability of misclassification as an underlying pseudo-metric,
there have been many extensions of the basic model since then.
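The quantitative flavour of the model can be conveyed by the classical sample-size bound for a consistent learner over a finite concept class (a textbook result, not one from this volume): with probability at least 1 − δ, any hypothesis consistent with m ≥ (ln|C| + ln(1/δ))/ε random examples has error at most ε.

```python
import math

def pac_sample_size(num_concepts, eps, delta):
    # smallest m satisfying m >= (ln|C| + ln(1/delta)) / eps
    return math.ceil((math.log(num_concepts) + math.log(1 / delta)) / eps)

# e.g. a concept class of size 2^20, accuracy 0.95, confidence 0.99:
print(pac_sample_size(2 ** 20, eps=0.05, delta=0.01))   # → 370
```

Note that the bound grows only logarithmically with the size of the class and with 1/δ, but linearly with 1/ε.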
Fakcharoenphol and Kijsirikul revisit the well-studied problem of converting
base binary learners (coping with binary classification problems) into
multi-class learners. They reconsider various classical constructions,
including “one-versus-all” and the “(adaptive) directed acyclic graph
approach”, and come up with some new generalization error bounds.
The paper by Ryabko relates “PAC-learnability” and “computability” by
showing the following result: the number of examples needed by a recursive
PAC-learner to learn a computable function cannot be upper-bounded by any
computable function. The result is obtained by applying a negative result
on data compression based on Kolmogorov complexity.
The papers by Palmer and Goldberg and by Guttman, Vishwanathan and
Williamson both deal with so-called Probabilistic Deterministic
Finite State Automata (PDFA). A PDFA, in contrast to a DFA (its deterministic
counterpart), performs random state transitions and thus represents a
probability distribution over strings. In a recent paper from 2004, it was
shown by Clark and Thollard that the class of PDFAs is polynomially
pac-learnable where the polynomial (which bounds sample size and
run-time) depends on the size of the target PDFA, a measure for the
“distinguishability” of states and the expected length of strings
generated from any state. Their pseudo-metric (measuring the distance
between target distribution and the hypothesis) is the KL-divergence.
By passing from KL-divergence to the variation distance (still useful
for pattern classification tasks), Palmer and Goldberg are able to
considerably simplify the proof for PAC-learnability and to remove
the dependence on the expected string length. Guttman, Vishwanathan
and Williamson extend the result by Clark and Thollard into another
direction by generalizing the notion of “distinguishability”.
In the query-learning model, the learner is allowed to ask certain
queries to an oracle. The learning performance is measured by the
number and type of queries needed to exactly (sometimes approximately)
identify an unknown target concept ƒ (where “concept” means
“binary-valued function”). Among the most popular query
types are “membership queries (MQs)” (asking for the class label of an
instance) and “equivalence queries (EQs)” (asking for a “counterexample”
to the current hypothesis). Statistical queries can be used to get a
statistical estimate for the probability of an event that involves
a labeled random example. This query type plays a special role because
an algorithm that learns by means of statistical queries can be converted
into a noise-tolerant pac-learning algorithm.
Bennet and Bshouty investigate the possibility of learning
“attribute-efficiently” (such that the time-bound grows only
sub-linearly with the number of irrelevant attributes) with “corrupted
oracles” (whose answers may occasionally be wrong or missing). Their
main result is that an attribute-efficient learner expecting a perfectly
reliable oracle of type MQ or EQ can be converted into an
attribute-efficient learner that copes with corrupted oracles. (This
extends a result from FOCS 2004, where attribute-efficiency had not been
considered.)
As shown by Köbler and Lindner in their paper presented at ALT 2002,
DNF-formulas can be learned by statistical queries under the uniform
distribution. The paper by Lindner in this volume improves on this result
by showing that the learning algorithm can cope with larger inaccuracies
of the empirical estimates provided by the oracle. A similar remark holds
for the related model of “learning by proper distance queries”.
Kato, Matsumoto and Miyahara discuss the learnability of certain logic
programs, called “Elementary Formal Systems (EFS)”, that represent
formal languages. These programs can, for instance, succinctly represent
the popular “regular pattern languages” whose learnability has been
well studied in the past. It is shown that EFS-definable languages can
be learned in polynomial time with membership and superset queries.
Furthermore, it is shown that various other combinations of query
types are insufficient for this purpose.
Jain, Lange and Zilles extend work by Lange and Zilles from COLT
2004 and ALT 2004. The latter papers revealed surprising relations
between Gold-style and query learning of indexable classes of
recursive languages. In their contribution to this volume, they
examine (arbitrary) classes of recursively enumerable languages
and analyze which of the relations still hold. It turns out that,
although some of the relations are lost, there is still a nice
hierarchy with several cross-connections between Gold-style and
query learning.
In Gold's model of “learning in the limit from positive data”,
the learner should identify an unknown recursive (or recursively
enumerable) target language L (where L belongs to a class C
that is known to the learner). To this end, it receives a “text”
(basically an infinite list of the words in L). It keeps track
of a (description of a) “hypothesis” that is updated after the
reception of the next word from the text. We say that C is
learned in the limit, if the learner is able to identify every
language L ∈ C after finitely many steps. Within this model,
one distinguishes between “explanatory learning” (learner converges
to one description of the target concept) and “behaviourally
correct learning” (learner converges to the target concept but
may, in principle, use arbitrarily many descriptions for it). Between
these two extremes, we find the possibility of converging to the target
and using (in the limit) at most k different descriptions. This
leads to what is called “vacillatory learning” (actually an infinite
hierarchy of models that is in-between explanatory and behaviourally
correct learning). Another notion of interest is “U-shaped” versus
“Non U-shaped” learning, where a learner exhibits U-shaped behaviour
if it passes from the correct hypothesis to an incorrect one, but
later returns to the correct one. This kind of learning behaviour is
also discussed in the cognitive psychology literature.
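The learning-in-the-limit protocol described above can be illustrated with a toy "identification by enumeration" learner. This is our own minimal example (the class C and the text are hypothetical): the class holds finitely many finite languages, and the learner always conjectures a consistent language of minimum cardinality, so on any text for a target L in C the conjectures converge to L.

```python
# three toy languages over words "a", "b", "c"
C = {"L1": {"a"}, "L2": {"a", "b"}, "L3": {"a", "b", "c"}}

def learner(seen):
    # conjecture a language consistent with the words seen so far,
    # preferring one of minimum cardinality
    consistent = [name for name in C if seen <= C[name]]
    return min(consistent, key=lambda name: len(C[name]))

text = ["a", "a", "b", "a", "b", "b"]       # a prefix of a text for L2 = {a, b}
seen, hypotheses = set(), []
for word in text:
    seen.add(word)
    hypotheses.append(learner(seen))

# The learner converges once the text has exhibited both words of L2.
assert hypotheses == ["L1", "L1", "L2", "L2", "L2", "L2"]
```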
Carlucci, Case, Jain and Stephan continue a line of research that
explores whether U-shaped learning behaviour is logically necessary
within the model of learning in the limit from positive data.
It was known that U-shaped learning behaviour is, indeed, necessary
for “behaviourally correct learning” but not for “explanatory
learning”. The authors refine this result in several ways. They show
that non U-shaped vacillatory learning collapses to explanatory learning.
Moreover, a U-shaped learner on the second level of the hierarchy can be
transformed into a non U-shaped behaviourally correct learner, whereas
a transformation of this kind is not in general possible if the U-shaped
learner is located on the third (or higher) level of the hierarchy.
Jain and Kinber consider the following variant of learning in the limit
from positive data: given a text for a union of n (pairwise disjoint)
languages, find in the limit m grammars such that each of them
generates some union of the n languages and each of the n languages
occurs in exactly one of these unions. The paper provides considerable
insight into this model.
The next group of papers is concerned with learning languages in the
limit from positive data. It is known, for instance, that bounded unions
of non-erasable pattern languages and erasable regular pattern languages
are learnable in this model whereas regular languages are not.
Furthermore, a learning strategy that selects a language from the
version space that is minimal with respect to inclusion ensures
identification in the limit provided that each language L in the
target class has a characteristic set (i.e., a finite subset S
with L as the unique minimal language from the target class that
contains S).
Ng and Shinohara introduce another notion of minimality that refers
to the number of elements in a language being shorter than a specified
length. They show that this notion of minimality and a suitably
adapted notion of a characteristic set fit the same purpose for learning
languages in the limit from positive data as the classical notions
(with a range of applicability that has not become smaller). Thus, the
selection of a language from the version space that is minimal in the
new sense is an interesting alternative to the selection of an
inclusion-minimal language.
Clark and Eyraud consider the class of substitutable context-free
languages (building on the notion of substitutability by Harris from
the 1950s). They show that every substitutable context-free language
has a characteristic set, which implies their learnability in the limit
from positive data.
Fernau identifies a non-trivial subclass of regular languages
(representable by special regular expressions called “simple looping
expressions” in the paper) and presents an algorithm that has a
polynomial update time and learns simple looping expressions in the
limit from positive data.
Learning and Logic:
The following group of papers again deals with the model of learnability
in the limit from positive data. The target classes under consideration
are Prolog programs and logical structures, respectively.
The paper by Rao continues the research project of identifying
fragments of Prolog that are learnable in the limit from positive
data. The new fragment studied in the paper (“recursion bounded”
Prolog programs) is incomparable with the previously identified
fragments and contains programs that neither belong to the class
of “linearly-moded programs” (introduced by Rao himself) nor to
the class of “safe programs” (introduced by Martin and Sharma).
Jain, Martin and Stephan discuss the problem of learning logical
structures in the limit from positive examples, where the positive
examples are logical formulas that are satisfied by the unknown target
structure. They prove learnability in this model except for a set of
logical structures of measure zero.
Learning from Expert Advice:
In the classical model of learning an unknown binary sequence
from expert advice, the learner predicts the sequence bitwise and
learning proceeds in rounds. In every round, it first receives
N expert predictions p1, …, pN ∈ [0,1]. Then, it makes its
own prediction p ∈ [0,1] (depending on the experts' predictions).
At the end of the round, it receives the next bit b of the sequence
and suffers loss |p − b| (the probability of making a wrong
prediction when 1 is predicted with probability p). The learner is
exposed to a worst-case analysis (like playing against an intelligent
adversary) in an “agnostic” setting (where any bit sequence is
conceivable). The goal of the learner is to keep the difference
between its own total loss and the total loss of the best expert
as small as possible. This difference is called the “regret” of
the learner. Many versions of this basic model are found in the
literature. The goal within the theoretical analysis is finding
tight bounds on the smallest possible regret.
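A minimal sketch of the model above is the exponentially weighted average forecaster (our own illustration, not an algorithm from the volume; the learning rate is the textbook tuning for a known horizon T, and the experts here are synthetic random predictors):

```python
import math
import random

random.seed(1)
T, N = 1000, 5
eta = math.sqrt(8 * math.log(N) / T)               # learning rate for horizon T

weights = [1.0] * N
learner_loss = 0.0
expert_loss = [0.0] * N
for _ in range(T):
    advice = [random.random() for _ in range(N)]   # expert predictions in [0,1]
    b = random.randint(0, 1)                       # the next bit of the sequence
    total = sum(weights)
    p = sum(w * a for w, a in zip(weights, advice)) / total
    learner_loss += abs(p - b)
    for i in range(N):
        expert_loss[i] += abs(advice[i] - b)
        weights[i] *= math.exp(-eta * abs(advice[i] - b))

regret = learner_loss - min(expert_loss)
# The classical analysis guarantees regret at most sqrt((T/2) * ln N),
# for every sequence of bits and expert predictions.
assert regret <= math.sqrt(T / 2 * math.log(N))
```

The striking feature, mirrored in the bound checked above, is that the regret grows only like the square root of the horizon and logarithmically in the number of experts.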
The paper by Harada, Takimoto and Maruoka studies the problem of
sequentially assigning a probability vector p over k options.
Then a cost vector c over the k options is revealed and the
learner suffers loss pᵀc (the expected cost under p). Vector
p is chosen in dependence of (an average of) N probability
vectors that represent the “expert advice” in this setting.
The authors generalize the known performance bounds for two
well-known algorithms (Hedge Algorithm and Aggregating Algorithm)
to the case where each available option has a different range of
possible losses. As mentioned in the paper, the setting can be
applied to the problem of finding a route with minimum total
expected delay through a network.
The paper by Poland and Hutter and the paper by Kujala and Elomaa
both study a variation of a problem that was introduced by Kalai
and Vempala at COLT 2003. The latter authors consider a broad class
of online optimization problems that contains “learning from expert
advice” as a special case. They furthermore present and analyze a
strategy named “follow the perturbed leader”. Poland and Hutter
extend this work in the bandit setting to the case of a countably
infinite pool of experts, and to the case in which, instead of
uniform upper bounds on the cost in each round, there are (slowly)
increasing bounds. Kujala and Elomaa likewise consider the framework
of Kalai and Vempala. They apply the strategy of McMahan and Blum
from COLT 2004 (which is designed such as to cope with an adaptive
adversary) to the more restricted case of an oblivious adversary.
This leads to stronger regret bounds.
The paper by Henderson, Shawe-Taylor and Žerovnik generalizes
the results in the standard model of learning a binary sequence
with expert advice to the multi-dimensional case of a binary
vector sequence.
In the classical model of “online learning”, there is a
class of target functions that is fixed in advance. Learning
proceeds in rounds. In every round, the learner first receives
an instance x, then makes a prediction y′ and, at the
end of the round, receives the true label y=ƒ(x) and suffers
loss l(y′,y). In case of a Boolean target class,
a natural choice for l is the zero-one loss (such that the
accumulated loss counts the number of mistakes). Many versions
of this basic model are found in the literature. The goal
within the analysis is finding the best possible loss bounds
(or mistake bounds, respectively) for a given class of target functions.
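A classical instance of the mistake-bound analysis described above is the halving algorithm, sketched here as our own toy example: for a finite Boolean target class C, predicting by majority vote over the current version space makes at most log2 |C| mistakes, since every mistake eliminates at least half of the remaining hypotheses.

```python
import math
from itertools import product

instances = list(range(4))
C = list(product([0, 1], repeat=len(instances)))   # all Boolean functions on 4 points
target = C[11]                                     # an arbitrary target function

version_space = list(C)
mistakes = 0
for x in instances:
    votes = sum(h[x] for h in version_space)
    prediction = 1 if 2 * votes >= len(version_space) else 0
    y = target[x]                                  # the true label y = f(x)
    if prediction != y:
        mistakes += 1
    # keep only hypotheses consistent with the revealed label
    version_space = [h for h in version_space if h[x] == y]

assert mistakes <= math.log2(len(C))               # at most log2(16) = 4 mistakes
```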
The paper by Mesterharm extends the standard worst-case online
learning framework by allowing delayed label feedback. It is
shown how a given standard online learner can be converted into
a learner that can cope with delay, and it is analyzed how the
mistake bounds are related. The analysis is provided for both
the adversarial setting and the setting where the instances are
drawn according to a (slowly) drifting distribution.
Chernov and Hutter establish bounds on future prediction errors
in online prediction of computably stochastic sequences. This is
a setting where fairly tight bounds exist for the total number of
errors leading one to suspect that the future errors must
necessarily decrease with time. The main contribution of this
paper is a quantitative treatment of this observation in terms
of (a variant of) Kolmogorov complexity.
The next series of papers investigates the problem of predicting (in
an online fashion) the binary label associated with a given instance.
The setting is agnostic, i.e., it is not assumed that there
is a hidden function (or distribution) which maps instances to
labels. Instead of using a comparison class or expert advice
(such that the learner competes with the best function from the
comparison class or with the best expert), the analysis aims at
verifying two abstract properties of the Forecaster's prediction
algorithm: being “well-calibrated” and having “high resolution”.
If a prediction p ∈ [0,1] is viewed as the Forecaster's “confidence”
that the (unknown) binary label associated with a given instance x
is 1 (where the label is revealed by the adversary after the Forecaster
has made its commitment), then being well calibrated roughly means that the
Forecaster's confidence estimates are accurate on average. Having high
resolution roughly means that the Forecaster is “sufficiently
specific” to the given instance x.
Vovk describes a prediction algorithm for the Forecaster that provably
is well-calibrated and has high resolution. The algorithm is kernel-based.
The analysis is quite involved with results given in terms of the
Fermi-Sobolev norm of Lipschitzian functions.
The next paper in the volume (again by Vovk) demonstrates how one
can derive classical results in the “learning with expert advice” setting
starting from results of the preceding paper. Here, the “expert class” is
infinite-dimensional. Basically the learner competes with any decision
rule of bounded Fermi-Sobolev norm. The main result is quite general and
applies to a wide variety of decision-theoretic scenarios.
The third result in this series about “defensive forecasting” is the paper
by Vovk and Nouretdinov which goes one step further. It considers multi-class
forecasting and inserts an additional player called “Skeptic”.
Loosely speaking, the Skeptic tries to make money on any lack of agreement
between Forecaster and Reality. The main result basically states that,
for any continuous strategy of Skeptic, there exists a strategy of Forecaster
that does not allow Skeptic's capital to grow, regardless of what Reality
does.
In the teaching model, a teacher delivers examples to a learner. The
examples are labeled according to a target function (known to the teacher
but unknown to the learner). In order to avoid collusion between a learner
and a teacher (such as agreement on a coding scheme which allows the teacher
to encode a representation of the target function), the success criterion
for teaching must be carefully defined. One popular definition prescribes
that the target should be the only function which is left in the version
space after all examples in the teaching sequence have been processed.
It turns out, unfortunately, that, with this definition, many harmless
looking concept classes may require intolerably long teaching sequences
in the worst-case. For this reason, one finds several alternative success
criteria in the literature.
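The version-space success criterion described above can be made concrete with a brute-force search for a minimum teaching set. The tiny concept class below (the empty concept and the three singletons over a three-element domain) is our own hypothetical example:

```python
from itertools import combinations

domain = [0, 1, 2]
# concepts as characteristic vectors: the empty concept and three singletons
concepts = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]

def consistent(h, examples):
    return all(h[x] == y for x, y in examples)

def minimum_teaching_set(target):
    # smallest set of labeled examples after which the target is the
    # only concept left in the version space
    for size in range(len(domain) + 1):
        for xs in combinations(domain, size):
            examples = [(x, target[x]) for x in xs]
            if [h for h in concepts if consistent(h, examples)] == [target]:
                return examples
    return None

print(minimum_teaching_set((1, 0, 0)))   # a single positive example suffices
print(minimum_teaching_set((0, 0, 0)))   # the empty concept needs all three negatives
```

Already in this four-concept class the teaching cost varies between one and three examples, hinting at how quickly worst-case teaching sequences can grow for larger classes.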
Balbach and Zeugmann make a new suggestion for a teaching model where
collusion is avoided in a different fashion. They augment the concept
class by a neighbourhood relation and let the teaching process proceed
in rounds. In each round, the teacher presents a new example x. If the
learner's current hypothesis h errs on x, the learner may pass from h to
a new hypothesis h′ from the neighbourhood (where h′ must be chosen
such as to make a minimal number of mistakes on the seen examples).
Balbach and Zeugmann analyze several variants of this approach and relate
it to the classical models. It is a nice feature of their approach that
it leads to meaningful statements about teachability of classes over an
infinite domain.