Learning theory is an active research area with contributions from
various fields including artificial intelligence, theoretical computer
science, and statistics. The main thrust is an attempt to model
learning phenomena in precise ways and study the mathematical
properties of these scenarios. In this way one hopes to gain a better
understanding of the learning scenarios and of what is possible, or, as
we call it, learnable in each. Of course, this goes hand in hand with a
study of algorithms that achieve the required performance. Learning
theory thus aims to define reasonable models of phenomena and to find
provably successful algorithms within each such model. To complete the
picture, we also seek impossibility results showing that certain things
are not learnable within a particular model, irrespective of the
particular learning algorithms or methods being employed.
Unlike computability theory, where a single uniform notion of what is computable holds across multiple models of computation, not all learning models are equivalent. This should not come as a surprise, as the range of learning phenomena found in nature is wide, and the ways in which learning systems are applied to real-world engineering problems are varied. Learning models cover this spectrum and vary along several dimensions, such as the parameters characterizing the learning environment, the learning agent, and the evaluation criteria. Some of these are: Is there a ``teacher''? Is the teacher reliable or not? Is the learner passive or active? Is the learning function required to be computable or efficient (polynomial time)? Is learning done on-line, receiving one example at a time, or in batch? Is the learner required to reproduce learned concepts exactly, or is it merely required to produce a good approximation? Are the hypotheses output by the learner required to belong to a certain representation class, or are they free of syntactic constraints? Naturally, such variations lead to dramatically different results. Learning theory has studied these aspects extensively, gaining a deeper understanding of the underlying phenomena and better algorithms for the various problems.
Over the last few years learning theory has had a direct impact on practice in machine learning and its various application areas, with some algorithms driving leading-edge systems. To name a few techniques with roots in learning theory that have been applied to real-world problems: support vector machines, boosting techniques, on-line learning algorithms, and active learning methods. Each of these techniques has proven extremely effective in its respective application areas, such as pattern recognition, web and data mining, information extraction, and genomics. Such developments have recently inspired researchers in the field to investigate the relation between theory and practice by combining theory with experimental validation or applications.
Thus the picture is not uniform, and the field is making progress by exploring new models to capture new problems and phenomena, and by studying algorithmic questions within each of these models. It is in this light that the papers in these proceedings should be read. We have collected the papers in subgroups, with headings highlighting certain similarities in topic or style, but one should keep in mind that a paper could often be classified in more than one category.
The invited lecture for ALT 2001 and DS 2001 by Setsuo Arikawa describes the Discovery Science Project in Japan which aimed to develop new methods for knowledge discovery, to install network environments for knowledge discovery, and to establish Discovery Science as a new area of Computer Science and Artificial Intelligence. Though algorithmic learning theory and machine learning have been integrated into this project, the researchers involved took a much broader perspective. Their work shed new light on many problems studied in learning theory and machine learning and led to a fruitful interaction between the different research groups participating in the project.
In her invited lecture, Dana Angluin presents a comprehensive survey of the state of the art of learning via membership queries, equivalence queries, or both, a field she initiated. Major emphasis is put on the number of queries needed to learn a class of concepts. This number is related to various combinatorial characterizations of concept classes, such as the teaching dimension, the exclusion dimension, the extended teaching dimension, the fingerprint dimension, the sample exclusion dimension, the well-known Vapnik-Chervonenkis dimension, the abstract identification dimension, and the general dimension. Each of these dimensions emphasizes a different view of the learning problem and leads to a better understanding of what facilitates or complicates query learning.
Robot Baby 2001 is presented by Paul R. Cohen et al. in their invited lecture. This paper provides strong evidence that meaningful representations are learnable by programs. Different notions of meaning are discussed, and special emphasis is put on a functional notion of meaning appropriate for programs to learn. Several interesting algorithms are provided, and experimental results of their application are surveyed. This work raises the interesting challenge of deriving theoretical results that capture aspects of practical relevance, thus tightening the relation between theory and practice.
Papers in the first section deal with complexity aspects of learning. Yang studies learnability within the statistical query (SQ) model introduced by Kearns. This model captures one natural way to obtain noise robustness when examples are drawn independently and identically distributed (i.i.d.) from a fixed but unknown distribution, as in the well-known PAC model. In this case, if the algorithm does not rely directly on specific examples but rather on measurable statistical properties of examples, then it is guaranteed to be robust to classification noise. Yang studies learnability when the class of concepts in question includes highly correlated concepts, showing a lower bound in terms of the desired accuracy. This yields non-learnability results in the SQ model for a certain class of concepts, which are contrasted with the PAC learnability of the same class. This result provides an interesting example separating these learning models, since, except for classes of parity functions, practically all known PAC-learnable classes have been shown to be SQ-learnable as well.
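The restriction to statistical properties can be made concrete with a small sketch. The following toy simulation (the names, the averaging-based oracle, and the toy target are illustrative, not from Yang's paper) answers a statistical query, i.e. an estimate of the expectation of a predicate of an example and its label, from a noisy i.i.d. sample; a learner that consumes only such estimates never inspects an individual example, which is the source of the model's noise tolerance.

```python
import random

def sq_oracle(examples, chi, tolerance):
    """Simulate a statistical query: estimate E[chi(x, label)] over the
    example distribution.  A real SQ oracle guarantees the answer within
    the requested tolerance; here we simply average a large sample, so
    the guarantee holds only with high probability."""
    return sum(chi(x, y) for x, y in examples) / len(examples)

def noisy_sample(n, eta, rng):
    """Toy target: the label is the first bit, flipped with noise rate eta."""
    sample = []
    for _ in range(n):
        x = tuple(rng.randint(0, 1) for _ in range(3))
        y = x[0] if rng.random() > eta else 1 - x[0]
        sample.append((x, y))
    return sample

rng = random.Random(0)
data = noisy_sample(20000, eta=0.1, rng=rng)
# Query the agreement rate of feature 0 with the (noisy) label;
# the true value is 1 - eta = 0.9.
corr = sq_oracle(data, lambda x, y: 1 if x[0] == y else 0, 0.01)
```

Note that the noise enters only through the estimated statistic, which is shifted by a known amount (here from 1.0 to 1 - eta), so an SQ learner can correct for it.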
The name boosting refers to a class of algorithms that use ``weak'' learners (which provide good but imperfect hypotheses) to build more accurate hypotheses. The idea is to combine several runs of the weak learner, and techniques vary in the way this is done. It was recently observed that the well-known decision tree learning algorithms can be viewed as boosting algorithms. Hatano improves these results by providing a tighter analysis of decision tree learning as boosting when the trees include splits with more than two branches.
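The general scheme can be sketched in a few lines. Below is a minimal AdaBoost-style implementation over decision stumps (a standard textbook variant, not Hatano's multiway-split analysis): the training examples are reweighted so that each weak hypothesis concentrates on the mistakes of its predecessors, and the final classifier is a weighted vote.

```python
import math

def adaboost(points, labels, stumps, rounds):
    """Minimal AdaBoost sketch: maintain a distribution over training
    examples, repeatedly pick the weak hypothesis with least weighted
    error, and up-weight the examples it got wrong."""
    n = len(points)
    w = [1.0 / n] * n
    ensemble = []  # pairs (vote weight alpha, weak hypothesis)
    for _ in range(rounds):
        def weighted_error(h):
            return sum(wi for wi, x, y in zip(w, points, labels) if h(x) != y)
        h = min(stumps, key=weighted_error)
        err = max(weighted_error(h), 1e-10)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, h))
        # Reweight: mistakes go up by e^alpha, correct answers down.
        w = [wi * math.exp(-alpha * y * h(x))
             for wi, x, y in zip(w, points, labels)]
        total = sum(w)
        w = [wi / total for wi in w]
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# Toy 1-D data: no single stump is consistent with the labels,
# but a weighted vote of three boosted stumps is.
points = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
labels = [-1, -1, 1, -1, 1, 1]
stumps = [(lambda x, t=t, s=s: s if x > t else -s)
          for t in (0.15, 0.25, 0.45, 0.65, 0.8) for s in (1, -1)]
classify = adaboost(points, labels, stumps, rounds=3)
```

On this data the best single stump still errs on one point, while the three-round ensemble classifies the whole training set correctly.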
It is well known that an algorithm which uses hypotheses of bounded capacity and finds a hypothesis consistent with the training data learns the concept class in question in the PAC learning model. Similar results hold for algorithms that minimize the number of inconsistencies with the training data. Various negative results have been derived showing that the above is infeasible for neural networks. Šíma continues this line by showing that for the sigmoid activation function even approximating the minimum training error is intractable. This is important, as the popular backpropagation algorithm uses a gradient method to try to optimize exactly this function.
Support vector machines (SVM) use a neural model with a single neuron and threshold activation function, but combine two aspects to overcome learning difficulties with neural models. First, the input domain is implicitly enhanced to include a large number of possibly useful features. Second, the algorithm uses optimization techniques to find a threshold function of ``maximum margin'', providing robustness, since examples are not close to the decision surface of the hyperplane. The first aspect is achieved using so-called kernel functions, which are used directly in the optimization procedure so that the enhanced features are never produced explicitly. Sadohara presents a kernel appropriate for learning over Boolean domains. Using this kernel is equivalent to learning a threshold element whose features are all conjunctions over the basic feature set. This is highly relevant to the problem of learning DNF expressions, a question which has received considerable attention in learning theory. The paper also presents experiments demonstrating that an SVM using this kernel performs well and compares favorably with other systems on Boolean data.
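The flavor of such a Boolean kernel can be illustrated with a short sketch (a natural all-conjunctions kernel, not necessarily Sadohara's exact construction). Taking one indicator feature per conjunction of literals, a conjunction is satisfied by both inputs exactly when each of its literals lies on a coordinate where the inputs agree, so the implicit inner product collapses to a closed form that never enumerates the exponentially many features.

```python
from itertools import product

def all_conjunctions_kernel(x, y):
    """Inner product in the feature space of all conjunctions of
    literals, where phi_c(x) = 1 iff x satisfies conjunction c.
    Each agreeing coordinate contributes a factor of 2 (include the
    matching literal or not), each disagreeing one a factor of 1."""
    agreements = sum(1 for a, b in zip(x, y) if a == b)
    return 2 ** agreements

def explicit_count(x, y):
    """Brute-force check: enumerate every conjunction (each coordinate
    absent, required 1, or required 0) and count those satisfied by
    both inputs."""
    count = 0
    for pattern in product((None, 1, 0), repeat=len(x)):
        if all(lit is None or (x[i] == lit and y[i] == lit)
               for i, lit in enumerate(pattern)):
            count += 1
    return count

x, y = (1, 0, 1), (1, 1, 1)
k = all_conjunctions_kernel(x, y)   # x and y agree on 2 coordinates
```

The closed form runs in linear time in the input length, while the feature space it represents has size 3^n.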
Balcázar et al. propose a sampling-based algorithm for solving the quadratic optimization problem involved in SVM (albeit not for the kernel construction). The intention is to improve complexity in cases where the number of examples is large. The basic idea is to use random sub-samples from the data set, where the distribution is carefully controlled so as to iteratively improve the solution, guaranteeing convergence in a small number of rounds. Algorithms are proposed both for the separable case and for the ``noisy'' non-separable case, and practical complexity aspects of the methods are developed.
The next section introduces models which try to capture new aspects of learning phenomena and studies the complexity of learning within these. Garg and Roth introduce the notion of a coherence constraint. The idea is that several correlated concepts exist over the instance space. This correlation is captured by a coherence constraint, which effectively implies that certain inputs (on which the constraint is violated) are not possible. The paper shows that the existence of such constraints implies reduced learning complexity in several scenarios. Experiments illustrate that this indeed happens in practice as well.
Kwek introduces a model where several concepts are learned simultaneously and where some concepts may depend on others. Since intermediate concepts are used as features in other concept descriptions, the overall descriptional power is much higher. Nevertheless, the paper shows that learnability of the representation class used for each concept implies learnability of all concepts in several models, e.g., the PAC model. When the learner is active, that is, when membership queries are allowed, this does not hold.
Dooly et al. study the so-called multiple-instance learning problem. In this problem, motivated by molecular drug activity applications, each example is described as a set of configurations, one of which is responsible for the observed activity or label. The paper studies the case where the ``label'' is real-valued, giving a quantitative rather than binary measure of the activity. Negative results on learning from examples are presented, and a new model of active learning, which permits membership or value queries, is shown to enable learnability.
On-line prediction games provide a model for evaluating learners or prediction strategies under very general conditions. In this framework an iterative game is played where, in each step, the learner makes a prediction, observes the true value, and suffers a loss based on these two values and a fixed loss function. The prediction complexity of a sequence gives a lower bound on the loss of any prediction algorithm for the sequence and can thus be seen as another way to characterize the inherent complexity of strings. The paper by Kalnishkan et al. derives results on the average complexity when the sequence is i.i.d. Bernoulli and relates this to the information complexity of the sequence. As a result, it is shown that Kolmogorov complexity does not coincide with prediction complexity for the binary game. The paper by Vyugin and V'yugin studies the relation between Kolmogorov and predictive complexity. In particular, a gap is established which depends logarithmically on the length of the strings.
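The structure of such a game is easy to sketch. The loop below plays the square-loss binary game with a simple smoothed-frequency predictor (both the loss function and the strategy are illustrative choices, not taken from either paper); the cumulative loss it returns is the quantity that prediction complexity lower-bounds over all strategies.

```python
def play_prediction_game(bits, predict):
    """One round per bit: the learner predicts a value in [0, 1], the
    true outcome is revealed, and the learner suffers square loss
    (p - y)**2.  Returns the cumulative loss over the sequence."""
    loss, history = 0.0, []
    for y in bits:
        p = predict(history)   # the prediction may use only the past
        loss += (p - y) ** 2
        history.append(y)
    return loss

def laplace(history):
    """Laplace-style strategy: predict the smoothed frequency of ones
    seen so far (starts at 1/2 on the empty history)."""
    return (sum(history) + 1) / (len(history) + 2)

bits = [0, 1, 0, 0, 1, 0, 1, 1, 0, 0]
total = play_prediction_game(bits, laplace)
```

Replacing `laplace` by any other function of the history plays the same game, which is what makes the framework strategy-independent.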
Research in inductive inference follows seminal work by Gold, who introduced the model of learning in the limit. Here finite complexity, rather than polynomial complexity, defines the notion of feasibility. For concept learning, Gold defined two learning models: Text, where the learner sees only positive examples of the underlying concept, and Informant, where both positive and negative examples are seen. Jain and Stephan study several intermediate models of information presentation where the learner switches between asking to see positive examples and asking to see negative ones. A hierarchy between the notions is established, and a more refined hierarchy is shown when the number of times the learner switches example types is limited.
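Gold's positive result for Text can be illustrated on the simplest case, the class of finite languages (a standard textbook example, not from Jain and Stephan's paper): the learner that conjectures exactly the set of words seen so far converges, on any text for a finite target, once every element of the target has appeared.

```python
def learner_from_text(stream):
    """Gold-style learner for the class of finite languages: after each
    positive example, conjecture exactly the set of strings seen so
    far.  On any text for a finite language L the conjectures stabilize
    on L, i.e., the learner identifies L in the limit."""
    seen = set()
    conjectures = []
    for word in stream:
        seen.add(word)
        conjectures.append(frozenset(seen))
    return conjectures

# A text for the finite language {"a", "ab", "ba"}: every element
# appears at least once, repetitions are allowed.
text = ["a", "ab", "a", "ba", "ab", "a"]
guesses = learner_from_text(text)
```

After the fourth example the conjecture has converged and never changes again, which is exactly the learning-in-the-limit criterion; note that the learner is never told that convergence has occurred.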
In a second paper, Jain and Stephan study the problem of learning to separate pairs of disjoint sets which do not necessarily cover the instance space. Several restrictions on the learner are studied within this framework, for example: conservative learners, who only abandon hypotheses contradicted by the data, and set-driven learners, whose hypotheses depend exclusively on the range of the input. The effect of these restrictions and their interaction is studied extensively. The two notions mentioned here are not comparable if the learner converges on all data sequences.
Jain et al. study a related model of learning unions of languages. Two main variants are discussed, where the learner either needs to identify the union itself or is required to identify the component languages of the union. Several results relating the strength of the models are derived, establishing hierarchies as the number of languages in the union increases, as well as identifying complete problems for these classes under appropriate reductions.
Zilles studies a notion of meta-learning where a single learning algorithm can learn several concept classes by being given an index of the class as a parameter. Two scenarios are discussed, in which the learner either uses a single representation for hypotheses in all classes or may change the representation with the class. The paper studies the effect of restricting the learner, for example to be conservative as described above, on the learnable classes. Various separation results are established using finite concept classes to distinguish the models.
The next group of papers also deals with inductive inference. The common aspect studied by all these papers is the notion of refutation, whose original introduction was in part motivated by the design of automatic discovery systems, in which the choice of the hypothesis class is a critical parameter. In that work, the following scenario is considered. The learner is given a hypothesis space of uniformly recursive concepts in advance. Whenever the target concept can be correctly described by a member of this hypothesis space, the learner has to identify it in the limit. If, however, the learner is fed data of a target concept that has no correct description within the given hypothesis space, then the learner has to refute the whole hypothesis space after a finite amount of time by outputting a special refutation symbol and stopping the learning process. Thus, within the model of learning refutably, the learner either identifies a target concept or itself indicates its inability to do so.
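For a finite hypothesis space the scenario admits a very small sketch (the names and the set-valued concepts are illustrative; the model itself concerns uniformly recursive classes and learning in the limit). The learner discards hypotheses contradicted by the data and outputs a refutation symbol once the whole space is exhausted:

```python
REFUTE = "refute"

def learn_refutably(hypotheses, stream):
    """Toy learning-with-refutation loop over a finite hypothesis space
    of concepts represented as sets.  Examples arrive as (item, label)
    pairs with label 1 for positive and 0 for negative.  Hypotheses
    contradicted by any example are discarded; if all of them are
    discarded, the whole space is refuted and learning stops."""
    alive = list(hypotheses)
    for item, label in stream:
        alive = [h for h in alive if (item in h) == bool(label)]
        if not alive:
            return REFUTE
    # In the limit, output some hypothesis still consistent with the data.
    return alive[0]

space = [{1, 2}, {2, 3}]
# The data requires both 1 and 3 to be in the target, which no
# hypothesis in the space satisfies, so the space is refuted.
verdict = learn_refutably(space, [(1, 1), (3, 1)])
```

The key point captured by the sketch is that refutation happens after finitely many examples, so the learner's failure is itself an observable, finite event.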
Mukouchi and Sato extend the original approach by relaxing the correctness criterion and by allowing noisy examples. In their new model, the learner must infer a target concept provided it has a (weak) k-neighbor in the hypothesis space; otherwise it again has to refute the hypothesis space. Here a (weak) k-neighbor is defined in terms of a distance over strings.
Jain et al. study several variations of learning refutably. Now the target concepts are drawn from the set of all recursive functions, and an acceptable programming system is given as the hypothesis space. Thus it is no longer appropriate to refute the whole hypothesis space, since it contains a correct description for every target. Nevertheless, the learner may not be able to solve its learning task. This can be indicated either by outputting the refutation symbol and stopping the learning process, by converging to the refutation symbol, or by outputting the refutation symbol infinitely often. All these models of learning refutably are studied and related to one another, as well as to their combinations with other, previously studied learning models within the setting of inductive inference.
Last but not least within this group of papers, Merkle and Stephan extend the original notion of learning with refutation from positive data by introducing the notion of refutation in the limit and by considerably extending the notion of Text. Now the data sequences are sequences of first-order sentences describing the target. Several new and interesting hierarchies are then established.
The papers in the last section study learnability of formal languages and structured data. Arimura et al. consider the problem of identifying tree patterns in marked data. The paper extends previous work by allowing gaps in the tree pattern, that is, internal sub-trees may be skipped when matching a tree pattern to an example tree. The paper shows that the task is solvable in polynomial time in an active learning setting where the learner can ask queries.
Elementary formal systems are similar to logic programs operating on strings and having a distinguished unary predicate; the true atoms for this predicate define a formal language. Lange et al. extend work on the learnability of such systems to allow negation in the logic program. The extension is done along the lines of stratified logic programs. Learnability is studied and compared to the case without negation. In Gold's paradigm some positive results do not transfer to the extended systems, but in the PAC model the main known positive result is shown to carry over.
The problem of learning regular languages has been extensively studied under several representation schemes. Dennis et al. study the learnability of regular languages using a non-deterministic representation based on residuals, the completion languages of prefixes of words in the language. While the representation is shown not to be polynomially learnable, parameters of the representation are studied empirically, and these suggest a new learning algorithm with desirable properties. Experiments show that the algorithm compares favorably with other systems.
The class of Büchi automata defines languages over infinite strings. When modeling the learnability of such languages, one is faced with the question of how to present examples of infinite size. The paper by de la Higuera and Janodet introduces a model of learning such languages from finite prefixes of examples. While the complete class is not learnable, a sub-class is identified and shown to be learnable in the limit with polynomial update time on each example.
As the above descriptions of papers demonstrate, the range of learning problems and issues addressed by the papers in this volume is rich and varied. While we have partitioned the papers mainly according to the techniques used, they could have been classified according to the classes of objects that are being learned. These include representations of formal languages, recursive functions, Boolean concepts over Boolean domains, real valued functions, neural networks, and kernel based SVM. Several of the papers also combine the theoretical study with an empirical investigation or validation of new ideas, and it would also be beneficial to classify them according to the types of relevant application areas. While the papers in this volume will surely not give an exhaustive list of problems addressed and types of theory developed in learning theory in general, it is our hope that they will give an idea of where the field stands currently and where it may be going in the future.