Editors' Introduction

José L. Balcázar, Philip M. Long, and Frank Stephan

The conference “Algorithmic Learning Theory 2006” is dedicated to studies of learning from a mathematical and algorithmic perspective. Researchers consider various abstract models of the problem of learning and investigate how the learning goal in such a setting can be formulated and achieved. These models describe ways to define

the goal of learning,
how the learner retrieves information about its environment,
how to form of the learner's models of the world (in some cases).

Retrieving information is in some models is passive where the learner just views a stream of data. In other models, the learner is more active, asking questions or learning from its actions. Besides explicit formulation of hypotheses in an abstract language with respect to some indexing system, there are also more implicit methods like making predictions according to the current hypothesis on some arguments which then are evaluated with respect to their correctness and wrong predictions (coming from wrong hypotheses) incur some loss on the learner. In the following, a more detailed introduction is given to the five invited talks and then to the regular contributions.

Gunnar Rätsch works on boosting and support vector machines. His is also interested in online-learning, optimisation theory, new inference principles and new types of machine learning problems. He also applies his results to real word problems from computational biology and chemistry. In his invited talk for ALT 2006, Gunnar spoke about using boosting techniques to solve semi-infinite linear programs, which can be used to address a wide variety of learning problems, including learning to make complex predictions.

Carole Goble works on the World-Wide Web, particularly the Semantic Web and Electronic Science / Grids. As the name suggests, the Semantic Web aims to facilitate the expression and use of meaning on the World-Wide Web. Electronic Science is scientific investigation in which groups are distributed globally. In her invited lecture for DS 2006, Garole Goble presented these two areas and laid out why these two areas depend on each other.

Hans Ulrich Simon studies the complexity of learning, that is, how much of resources of various types are needed for solving theoretically formulated learning problems. In particular, he has worked on query-learning and statistical models. In his invited talk for ALT 2006, Hans Ulrich Simon described work on the learnability of function classes using statistical queries, in which an algorithm interacts with the environment by asking for estimates of probabilities. The model is motivated because previous work had shown that algorithms that obey such a restriction can be made robust against certain kinds of noise. For finite classes, Hans described connections between the complexity of learning with statistical queries and the structure of the matrix of correlations between pairs of possible target functions. The structure is captured by the spectral norm of this matrix.

Padhraic Smyth works on all aspects linked to large scale databases as they are found in many applications. To extract and retrieve useful information from such large data bases is an important practical problem. For that reason, his research focusses on using large databases to build descriptive models that are both accurate and understandable. His invited talk for DS 2006 is on data-driven discovery with statistical approaches. Generative probabilistic models have already been proven a useful framework in machine learning from scientific data and the key ideas of this research include (a) representing complex stochastic phenomena using the structured language of graphical models, (b) using latent (hidden) variables for inference about unobserved phenomena and (c) leveraging Bayesian ideas for learning and predicting. Padhraic Smyth began his talk with a brief review of learning from data with hidden variables and then discussed some recent work in this area.

Andrew Y. Ng has research interests in machine learning and pattern recognition, statistical artificial intelligence, reinforcement learning and adaptive control algorithms for text and web data processing. He presented the joint invited talk of ALT 2006 and DS 2006. His talk was on algorithms for control that learn by observing the behaviors of competent agents, rather than through trial and error, the traditional reinforcement learning approach.

The Presentation for the E. M. Gold Award. The first contributed talk presented at ALT 2006 was the talk “Learning unions of ω(1)-dimensional rectangles“ by Alp Atıcı and Rocco Servedio for which the first author received the E. M. Gold Award, as the program committee felt it was the best contribution submitted to ALT 2006 which is co-authored by a student. Atıcı and Servedio study the learnability of unions of rectangles over {0,1,...,b-1}ⁿ in dependence of b and n. They give algorithms polynomial in n and log b to learn concepts which are the majority of polynomially many or the union of polylogarithmically many rectangles of dimension a bit below log(n log b) and log²(n log b), respectively.

Query Learning. Query Learning is a learning model where a learner or pupil asks a teacher questions about the concept to be learned. One important component of this model is a formal query language used during the learning process; the teacher has to answer every query posed in this language correctly. In this model, the complexity of a learning problem is the maximum number of queries needed by the best learning algorithm provided that the answers of the teacher meet the given specifications; however the teacher himself can be adversary in the sense that he can make the learner to learn as slow as possible as long as he does not violate the constraints. In some settings, also probabilistic teachers are considered instead of adversarial ones.

Nader H. Bshouty and Ehab Wattad investigate the question of how to learn halfspaces with random consistent hypothesis. In their model, the learner combines in each round several randomly selected halfspaces consistent with all data seen so far by majority vote to one object and then queries the teacher whether these objects are correct. If so, the learner has succeeded; otherwise the teacher returns a counterexample where the hypothesis and the concept to be learnt disagree. In addition to the teacher, the learner has access to a random oracle returning half spaces consistent with the counterexamples seen so far. The authors show that this algorithm needs roughly only two thirds as many queries to the teacher as the best known previous algorithm working with single halfspaces as hypotheseses space.

Matti Kääriäinen deals with the setting where the learner receives mostly unlabeled data, but can actively ask a teacher to label some of the data. Most previous work on this topic has concerned the realizable case, in which some member of a concept class achieves perfect accuracy. Kääriäinen considers the effects of relaxing this constraint in different ways on what can be proved about active learning algorithms.

Jorge Castro extends the setting of exact learning with queries into the world of quantum mechanics. He obtains counterparts of a number of results on exact learning; the new results hold for algorithms that can ask queries that exploit quantum effects.

The complexity of teaching. Learning and teaching are viewing the learning process from two sides. While learning mainly focusses on the aspect of how to extract information from the teacher, teaching focusses on the question of how to help a pupil to learn fast; in the most pessimistic models, the teacher must force learning. In this model it can be more interesting to consider randomized or adversarial learners than cooperative ones; a teacher and a cooperative pupil might agree on some coding which permits rapid learning success. Nevertheless, the learner should have some type of consistency constraint since otherwise the teacher cannot force the learner to update wrong hypotheses.

Frank Balbach and Thomas Zeugmann consider in their paper a setting where the learning task consists only of finitely many concepts and the learner keeps any hypothesis until it becomes inconsistent with the current datum presented by the teacher; at that moment the learner revises the hypothesis to a new one chosen from all consistent hypothesis at random with respect to the uniform distribution. The authors show that it is NP-hard to find out whether a good teacher might force the learners to learn a given polynomial-sized class in a given time with high probability. Furthermore, the choice of the sequence on which the learners would succeed is hard; as otherwise one could simulate the learners on this sequence and retrieve their expected behaviour from that knowledge.

Inductive Inference and its complexity. Inductive inference is the learning-theoretic counterpart to recursion theory and studies the learnability of classes of recursive or recursively enumerable sets in the limit. Gold's fundamental paradigm is that such a class is given and that the learner sees an infinite sequence in arbitrary, perhaps adversary order containing all the elements of the set but nothing else except perhaps pause symbols. From these data and some implicit knowledge about the class the learner makes a finite sequence of conjectures such that the last one is an index for the set to be learned, that is, an algorithm which enumerates the members of the given set. Gold showed already that the class of all recursively enumerable sets is not learnable and since then many variants of his basic model have been addressed, which mainly tried to capture not only the learning process in principle but also its complexity. How many mind changes are needed, how much memory of data observed so far has to be kept, what types of revisions of the previous hypothesis to the current one is needed? An example for such an additional constraint is that some interesting classes but not all learnable ones can be identified by learners which never output a conjecture which is a proper subset of some previous conjecture.

Stephen Fenner and William Gasarch dedicated their paper to a specific learning problem, namely, given a language A find the minimum-state deterministic finite automaton accepting the language SUBSEQ(A) which consists of all substrings of strings contained in A; this language is always regular and thus the corresponding automaton exists. In their approach, the data is given as an informant which reveals not only the members of A but also the nonmembers of A. Nevertheless, SUBSEQ(A) can only be learned for restrictive classes of sets A like the class of all finite or the class of all regular sets. If the class is sufficiently rich, learning fails. For example there is no learner which learns SUBSEQ(A) for all polynomial time computable sets A. They show that for every recursive ordinal α there is a class such that one can learn SUBSEQ(A) from any given A in this class with α mind changes but not with β mind changes for any β < α.

Matthew de Brecht and Akihiro Yamamoto show in their paper that the class of unbounded unions of languages of regular patterns with constant segment length bound is inferable from positive data with an ordinal mind change bound. The authors give depending on the length of the constant segments considered and the size of the alphabet bounds which are always between the ordinals ω^ω and ω^{ω^ω}. The authors claim that their class is the first natural class (besides those classes as in the previous paper obtained by coding ordinals) for which the mind change complexity is an ordinal beyond ω^ω. The authors discover that there is a link from their topic to proof theory.

Sanjay Jain and Efim Kinber contributed to ALT 2006 two joint papers. In their first paper they deal with the following requirement: If a a learner does not see a full text T of a language L to be learnt but just a text of some subset, then it should still converge to some hypothesis which is a superset of the content of the text T. There are several variants considered with respect how the language W_e generated by the hypothesis relates to L: in the first variant, W_e ⊆ L, in the second variant, W_e ⊆ L' for some class L' in the class C of languages to be learnt; in the third variant, W_e ∈ C. It is shown that these three models are different and it is characterised when a uniformly recursive class of languages is learnable under one of these criteria.

Sanjay Jain and Efim Kinber consider in their second paper iterative learning where the learner reads one by one the data and either ignores it or updates the current hypothesis to a new one which only depends on the previous hypothesis and the current datum. The authors extend in their work this model such that they permit the learner to test its current hypothesis with a teacher by a subset query and to use the negative information arising from the counterexample for the case that they are wrong. The authors consider three variants of their model with respect to the choice of the counterexample by the teacher: whether it is the least negative counterexample, bounded by the maximum size of input seen so far or just arbitrary. The authors compare these three notions with each other and also with other important models from the field of inductive inference.

Sanjay Jain, Steffen Lange and Sandra Zilles study incremental, that is, iterative learning from either positive data only or from positive and negative data. They focus on natural requirements such as conservativeness and consistency. Conservativeness requires that whenever the learner makes a mind change it has already seen a counterexample to this hypothesis. Consistency requires that the learner always outputs a hypothesis which generates all data seen so far and perhaps also some more. There are several variants of these requirements, for example with respect to the question what the learer is permitted or not permitted to do with data not coming from any language to be learnt. The authors study how these versions relate to iterative learning.

Online learning. The difference between online and offline learning is that the online learner has to react to data immediately while the offline learner reads all the data and then comes up with a programme for the function. The most popular online learning model can be viewed as a prediction game to learn a function f: in each of a series of rounds, the learner encounters an item x; the learner makes a prediction y for the value f(x); the learner discovers the true value of f(x). For each wrong prediction, the learner might suffer some loss. The overall goal is keep the total loss small.

In many settings of online learning, there is already a pool of experts whose advice (predictions) are heard by the learner before making the prediction. The learner takes this advice into account and also collects statistics on the realiability of the various experts. It is often advisible to combine the expert predictions, e.g. through some kind of weighted voted, rather than to greedily follow the expert that appears to be best at a given time. Evaluating and combining experts has become a discipline on its own inside the community of online learning.

Nader H. Bshouty and Iddo Bentov focus on the question of the dependence of the performance of a prediction algorithm on the way the data is presented: does the data where the learner has to make a prediction for a Boolean function come adversarily, from a uniform distribution or from a random walk? The authors consider a few particular exact learning models based on a random walk stochastic process. Such models are more restricted than the well known general exact learning models. They give positive and negative results as to whether learning in these particular models is easier than in the general learning models.

Eyal Even-Dar, Michael Kearns and Jennifer Wortman want to incorporate explicit risk considerations into standard models of worst-case online learning: they want to combine the forecasts of the experts not only with respect to the expected rewards but also by taking into account the risk in order to obtain the best trade-off between these two parameters. They consider two common measures balancing returns and risk: the Sharpe ratio and the mean-variance criterion of Markowitz. It turns out to be impossible to build no-regret algorithms under these measures. But the authors show that the algorithm of Cesa-Bianchi, Mansour and Stoltz achieves nontrivial performance when a modified risk-return measure is used.

Vladimir Vovk considers the experts as given by a pool of prediction strategies represented as functions in a normed function class. Considering mostly those strategies whose norm is not too large, it is well known that there is a “master prediction strategy” that performs almost as well as all of these strategies. The author constructs a “leading prediction strategy” which even serves as a standard for the prediction strategies in the pool: each of them suffers a small loss to the degree that its predictions resemble the leading stategy's prediction and only those strategies are successful which copycat it. This result is first given for quadratic loss functions and later extended to other loss functions like Bregman divergences.

Chamy Allenberg, Peter Auer, László Györfi and György Ottucsák study the sequential prediction problem of combining expert advice. They consider a multi-round scenario and unbounded loss where the aim of the learner is to lose on long term in each round not more than the best expert. Furthermore, the feedback received by the learner is not complete. Such a scenario is called “partial monitoring” and the learner is informed about the performance of the expert it wants to track only with a certain probability; the scenario is the combination of the label efficient and multi-armed bandit problem. The authors obtain for bounded and unbounded losses the following results. In the case of bounded losses, the authors develop an algorithm whose expected regret is more or less the square root of the loss of the best expert. In the case of unbounded losses, the authors' algorithm achieves Hannan consistency, in dependence of the average over the squared loss of all experts.

Forecasting. The next papers address general questions similar to those in online learning. For example, how much rewards can a forecaster receive in the limit or how can Solomonoff's nonrecursive forecaster be approximated? The settings considered include predictions of values of functions from N to N by deterministic machines as well as probabilistic forcasters dealing with functions of finite domains.

Marcus Hutter addresses mainly the question what can be said about the expected rewards on the long run. As they are less and less secure to be obtained, the author introduces some discounting factors for future rewards. He compares the average reward U received in the first m rounds with the discounted sum over all possible future rewards from some round k onwards. The author considers arbitrary discount and reward sequences; that is, the discounts need not to be geometric and the environments do not need to be Markov decision processes. He shows that the limits of U for m → ∞ and V for k → ∞ are equal whenever both limits exist. Indeed it can happen that only one limit exists or even none. Therefore, the author gives a criterion such that this criterion and the existence of the limit of U imply the existence of the limit of V. The author also provides such a criterion for the reverse implication.

Jan Poland investigates stochastic model selection. In particular he is interested in the use of the posterior (as used in Bayes' rule) for future predictions. There are three principle ways on how to do this which the author called “marginalization (integration over the hypotheses with respect to the posterior)”, “MAP (taking the a posteriori most probable hypothesis)” and “stochastic model selection (selecting a hypothesis at random according to the posterior distribution)”. For his work, the author makes two assumptions: that the hypothesis class is countable and that it contains the data generating the distribution. For the first two methods mentioned by the author (marginalization and MAP), strong consistency theorems are already known; these theorems guarantee almost sure convergence of the predictions to the truth and give bounds on the loss. The corresponding result was missing for the third method (stochastic model selection) and the author closes this gap.

Shane Legg dedicates his work to the question of how to overcome the principle problem, that Solomonoff's inductive learning model is not computable and thus not usable in the practice. Indeed people have tried from time to time to modify and weaken Solomonoff's rule such that one obtains general and powerful theories of prediction which are computable. Such algorithms exist indeed. The author analyses the Kolmogorov complexity of sequences and shows that the minimum Kolmogorov complexity of some recursive sequence not predicted by a predictor is approximately a lower bound for the Kolmogorov complexity of the predictor itself.

Boosting, Support Vector Machines and Kernel Methods. The next papers deal with specific algorithms or methods of learning. Support vector machines can be thought of as conducting linear classification using a large, even infinite, collection of features that are computed as a function of the raw inputs. A kernel provides inner products in the derived feature space, so efficiently computable kernels are useful for learning. Boosting is a method to improve a weak learner to a stronger one by identifying a collection of weak hypotheses that complement one another; this is often accomplished by training weak learners on data that has been reweighted to assign higher priority to certain examples.

Leonid Kontorovich, Corinna Cortes and Mehryar Mohri provide an embedding into feature space for which all members of the previously identified and expressive class of piecewise-testable languages are linearly separable. They also show that the kernel associated with this embedding can be computed in quadratic time.

Kohei Hatano investigates smooth boosting. Smooth boosting algorithms obey a constraint that they do not change the weight of examples by much; these have been shown to have a number of advantages. At the same time, a refinement of AdaBoost called InfoBoost, which takes a more detailed account of the strengths of the weak learners, has also been shown to have advantages. The author develops a new algorithm, GiniBoost, which incorporates both ideas. He provides a theoretical analysis and also adapts GiniBoost to the filtering framework.

Hsuan-Tien Lin and Ling Li investigate ordinal regression. This is a type of multiclass classification in which the classes are totally ordered (e.g. “one star, two stars, three stars,…”). The authors improve the theoretical treatment of this subject and construct two ORBoost algorithms which they compare with an adapted version of the algorithm RankBoost of Freund, Iyer, Shapire and Singer. Experiments were carried out to compare the two ORBoost algorithms with RankBoost, AdaBoost and support vector machines.

Reinforcement learning. In reinforcement learning, an agent can accrue immediate costs or benefits from its actions, but its actions also affect the environment, which can impact its prospects for long-term reward. One important effect is that, often, an agent only learns about actions that it takes, and, in constrast with other learning settings, not about actions that it could have taken.

Daniil Ryabko and Marcus Hutter consider reinforcement learning in the context where observations can have any form of stochastic dependence on past observations and actions. Such environments may be more general than Markov decision processes. The agent knows that its environment belongs to some given family of countably many environments, but the agent does not know in which of these environments it is. The agent tries − as usual − to achieve the best possible asymptotic reward. The authors study when there is an agent achieving asymptotically the best possible reward for any environment in the class and give some sufficient conditions on the class for the existence of such an agent.

Takeshi Shibata, Ryo Yoshinaka and Takashi Chikayama extend recent work on the learnability from positive data of some non-regular subclasses of context-free grammars to probabilistic languages. The authors introduce a new subclass of the simple grammars, called unifiable simple grammars. This is a superclass of the right-unique simple grammars, which Ryo Yoshinakai showed to be effeciently learnable from positive data in previous work. The authors show that the right-unique simple grammars are unifiable within their class of unifiable simple grammars. Furthermore, they generalise finite Markov decision processes to simple context-free decision processes. The authors apply their results on right-unique simple grammars and propose a reinforcement learning method on simple context-free decision proceeses.

Statistical Learning. Supervised learning means that the learner receives pairs (x₀,y₀), (x₁,y₁),… of items and their classifications. In the case of unsupervised learning, class designations are not provided. Nevertheless, in certain cases, it is still possible to extract from the distribution of the x_n useful information which either permits to reconstruct the y_n or to get information which is almost as useful as the original values y_n. Another field of learning is the construction of ranking functions: search machines like Google or Yahoo! must not only find on the internet the pages matching the requests of the users but also put them into an order such that those pages which the user searches are among the first ones displayed. Many of these ranking functions are not explicitly constructed but learned by analyzing the user behaviour, for example, by tracking down which links are accessed by the user and which not.

Andreas Maurer proposes a method of unsupervised learning from processes which are stationary and vector-valued. The learning method selects a low-dimensional subspace and tries to keep the data-variance high and the variance of the velocity vector low. The idea behind this is to make use of short-time dependencies of the process. In the theoretical part of the paper, the author obtains for absolutely regular processes error bounds which depend on the β-mixing coefficients and the consistency. The experimental part is done with image processing that the algorithm can learn feature maps which are geometrically invariant.

Atsuyoshi Nakamura studies the complexity of the class C of ranking functions which split the n-dimensional Euclidean space via k − 1 parallel hyperplains into subsets mapped to 1,2,…,k, respectively. He shows that the graph dimension of C is Θ(n + k), which is considerably smaller than the graph dimension of the corresponding decision list problem. The importance of the graph dimension is that it can be translated into an upper bound of the number of examples needed in PAC learning. The author also adapts his technique to show a risk bound for learning C.

©Copyright Notice:
The document of this page is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provision of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under German Copyright Law.

back to the ALT 2006 Proceedings Page

uparrow uparrow back to the ALT Archives