“Learning” is a complex phenomenon that is studied in different
scientific disciplines. A computer program with the ability to
“learn” contains mechanisms for gathering and evaluating
information and, consequently, for improving its performance.
Algorithmic Learning Theory provides a mathematical foundation
for the study of learning programs. It is concerned with the design
and analysis of learning algorithms. The analysis proceeds in a formal
model so as to provide measures for the performance of a learning
algorithm or for the inherent hardness of a given learning problem.
The variety of applications for algorithms that learn is reflected
in the variety of formal learning models. For instance, we can
distinguish between a passive mode of “learning from examples”
and active modes of learning where the algorithm has more control
over the information that is gathered. As for learning from examples,
a further decision is whether we impose statistical assumptions on the
sequence of examples or not. Furthermore, we find different success
criteria in different models (like “approximate learning” versus
“exact learning”).
The papers collected in this volume offer a broad view on the
current research in the field including studies on several learning
models (such as kernel-based learning, pac-learning, query learning,
inductive inference, learning from expert advice, online learning
and teaching). It goes without saying that these models are becoming
more and more refined in response to new challenges in computer
science (like taking advantage of the large amount of real-world data
that become available in digital form, or like serving the growing
need for adaptive techniques in the management and control of
Internet applications).
The volume is structured as follows. It first presents the invited
lectures and then the regular contributions. The latter are grouped
according to thematic categories. In the remainder of the introduction,
we provide the reader with a rough “road map”.
The invited lecture for ALT 2005 and DS 2005 by
Gary Bradshaw compares
Invention and Artificial Intelligence and points out some analogies.
Starting from two case studies, the invention of the airplane (1799–1909)
and the invention of a model rocket by a group of high school students
in rural West Virginia in the late 1950s, it is argued that general
principles of invention may be applied to expedite the development of
AI systems.
The invited lecture for DS 2005 by
Neil R. Smalheiser
gives a report on
the current status of the
Arrowsmith Project.
The roots of this project
trace back to the 1980s, when Don Swanson proposed the concept of
“undiscovered public knowledge” and published several examples in
which two disparate literatures held complementary pieces of knowledge
that, when brought together, made compelling and testable predictions
about potential therapies for human disorders. In continuation of this
work in the 1990s, Smalheiser and Swanson created a computer-assisted
search strategy (“Arrowsmith”). The lecture reviews the development
until today.
Ross D. King,
presenting the second invited lecture for DS 2005, pursues
the question whether it is possible to automate the scientific process.
This question is of increasing importance because, in many scientific
areas, data are being generated much faster than they can be effectively
analyzed by humans. In his lecture, King describes a physically implemented
robotic system that applies techniques from artificial intelligence to
carry out cycles of scientific experimentation. This is fleshed out by
describing the application of the system to complex research problems
in bioinformatics.
The invited lecture for ALT 2005 by
Vasant Honavar (co-authored by
Doina Caragea, Jun Zhang, Jie Bao, and Jyotishman Pathak) is concerned
with the problem of data-driven knowledge acquisition and decision-making.
Honavar describes the hurdles represented by massive size, semantic
heterogeneity, autonomy, and distributed nature of the data repositories.
He introduces some of the algorithmic and statistical problems that
arise in such a setting and presents algorithms for learning classifiers
from distributed data that offer rigorous performance guarantees relative
to their centralized counterparts. The lecture furthermore presents
techniques that help to cope with the problem of semantic heterogeneity.
The second invited lecture for ALT 2005 by
Chih-Jen Lin (co-authored by
Pai-Hsuen Chen and Rong-En Fan) concerns the convex quadratic optimization
problem that has to be solved by Support Vector Machines (SVMs).
Since the underlying cost matrix is high-dimensional and dense, this
cannot be done by classical methods such as interior-point methods.
Instead a decomposition method is used. It proceeds iteratively and has,
in each iteration, to solve a small-dimensional subproblem. Lin focuses
on decomposition methods of the SMO-type and elaborates how the
implementation of the procedure for “Working Set Selection” affects
the speed of convergence to the optimal solution. In a sense, Lin's
lecture offers the opportunity of looking deeply inside a widely used
learning machine.
We now turn our attention to the regular contributions contained in this
volume.
Kernel-based Learning:
During the last decade there has been a lot of interest in learning systems
that express the “similarity” between two “instances” as an inner product
of vectors in an appropriate feature space. The inner product operation is
often not carried out explicitly, but reduced to the evaluation of a so-called
kernel function which operates on instances of the original space. This offers
the opportunity to handle high-dimensional feature spaces in an efficient manner.
This strategy, introduced by Vapnik and co-workers in connection with the
so-called Support Vector Machine, is a theoretically well founded and very
powerful method that, in the years since its introduction, has already
outperformed many other learning systems in a wide variety of applications.
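As a concrete and entirely generic illustration of the kernel trick sketched above (our own example, not taken from any of the contributions), the following Python fragment checks that a quadratic kernel k(x, z) = (x·z)² coincides with the inner product of explicitly constructed degree-2 feature vectors; the kernel evaluation never materializes the higher-dimensional feature space.

```python
import numpy as np

def quadratic_kernel(x, z):
    """Kernel evaluated directly on the original instances: k(x, z) = (x . z)^2."""
    return float(np.dot(x, z)) ** 2

def feature_map(x):
    """Explicit degree-2 feature map with  phi(x) . phi(z) = (x . z)^2."""
    return np.array([x[i] * x[j] for i in range(len(x)) for j in range(len(x))])

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

# Both values are 20.25, but the kernel never builds the d^2-dimensional vectors.
print(quadratic_kernel(x, z))
print(float(np.dot(feature_map(x), feature_map(z))))
```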
Gretton, Bousquet, Smola and Schölkopf continue a line of research where the
kernel-based approach is used as a tool for the detection of statistical
dependencies. More precisely, they provide a measure of statistical dependence
between random variables based on the Hilbert-Schmidt norm of a cross
covariance operator. Their method has some theoretically appealing
features while being experimentally competitive to existing methods.
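As a hedged illustration of the kind of statistic involved, the biased empirical estimate of a Hilbert-Schmidt-type dependence measure is often written as tr(KHLH)/(n−1)², where K and L are kernel Gram matrices on the two samples and H is the centering matrix. The sketch below uses a Gaussian kernel and a normalization convention chosen by us; it is a generic rendering of that formula, not the authors' implementation or analysis.

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    """Gram matrix of a Gaussian (RBF) kernel on a one-dimensional sample."""
    d = x[:, None] - x[None, :]
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))

def hsic_biased(x, y, sigma=1.0):
    """Biased empirical HSIC-style statistic tr(K H L H) / (n-1)^2.
    (Normalization conventions vary in the literature.)"""
    n = len(x)
    K = gaussian_gram(x, sigma)
    L = gaussian_gram(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=200)
print(hsic_biased(x, x ** 2))                   # dependent sample: noticeably larger
print(hsic_biased(x, rng.normal(size=200)))     # independent sample: close to zero
```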
Kowalczyk and Chapelle study the surprising phenomenon of “anti-learning”,
where data sets are represented in such a way that some well-known standard
learning techniques lead to hypotheses performing much worse than random
guessing. While there seem to exist “real-life” data sets which are of
that type, the authors consider artificial data whose analysis sheds some
light on this (at first glance) counterintuitive phenomenon. They identify
some abstract properties of a given positive definite kernel which inevitably
lead to anti-learning (for some standard learning techniques including
“large-margin classification” or “nearest neighbour”). They furthermore
explain which kernel-transformations convert “anti-learning” into “learning”
and which do not.
Bayesian and Statistical Models:
Causal networks are graphical representations for “causality” between
random variables. Like Bayesian networks or other graphical models they
represent knowledge that is not as strict as formal logical statements but
is quite helpful for what is called “reasoning under uncertainty”.
Bayesian Inference leads to decisions that minimize a risk function.
Bayesian learning refers to the problem of inferring the unknown parameters
of a distribution (chosen from a known parameterized class of distributions).
Typically, the a priori distribution for the unknown parameters
gives support to a wide range of parameters, whereas the a posteriori
distribution is peaked around the true parameter values.
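A minimal, self-contained example of this prior-to-posterior transition (ours, not tied to any of the papers): for a Bernoulli parameter with a conjugate Beta prior, the posterior concentrates around the true parameter as observations accumulate.

```python
import numpy as np

# Beta(1, 1) prior: uniform over the unknown Bernoulli parameter theta.
alpha, beta = 1.0, 1.0

rng = np.random.default_rng(0)
true_theta = 0.3
data = rng.random(1000) < true_theta      # 1000 coin flips with bias 0.3

# Conjugate update: the posterior is Beta(alpha + #heads, beta + #tails).
alpha_post = alpha + data.sum()
beta_post = beta + (len(data) - data.sum())

posterior_mean = alpha_post / (alpha_post + beta_post)
posterior_std = np.sqrt(alpha_post * beta_post /
                        ((alpha_post + beta_post) ** 2 * (alpha_post + beta_post + 1)))
print(posterior_mean, posterior_std)      # mean close to 0.3, small spread
```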
By means of statistical information we can determine causal networks
only up to Markov-equivalence. An equivalence class can be represented
by a chordal chain graph that contains directed and undirected edges.
Each undirected edge represents two mutually correlated variables where
we cannot know whether causality (if any) goes in one or the other
direction. He, Geng and Liang describe a hierarchical partitioning of
a class of Markov-equivalent causal networks (given by a chordal chain
graph) into finer and finer subclasses up to the point where the
partition consists of the individual causal networks. They prove that,
at each stage of the refinement process, an equivalence class can be
again represented as a chordal chain graph.
Variational Bayesian Learning results from Bayesian Learning by
introducing a simplifying assumption (in case there are hidden variables)
that makes the approach computationally more tractable. Empirically it
is known to have good generalization performance in many applications.
Watanabe and Watanabe provide some additional theoretical support by
proving lower and upper bounds on the stochastic complexity in the
Variational Bayesian learning of the mixture of exponential families.
As for classification tasks, the Bayes classifier chooses the label that
has the maximum a posteriori probability. In order to apply Bayes' formula,
we have to know the a priori probabilities and (an approximation
of) the class-conditional densities. Thonangi and Pudi suggest a new
method for the approximation of the class-conditional densities: they use
a class-conditional probability model of the data (based on certain
conjunctions of binary features and the corresponding relative
frequencies) and determine the density function of maximum entropy
that satisfies this model.
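For concreteness, the following generic sketch of a Bayes classifier picks the label maximizing prior times class-conditional density; it uses simple Gaussian class-conditionals purely for illustration and is not the maximum-entropy model proposed by Thonangi and Pudi.

```python
import numpy as np

def gaussian_density(x, mean, var):
    """Class-conditional density p(x | y), modelled here as a Gaussian."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def bayes_classify(x, priors, means, variances):
    """Return the label with maximum a posteriori probability."""
    scores = [priors[y] * gaussian_density(x, means[y], variances[y])
              for y in range(len(priors))]
    return int(np.argmax(scores))

# Two classes with priors 0.7 / 0.3 and different means.
print(bayes_classify(0.9, priors=[0.7, 0.3], means=[0.0, 2.0], variances=[1.0, 1.0]))
```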
PAC-Learning:
In the PAC-learning model, the learner receives as input training examples,
drawn at random according to an unknown distribution and labeled according
to an unknown target function ƒ, and returns a “hypothesis”
h that
(with high probability of success) is a close approximation of ƒ. While
the first papers on pac-learning focussed on binary classification problems
with the probability of misclassification as an underlying pseudo-metric,
there have been many extensions of the basic model since then.
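In symbols (our paraphrase of the standard definition, writing D for the unknown distribution, m for the sample size, h_S for the hypothesis returned on sample S, and f for the target), the success criterion reads:

```latex
\Pr_{S \sim D^m}\Bigl[\operatorname{err}_D(h_S) \le \varepsilon\Bigr] \;\ge\; 1-\delta,
\qquad\text{where}\quad
\operatorname{err}_D(h) \;=\; \Pr_{x \sim D}\bigl[h(x) \ne f(x)\bigr],
```

and m is required to be polynomial in 1/ε, 1/δ and the relevant size parameters.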
Fakcharoenphol and Kijsirikul revisit the well-studied problem of converting
base binary learners (coping with binary classification problems) into
multi-class learners. They reconsider various classical constructions,
including “one-versus-all” and the “(adaptive) directed acyclic graph
approach”, and come up with some new generalization error bounds.
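To recall the flavour of the “one-versus-all” construction, here is a generic sketch under our own assumptions, with a stand-in centroid-based base learner; it does not reproduce the authors' constructions or bounds.

```python
import numpy as np

def centroid_learner(X, labels):
    """Stand-in binary base learner: score = distance to the negative centroid
    minus distance to the positive centroid (purely illustrative)."""
    pos, neg = X[labels == 1].mean(axis=0), X[labels == 0].mean(axis=0)
    return lambda x: np.linalg.norm(x - neg) - np.linalg.norm(x - pos)

def one_versus_all(X, y, classes, base_learner=centroid_learner):
    """Train one binary scorer per class (class k against the rest) and
    predict by taking the most confident scorer."""
    models = {k: base_learner(X, (y == k).astype(int)) for k in classes}
    return lambda x: max(models, key=lambda k: models[k](x))

X = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.0], [3.1, 2.9], [0.0, 3.0], [0.2, 3.1]])
y = np.array([0, 0, 1, 1, 2, 2])
predict = one_versus_all(X, y, classes=[0, 1, 2])
print(predict(np.array([0.1, 2.9])))   # expected: class 2
```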
The paper by Ryabko relates “PAC-learnability” and “computability” by
showing the following result: the number of examples needed by a recursive
PAC-learner to learn a computable function cannot be upper-bounded by any
computable function. The result is obtained by applying a negative result
on data compression based on Kolmogorov complexity.
The papers by Palmer and Goldberg and by Guttman, Vishwanathan and
Williamson are both dealing with so-called Probabilistic Deterministic
Finite State Automata (PDFA). A PDFA, in contrast to a DFA (its deterministic
counterpart), performs random state transitions and thus represents a
probability distribution over strings. In a recent paper from 2004, it was
shown by Clark and Thollard that the class of PDFAs is polynomially
pac-learnable where the polynomial (which bounds sample size and
run-time) depends on the size of the target PDFA, a measure for the
“distinguishability” of states and the expected length of strings
generated from any state. Their pseudo-metric (measuring the distance
between the target distribution and the hypothesis) is the KL-divergence.
By passing from KL-divergence to the variation distance (still useful
for pattern classification tasks), Palmer and Goldberg are able to
considerably simplify the proof for PAC-learnability and to remove
the dependence on the expected string length. Guttman, Vishwanathan
and Williamson extend the result by Clark and Thollard into another
direction by generalizing the notion of “distinguishability”.
Query-Learning:
In the query-learning model, the learner is allowed to ask certain
queries to an oracle. The learning performance is measured by the
number and type of queries needed to exactly (sometimes approximately)
identify an unknown target concept ƒ (where “concept” means
“binary-valued function”). Among the most popular query
types are “membership queries (MQs)” (asking for the class label of an
instance) and “equivalence queries (EQs)” (asking for a “counterexample”
to the current hypothesis). Statistical queries can be used to get a
statistical estimate for the probability of an event that involves
a labeled random example. This query type plays a special role because
an algorithm that learns by means of statistical queries can be converted
into a noise-tolerant pac-learning algorithm.
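As a toy illustration of exact learning with membership queries (our own textbook-style example, independent of the papers below): a monotone conjunction over n Boolean variables can be identified exactly with n membership queries, since variable i is relevant exactly when flipping it to 0 in the all-ones assignment makes the label 0.

```python
def learn_monotone_conjunction(n, membership_query):
    """Exactly identify a monotone conjunction over variables 0..n-1
    using n membership queries."""
    relevant = []
    for i in range(n):
        assignment = [1] * n
        assignment[i] = 0
        if membership_query(assignment) == 0:   # turning x_i off falsifies the target
            relevant.append(i)
    return relevant

# Hypothetical oracle for the target x_0 AND x_2 over 4 variables.
target = {0, 2}
oracle = lambda a: int(all(a[i] == 1 for i in target))
print(learn_monotone_conjunction(4, oracle))    # [0, 2]
```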
Bennet and Bshouty investigate the possibility of learning
“attribute-efficiently” (such that the time-bound grows only
sub-linearly with the number of irrelevant attributes) with “corrupted
oracles” (whose answers may occasionally be wrong or missing). Their
main result is that an attribute-efficient learner expecting a perfectly
reliable oracle of type MQ or EQ can be converted into an
attribute-efficient learner that copes with corrupted oracles. (This
extends a result from FOCS 2004, where attribute-efficiency had not been
an issue.)
As shown by Köbler and Lindner in their paper presented at ALT 2002,
DNF-formulas can be learned by statistical queries under the uniform
distribution. The paper by Lindner in this volume improves on this result
by showing that the learning algorithm can cope with larger inaccuracies
of the empirical estimates provided by the oracle. A similar remark holds
for the related model of “learning by proper distance queries”.
Kato, Matsumoto and Miyahara discuss the learnability of certain logic
programs, called “Elementary Formal Systems (EFS)”, that represent
formal languages. These programs can, for instance, succinctly represent
the popular “regular pattern languages” whose learnability has been
well studied in the past. It is shown that EFS-definable languages can
be learned in polynomial time with membership and superset queries.
Furthermore, it is shown that various other combinations of query
types are insufficient for this purpose.
Jain, Lange and Zilles extend work by Lange and Zilles from COLT
2004 and ALT 2004. The latter papers revealed surprising relations
between Gold-style and query learning of indexable classes of
recursive languages. In their contribution to this volume, they
examine (arbitrary) classes of recursively enumerable languages
and analyze which of the relations still hold. It turns out that,
although some of the relations are lost, there is still a nice
hierarchy with several cross-connections between Gold-style and
query learning.
Inductive Inference:
In Gold's model of “learning in the limit from positive data”,
the learner should identify an unknown recursive (or recursively
enumerable) target language L (where L belongs to a class C
that is known to the learner). To this end, it receives a “text”
(basically an infinite list of the words in L). It keeps track
of a (description of a) “hypothesis” that is updated after the
reception of the next word from the text. We say that C is
learned in the limit, if the learner is able to identify every
language L ∈ C after finitely many steps. Within this model,
one distinguishes between “explanatory learning” (learner converges
to one description of the target concept) and “behaviourally
correct learning” (learner converges to the target concept but
may, in principle, use arbitrarily many descriptions for it). Between
these two extremes, we find the possibility of converging to the target
and using (in the limit) at most k different descriptions. This
leads to what is called “vacillatory learning” (actually an infinite
hierarchy of models that is in-between explanatory and behaviourally
correct learning). Another notion of interest is “U-shaped” versus
“Non U-shaped” learning, where a learner exhibits U-shaped behaviour
if it passes from the correct hypothesis to an incorrect one, but
later returns to the correct one. This kind of learning behaviour is
also discussed in the cognitive psychology literature.
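A standard way to see how learning in the limit can succeed is “identification by enumeration”: output the least index of a language consistent with the text seen so far. The sketch below is our own toy illustration for the class L_k = {0, ..., k}; it is not drawn from the contributions.

```python
def identify_in_the_limit(text_prefix, membership):
    """Identification by enumeration: return the least hypothesis index whose
    language contains every word seen so far.
    `membership(k, w)` decides whether word w belongs to the k-th language."""
    k = 0
    while not all(membership(k, w) for w in text_prefix):
        k += 1
    return k

# Toy class: L_k = {0, 1, ..., k}.  A text for L_3 eventually forces hypothesis 3.
member = lambda k, w: w <= k
for prefix in ([0], [0, 2], [0, 2, 1, 3], [0, 2, 1, 3, 3]):
    print(identify_in_the_limit(prefix, member))    # 0, 2, 3, 3 -- converges to 3
```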
Carlucci, Case, Jain and Stephan continue a line of research that
explores whether U-shaped learning behaviour is logically necessary
within the model of learning in the limit from positive data.
It was known that U-shaped learning behaviour is, indeed, necessary
for “behaviourally correct learning” but not for “explanatory
learning”. The authors refine this result in several ways. They show
that non U-shaped vacillatory learning collapses to explanatory learning.
Moreover, a U-shaped learner on the second level of the hierarchy can be
transformed into a non U-shaped behaviourally correct learner, whereas
a transformation of this kind is not in general possible if the U-shaped
learner is located on the third (or higher) level of the hierarchy.
Jain and Kinber consider the following variant of learning in the limit
from positive data: given a text for a union of n (pairwise disjoint)
languages, find in the limit m grammars such that each of them
generates some union of the n languages and each of the n languages
occurs in exactly one of these unions. The paper provides considerable
insight into this model.
Language Learning:
The next group of papers is concerned with learning languages in the
limit from positive data. It is known, for instance, that bounded unions
of non-erasable pattern languages and erasable regular pattern languages
are learnable in this model whereas regular languages are not.
Furthermore, a learning strategy that selects a language from the
version space that is minimal with respect to inclusion ensures
identification in the limit provided that each language L in the
target class has a characteristic set (i.e., a finite subset S
with L as the unique minimal language from the target class that
contains S).
Ng and Shinohara introduce another notion of minimality that refers
to the number of elements in a language that are shorter than a specified
length. They show that this notion of minimality and a suitably
adapted notion of a characteristic set fit the same purpose for learning
languages in the limit from positive data as the classical notions
(with a range of applicability that is no smaller). Thus, the
selection of a language from the version space that is minimal in the
new sense is an interesting alternative to the selection of an
inclusion-minimal language.
Clark and Eyraud consider the class of substitutable context-free
languages (building on the notion of substitutability by Harris from
the 1950s). They show that every substitutable context-free language
has a characteristic set, which implies their learnability in the limit
from positive data.
Fernau identifies a non-trivial subclass of regular languages
(representable by special regular expressions called “simple looping
expressions” in the paper) and presents an algorithm that has a
polynomial update time and learns simple looping expressions in the
limit from positive data.
Learning and Logic:
The following group of papers again deals with the model of learnability
in the limit from positive data. The target classes under consideration
are Prolog programs and logical structures, respectively.
The paper by Rao continues the research project of identifying
fragments of Prolog that are learnable in the limit from positive
data. The new fragment studied in the paper (“recursion bounded”
Prolog programs) is incomparable with the previously identified
fragments and contains programs that neither belong to the class
of “linearly-moded programs” (introduced by Rao himself) nor to
the class of “safe programs” (introduced by Martin and Sharma).
Jain, Martin and Stephan discuss the problem of learning logical
structures in the limit from positive examples, where the positive
examples are logical formulas that are satisfied by the unknown target
structure. They prove learnability in this model except for a set of
logical structures of measure zero.
Learning from Expert Advice:
In the classical model of learning an unknown binary sequence
from expert advice, the learner predicts the sequence bit by bit, so
that learning proceeds in rounds. In every round, it first receives
N expert predictions p_1, ..., p_N ∈ [0,1]. Then, it makes its
own prediction p ∈ [0,1] (depending on the experts' predictions).
At the end of the round, it receives the next bit b of the sequence
and suffers loss |p−b| (the probability of making a wrong
prediction when 1 is chosen with probability p). The learner is
exposed to a worst-case analysis (like playing against an intelligent
adversary) in an “agnostic” setting (where any bit sequence is
conceivable). The goal of the learner is to keep the difference
between its own total loss and the total loss of the best expert
as small as possible. This difference is called the “regret” of
the learner. Many versions of this basic model are found in the
literature. The goal within the theoretical analysis is finding
tight bounds on the smallest possible regret.
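As a concrete point of reference (our own generic sketch, not one of the algorithms analyzed in the papers below), the exponentially weighted average forecaster realizes this protocol; with a suitably tuned learning rate its regret is of order sqrt(T ln N). The fixed rate eta below is a placeholder choice.

```python
import numpy as np

def exp_weighted_forecaster(expert_preds, bits, eta=0.5):
    """Prediction with expert advice under absolute loss |p - b|.
    expert_preds: T x N array of expert predictions in [0, 1];
    bits: length-T binary sequence."""
    T, N = expert_preds.shape
    weights = np.ones(N)
    learner_loss, expert_losses = 0.0, np.zeros(N)
    for t in range(T):
        p = np.dot(weights, expert_preds[t]) / weights.sum()   # weighted average prediction
        b = bits[t]
        learner_loss += abs(p - b)
        losses = np.abs(expert_preds[t] - b)
        expert_losses += losses
        weights *= np.exp(-eta * losses)        # exponential weight update
    return learner_loss, expert_losses.min()

rng = np.random.default_rng(1)
preds = rng.random((500, 10))
bits = rng.integers(0, 2, size=500)
total, best = exp_weighted_forecaster(preds, bits)
print(total - best)          # the regret of the learner
```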
The paper by Harada, Takimoto and Maruoka studies the problem of
sequentially assigning a probability vector p over k options.
Then a cost vector c over the k options is revealed and the
learner suffers loss p^T c (the expected cost under p). Vector
p is chosen depending on (an average of) N probability
vectors that represent the “expert advice” in this setting.
The authors generalize the known performance bounds for two
well-known algorithms (Hedge Algorithm and Aggregating Algorithm)
to the case where each available option has a different range of
possible losses. As mentioned in the paper, the setting can be
applied to the problem of finding a route with minimum total
expected delay through a network.
The paper by Poland and Hutter and the paper by Kujala and Elomaa
both study a variation of a problem that was introduced by Kalai
and Vempala at COLT 2003. The latter authors consider a broad class
of online optimization problems that contains “learning from expert
advice” as a special case. They furthermore present and analyze a
strategy named “follow the perturbed leader”. Poland and Hutter
extend this work in the bandit setting to the case of a countably
infinite pool of experts, and to the case in which, instead of
uniform upper bounds on the cost in each round, there are (slowly)
increasing bounds. Kujala and Elomaa likewise consider the framework
of Kalai and Vempala. They apply the strategy of McMahan and Blum
from COLT 2004 (which is designed such as to cope with an adaptive
adversary) to the more restricted case of an oblivious adversary.
This leads to stronger regret bounds.
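The “follow the perturbed leader” idea admits a very compact description. The following is our own minimal rendering for the finite expert setting, with exponentially distributed perturbations and a placeholder parameter epsilon; it is not the refined bandit or adaptive-adversary variants studied in the two papers.

```python
import numpy as np

def follow_the_perturbed_leader(loss_matrix, epsilon=0.1, seed=0):
    """FPL over N experts: in each round pick the expert whose cumulative loss,
    minus a fresh random perturbation, is smallest.
    loss_matrix: T x N array of per-round losses in [0, 1]."""
    rng = np.random.default_rng(seed)
    T, N = loss_matrix.shape
    cumulative = np.zeros(N)
    total_loss = 0.0
    for t in range(T):
        perturbation = rng.exponential(scale=1.0 / epsilon, size=N)
        leader = int(np.argmin(cumulative - perturbation))
        total_loss += loss_matrix[t, leader]
        cumulative += loss_matrix[t]            # losses revealed after the choice
    return total_loss, cumulative.min()         # learner's loss vs. best expert

losses = np.random.default_rng(2).random((1000, 5))
print(follow_the_perturbed_leader(losses))
```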
The paper by Henderson, Shawe-Taylor and Žerovnik generalizes
the results in the standard model of learning a binary sequence
with expert advice to the multi-dimensional case of a binary
vector sequence.
Online Learning:
In the classical model of “online learning”, there is a
class of target functions that is fixed in advance. Learning
proceeds in rounds. In every round, the learner first receives
an instance x, then makes a prediction y′ and, at the
end of the round, receives the true label y=ƒ(x) and suffers
loss l(y′,y). In case of a Boolean target class,
a natural choice for l is the zero-one loss (such that the
accumulated loss counts the number of mistakes). Many versions
of this basic model are found in the literature. The goal
within the analysis is finding the best possible loss bounds
(or mistake bounds, respectively) for a given class of target functions.
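A classic illustration of the mistake-bound model for a finite Boolean class is the halving algorithm, which predicts by majority vote over the version space and makes at most log2 |C| mistakes. The sketch below is a generic textbook example of ours, not taken from the contributions.

```python
from itertools import product

def halving_algorithm(concept_class, instances, target):
    """Online prediction with the halving algorithm: predict by majority vote
    over the version space; each mistake at least halves it, so the number of
    mistakes is at most log2(|C|)."""
    version_space = list(concept_class)
    mistakes = 0
    for x in instances:
        prediction = int(sum(c(x) for c in version_space) * 2 > len(version_space))
        y = target(x)                                        # true label revealed
        if prediction != y:
            mistakes += 1
        version_space = [c for c in version_space if c(x) == y]
    return mistakes

# Toy class: all 16 Boolean functions over a 2-bit domain; the target is XOR.
domain = [(0, 0), (0, 1), (1, 0), (1, 1)]
concept_class = [lambda x, tt=tt: tt[domain.index(x)] for tt in product((0, 1), repeat=4)]
print(halving_algorithm(concept_class, domain * 3, lambda x: x[0] ^ x[1]))  # at most 4
```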
The paper by Mesterharm extends the standard worst-case online
learning framework by allowing delayed label feedback. It is
shown how a given standard online learner can be converted into
a learner that is able to cope with delay, and it is analyzed how the
mistake bounds are related. The analysis is provided for both
the adversarial setting and the setting in which the instances are
drawn according to a (slowly) drifting distribution.
Chernov and Hutter establish bounds on future prediction errors
in online prediction of computably stochastic sequences. This is
a setting where fairly tight bounds exist for the total number of
errors, leading one to suspect that the future errors must
necessarily decrease with time. The main contribution of this
paper is a quantitative treatment of this observation in terms
of (a variant of) Kolmogorov complexity.
Defensive Forecasting:
The next series of papers investigates the problem of predicting (in
an online fashion) the binary label associated with a given instance.
The setting is agnostic, i.e., it is not assumed that there
is a hidden function (or distribution) which maps instances to
labels. Instead of using a comparison class or expert advice
(such that the learner competes with the best function from the
comparison class or with the best expert), the analysis aims at
verifying two abstract properties of the Forecaster's prediction
algorithm: being “well-calibrated” and having “high resolution”.
If a prediction p∈[0,1] is viewed as the Forecaster's “confidence”
that the (unknown) binary label associated with a given instance x
is 1 (where the label is revealed by the adversary after the Forecaster
made its commitment), then being well calibrated roughly means that the
Forecaster's confidence estimates are accurate on average. Having high
resolution roughly means that the Forecaster is “sufficiently
specific” to the given instance x.
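To make “well-calibrated” concrete, a common empirical check (our own illustration, unrelated to the specific machinery of the papers below) bins the forecasts and compares the average forecast in each bin with the empirical frequency of label 1; for a well-calibrated forecaster the two are close.

```python
import numpy as np

def calibration_table(forecasts, labels, bins=10):
    """For each bin of forecast values, compare the mean forecast with the
    empirical frequency of label 1."""
    forecasts, labels = np.asarray(forecasts), np.asarray(labels)
    edges = np.linspace(0.0, 1.0, bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (forecasts >= lo) & (forecasts < hi)
        if mask.any():
            rows.append((forecasts[mask].mean(), labels[mask].mean(), int(mask.sum())))
    return rows   # (mean forecast, empirical frequency, count) per bin

rng = np.random.default_rng(3)
p = rng.random(5000)
y = (rng.random(5000) < p).astype(int)   # labels drawn with probability p: calibrated by construction
for mean_p, freq, n in calibration_table(p, y):
    print(f"{mean_p:.2f}  {freq:.2f}  {n}")
```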
Vovk describes a prediction algorithm for the Forecaster that provably
is well-calibrated and has high resolution. The algorithm is kernel-based.
The analysis is quite involved with results given in terms of the
Fermi-Sobolev norm of Lipschitzian functions.
The next paper in the volume (again by Vovk) demonstrates how one
can derive classical results in the “learning with expert advice” setting
starting from results of the preceding paper. Here, the “expert class” is
infinite-dimensional. Basically the learner competes with any decision
rule of bounded Fermi-Sobolev norm. The main result is quite general and
applies to a wide variety of decision-theoretic scenarios.
The third result in this series about “defensive forecasting” is the paper
by Vovk and Nouretdinov which goes one step further. It considers multi-class
forecasting and inserts an additional player called “Skeptic”.
Loosely speaking, the Skeptic tries to make money on any lack of agreement
between Forecaster and Reality. The main result basically states that,
for any continuous strategy of Skeptic, there exists a strategy of Forecaster
that does not allow Skeptic's capital to grow, regardless of what Reality
is doing.
Teaching:
In the teaching model, a teacher delivers examples to a learner. The
examples are labeled according to a target function (known to the teacher
but unknown to the learner). In order to avoid collusion between a learner
and a teacher (such as agreement on a coding scheme which allows the teacher
to encode a representation of the target function), the success criterion
for teaching must be carefully defined. One popular definition prescribes
that the target should be the only function which is left in the version
space after all examples in the teaching sequence have been processed.
It turns out, unfortunately, that, with this definition, many harmless
looking concept classes may require intolerably long teaching sequences
in the worst-case. For this reason, one finds several alternative success
criteria in the literature.
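To illustrate the success criterion just described, the following brute-force sketch (ours, and feasible only for tiny classes) computes a smallest teaching set for a target concept: the smallest set of labeled examples that leaves only the target in the version space.

```python
from itertools import combinations

def minimal_teaching_set(concept_class, target, domain):
    """Smallest set of labeled examples consistent only with `target`
    within `concept_class` (brute force over subsets of the domain)."""
    for size in range(len(domain) + 1):
        for subset in combinations(domain, size):
            consistent = [c for c in concept_class
                          if all(c[x] == target[x] for x in subset)]
            if consistent == [target]:
                return [(x, target[x]) for x in subset]
    return None

# Toy class over domain {0, 1, 2}: concepts given as dictionaries x -> label.
domain = [0, 1, 2]
C = [{0: 0, 1: 0, 2: 0}, {0: 1, 1: 0, 2: 0}, {0: 1, 1: 1, 2: 0}, {0: 1, 1: 1, 2: 1}]
print(minimal_teaching_set(C, C[1], domain))   # e.g. [(0, 1), (1, 0)]
```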
Balbach and Zeugmann make a new suggestion for a teaching model where
collusion is avoided in a different fashion. They augment the concept
class by a neighbourhood relation and let the teaching process proceed
in rounds. In each round, the teacher presents a new example x. If the
learner's current hypothesis h errs on x, the learner may pass from h to
a new hypothesis h′ from the neighbourhood (where h′ must be chosen
such as to make a minimal number of mistakes on the seen examples).
Balbach and Zeugmann analyze several variants of this approach and relate
it to the classical models. It is a nice feature of their approach that
it leads to meaningful statements about teachability of classes over an
infinite domain.