Learning Theory emerged roughly forty years ago, and some of the pioneering
work is still influential, e.g., Gold's (1967) paper on Language
Identification in the Limit (cf. Information and Control 10:447-474).
Despite the huge amount of research invested in this area, the state of the art
in modeling learning is still much less satisfactory than in other
areas of theoretical computer science. For example, computability theory
emerged around 60 years ago. Initially, many different
models were introduced, e.g., Turing machines, partial recursive
functions, and Markov algorithms. Later on, however,
all those models were proved to be equivalent.
The situation in algorithmic learning theory is, however, quite different. Numerous mathematical models of learning have been proposed during the last three decades. Nevertheless, different models give vastly different results concerning the learnability and non-learnability of the objects one wants to learn. Hence, finding an appropriate definition of learning that covers most aspects of learning is itself one of the goals of algorithmic learning theory. Additionally, it is necessary to develop a unified theory of learning as well as techniques to translate the resulting theories into applications. On the other hand, machine learning has also found interesting and non-trivial applications, but our ability to thoroughly analyze the implemented systems has often not kept pace.
Moreover, the data collected in various fields such as biology, finance, retail, astronomy, and medicine are nowadays growing extremely rapidly, but our ability to discover useful knowledge in such huge data sets is still too limited. Clearly, powerful learning systems would be an enormous help in automatically extracting new interrelations, knowledge, patterns and the like from these and other huge collections of data. Thus, there are growing challenges to the field of machine learning and its foundations that require further efforts to develop the theories needed to provide, for example, performance guarantees, and to automate the development of relevant software.
Each learning model specifies the learner, the learning domain, the source of information, the hypothesis space, the background knowledge available and how it can be used, and, finally, the criterion of success. To see how different models arise by specifying these parameters, we shall outline various possibilities throughout this introduction. While the learner is always an algorithm, it may be restricted in one way or another, e.g., by requiring it to be space and/or time efficient. A typical learning domain is formed by unknown concepts, such as table, chair, car, or different diseases. What is then aimed at is learning a ``rule'' that separates positive examples from negative ones. For example, given a description of symptoms, the ``rule'' learned must correctly classify whether or not a particular disease is present.
In his invited lecture, Maruoka (co-authored by Takimoto) studies structured weight-based prediction algorithms. The prediction algorithms considered have a pool of experts, say E1,..., EN, each of which, at each trial and on input any member of the underlying domain, outputs either zero or one. That is, the experts perform a classification task as described above. The algorithm combines these predictions in some way and outputs its own classification. Afterwards, the true classification is received, and the next trial starts. If the learner's output was wrong, a prediction error has occurred. When making a prediction error, the learner incurs a penalty that is expressed by a suitably defined loss function, and the learning goal consists in minimizing this loss. The experts are assigned weights which are updated after each trial. While previous work considered the experts to be arranged in a single layer, the present paper outlines interesting ways to arrange the experts in a tree structure. As a result, the expert model can be applied to search for the best pruning in a straightforward fashion by using a dynamic programming scheme.
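The one-layer case that the paper generalizes can be sketched as follows. This is a minimal rendition of the classical weighted-majority scheme of Littlestone and Warmuth, not the tree-structured variant of the paper; the function name and the penalty factor are illustrative choices.

```python
# Sketch of the classical one-layer weighted-majority scheme
# (Littlestone & Warmuth); the paper's tree-structured model
# generalizes this arrangement of experts.

def weighted_majority(experts, trials, beta=0.5):
    """experts: list of functions mapping an input to 0 or 1.
    trials: iterable of (x, true_label) pairs.
    beta: multiplier applied to an expert's weight on each mistake."""
    weights = [1.0] * len(experts)
    mistakes = 0
    for x, y in trials:
        votes = [e(x) for e in experts]
        # combine the experts' predictions by weighted vote
        w1 = sum(w for w, v in zip(weights, votes) if v == 1)
        w0 = sum(w for w, v in zip(weights, votes) if v == 0)
        prediction = 1 if w1 >= w0 else 0
        if prediction != y:
            mistakes += 1
        # penalize every expert that erred on this trial
        weights = [w * beta if v != y else w
                   for w, v in zip(weights, votes)]
    return mistakes

# a pool containing one perfect expert (parity) and two constant ones
experts = [lambda x: 0, lambda x: 1, lambda x: x % 2]
trials = [(i, i % 2) for i in range(10)]
print(weighted_majority(experts, trials))  # → 0
```

The loss here is simply the mistake count; the weight update guarantees that the learner's mistakes stay within a constant factor of those of the best expert.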
Learning Logic Programs and Formulae
Inductive logic programming and learning logic programs constitute a core area of the ALT meetings, too. The basic scenario can be described as follows. The learner is generally provided background knowledge B as well as (sequences of) positive examples (E+) and negative examples (E-) (all of which can be regarded as logic programs, too). However, E+ and E- usually contain only ground clauses with empty body. The learning goal consists in inferring a hypothesis H (again a logic program) such that E+ can be derived from B & H while B & H & E- is contradiction free. Again, if growing initial segments of E+ and E- are provided, we arrive at learning in the limit. Variations of this model include the use of membership queries to obtain E+ and E- and of equivalence queries (or disjointness queries) to terminate the learning process. Recently, various authors have also looked at different ways to formulate queries, and we shall describe some of them when discussing the relevant papers in these proceedings. Alternatively, E+ and E- may be drawn randomly with respect to some unknown probability distribution, and the learner is required to produce with high confidence a hypothesis that has small error (both measured with respect to the underlying probability distribution).
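To make the success criterion concrete, here is a toy, purely ground illustration (all predicate names are hypothetical): B and H are sets of definite clauses over ground atoms, forward chaining computes everything derivable from B & H, and we check that all of E+ and nothing of E- is derived. Real ILP systems work with first-order clauses; this sketch only conveys the criterion.

```python
# Ground forward chaining: a clause is a pair (head, body), where body
# is a list of ground atoms; facts are clauses with empty body.

def derive(clauses):
    facts = {h for h, body in clauses if not body}
    changed = True
    while changed:
        changed = False
        for head, body in clauses:
            if head not in facts and all(a in facts for a in body):
                facts.add(head)
                changed = True
    return facts

B = [("parent(ann,bob)", []), ("parent(bob,cid)", [])]
H = [("anc(ann,bob)", ["parent(ann,bob)"]),
     ("anc(bob,cid)", ["parent(bob,cid)"]),
     ("anc(ann,cid)", ["anc(ann,bob)", "anc(bob,cid)"])]
E_pos = {"anc(ann,cid)"}
E_neg = {"anc(cid,ann)"}

# success criterion: B & H derives all of E+ and none of E-
derived = derive(B + H)
print(E_pos <= derived and not (E_neg & derived))  # → True
```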
In his invited lecture, Wrobel addresses scalability issues in inductive logic programming. This research is motivated by demands in knowledge discovery in databases. Very often, the available databases store their information in relational database management systems. On the other hand, most currently used knowledge discovery methods rely on propositional data analysis techniques such as decision tree learners (C4.5) or propositional rule learning methods (e.g., CN2). While these methods have the benefit of being efficient in practice, they cannot directly deal with the multiple relations that are typical for relational databases. Inductive logic programming is much better suited for such purposes, but until recently it lacked the efficiency to deal with huge amounts of data. Scalability turned out to be very important in this regard, and Wrobel's paper presents the progress made recently.
In their invited paper, McCreath and Sharma present LIME, a system for learning relations. Their inductive logic programming system induces logic programs from ground facts. This is done via a Bayesian heuristic, where the explosion of the search space is tamed by exploiting the structure of the hypothesis space. Prior probabilities are assigned by applying Occam's razor, i.e., simpler hypotheses get higher probability. Favoring the Bayesian approach leads to another peculiarity. In contrast to a variety of other approaches, LIME does not grow a clause one literal at a time. Instead, it builds candidate clauses from groups of literals. The clauses obtained can be efficiently combined to obtain new clauses such that the coverage of the resulting clause is just the intersection of the coverages of the clauses it is built from. The overall result is a learning system for the fixed example size framework that is, as experiments show, particularly good when it has to learn recursive definitions in the presence of noise.
Krishna Rao and Sattar present a polynomial time learning algorithm for a rich class of logic programs, thereby considerably extending (and partially correcting) results obtained by Arimura (cf. H. Arimura, Learning acyclic first-order Horn sentences from entailment, in Proc. ALT'97, Lecture Notes in Artificial Intelligence 1316, pp. 432-445). The information sources are equivalence, subsumption and request-for-hint queries. The input to a subsumption query is a clause C, and it is answered ``yes'' iff C is a tautology or H* models C, where H* denotes the target concept; otherwise, the answer is just ``no.'' A request-for-hint query takes as input a ground clause C, and is answered ``yes'' provided C is subsumed by H*. Otherwise, the reply is ``no'' and a hint is returned, i.e., an atom, along with a suitable substitution, that can be refuted from the target and the body of the ground clause. As a matter of fact, all these queries can be answered in time polynomial in the length of the target and C. The main new feature of their article is the target class of finely-moded logic programs, which allows the inclusion of local variables. Moreover, previously learned background knowledge is used incrementally during the learning process.
Besides ILP and the techniques developed within this framework, there is another major line of research that conceptually fits into the setting of learning logic programs, i.e., learning subclasses of concepts expressible by elementary formal systems (abbr. EFS). This year, Sugimoto continues along this line of research by extending EFS's to linearly-moded EFS's. Again, the main new feature is the inclusion of local variables, which turn out to be important for defining translations over context-sensitive languages. The main goal is then the design of an efficient learner for such translations from input-output sentences, i.e., positive examples only. This goal is partially achieved by providing an algorithm that learns the whole class of translations definable by linearly-moded EFS's in which the number of clauses in the defining EFS's and the length of each clause are bounded by some a priori fixed constant. A natural generalization of this result would be to bound the length of the output sentences by a constant multiple of the input length. However, the resulting class of translations is not learnable from data at all.
Yamamoto provides a rigorous analysis of several learning techniques developed in the area of machine learning, such as saturant generalization, bottom generalization, V*-operation with generalization, and inverse entailment. The main goal is to present a unifying framework from which all these methods can be obtained as instances. Within this framework, the methods are compared to one another, e.g., with respect to their inference power.
Satoh deals with the case-based representability of Boolean functions. Intuitively, a case base contains knowledge already obtained. If a new task has to be handled, i.e., one that is not in the case base, one looks for a case that is similar, where similarity is measured by an appropriate measure. Finding good similarity measures has attracted considerable attention. Looking at a similarity measure that is based on set inclusion of the different attributes in a case, Satoh establishes nice connections to the monotone theory initiated by Bshouty in 1993.
Finally, Verbeurgt addresses the problem of learning subclasses of monotone DNF. Learning DNF efficiently is certainly of major interest but apparently very hard. The learning model considered is a variant of Valiant's PAC learning model. In this model, examples are drawn randomly, and, with high confidence, one has to find a hypothesis that has only a small error. Here, the error is measured by summing up the probabilities of all elements in the symmetric difference of the target and the hypothesis output. Usually, the learner has to succeed for every underlying probability distribution. The version considered by Verbeurgt restricts the class of probability distributions to the uniform distribution. He extends previous results on read-once DNF within this model by refining the Fourier analysis methodology.
Learning Formal Languages
In this setting, the learning domain is the set of all strings over some fixed finite alphabet, and the objects to be learned are subsets of strings. The source of information may be growing initial segments of sequences exhausting all strings over the underlying alphabet, each classified with respect to its containment in the target language (learning from informant), or growing initial segments of sequences eventually containing all and only the strings that belong to the target language (text, or positive data). The learner then has to map finite sequences of strings into hypotheses about the target language. The investigation of scenarios in which the sequence of computed hypotheses stabilizes to a correct finite description (e.g., a grammar, an acceptor, a pattern) of the target has attracted much attention and is referred to as learning in the limit. Instead of requesting the learner to converge syntactically, one can also consider semantic convergence, i.e., beyond some point, only hypotheses are output that generate the same language. The latter scenario is usually referred to as behaviorally correct learning. As for the present volume, there are several papers dealing with these models.
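A minimal illustration of learning in the limit from text: for the class of all finite languages, the learner may simply conjecture the set of strings seen so far, and once the text has exhausted the (finite) target, the hypotheses stabilize. The code is an illustrative sketch of the definition, not taken from any of the papers discussed.

```python
# A limit learner for the class of all finite languages: on each
# growing initial segment of a text, conjecture exactly the set of
# strings seen so far.  Once every element of the finite target has
# appeared, the hypothesis sequence is constant -- learning in the limit.

def limit_learner(text):
    seen = set()
    hypotheses = []
    for s in text:
        seen.add(s)
        hypotheses.append(frozenset(seen))
    return hypotheses

target = {"ab", "ba", "aa"}
text = ["ab", "ba", "ab", "aa", "ba", "aa"]  # eventually exhausts target
hyps = limit_learner(text)
print(hyps[-1] == frozenset(target))  # → True
```

Here convergence is syntactic: after the fourth string of the text, every later hypothesis is literally the same set.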
Head et al. study the learnability of subclasses of regular languages. While the whole class of regular languages is known to be non-inferable from positive data, certain subclasses are inferable. In particular, the authors pinpoint the common, essential properties of the previously unrelated frameworks of reversible languages and locally testable languages by defining and studying equivalence relations over the states of finite automata. If these are appropriately defined, both language classes emerge. The learnability result is based on Angluin's characteristic sample technique, and all definable classes are shown to be learnable. Finally, the authors show that even the whole class of regular languages is approximately identifiable using any of the defined classes as hypothesis space. That is, instead of synthesizing an acceptor correctly describing the target language, the best fit from the hypothesis space is learned.
Case and Jain consider indexed families of uniformly recursive languages, i.e., language classes that are recursively enumerable in such a way that the membership problem is uniformly decidable for all languages enumerated. Now, given as input any index (or a program) generating any such class, they address the problem of whether or not a learner, if there is any, can be algorithmically synthesized that learns the whole class even from noisy data. Noisy data are defined along the lines of the model introduced by Stephan in his ALT'95 paper. Roughly speaking, in this model correct data occur infinitely often while incorrect data are presented only finitely many times. The new restriction made is that the noisy data are computable. The main positive result obtained is very strong: grammars for each indexed family can be learned from computable noisy positive data under the criterion of semantic convergence, and these learners can all be synthesized. Thus, there is a huge gap between what can be learned in a completely computable universe and what can be learned from arbitrary positive data. In particular, these results show that additional background knowledge can considerably enhance the learning capabilities.
In a sense, Stephan and Ventsov study the same problem, though in a different context, i.e., whether or not background knowledge (here called semantical knowledge) may help in learning. Now, the language classes are defined via algebraic structures (e.g., monoids, ideals of a given ring, vector spaces), and the background knowledge is provided in the form of programs for the underlying algebraic operations. What is shown is that such background knowledge can improve both the overall learning power and the efficiency of learners (measured in the number of mind changes performed). Finally, a purely algebraic notion is characterized in purely learning-theoretic terms: a recursive commutative ring is Noetherian iff the class of its ideals is behaviorally correctly learnable from positive data.
But there are more ways to attack the problem of how additional knowledge may help. In her ALT'95 paper, Meyer observed that, in the setting of learning indexed families from positive data, probabilistic learning under monotonicity constraints is more powerful than deterministic learning. A probabilistic learner is allowed to flip a coin each time it reads a new example, and to branch its computation depending on the outcome of the coin flip. The monotonicity constraints formalize different versions of how to realize the subset principle to avoid overgeneralization, and these formalizations go in part back to Jantke's paper at the very first ALT meeting in 1990. This year, Meyer asks what knowledge is necessary to compensate for the additional power of probabilistic learners. Now, knowledge is provided in the form of oracles, and instead of flipping a coin, the deterministic learner may ask the oracle A a membership query, i.e., ``x in A?,'' where x depends on the examples received so far. To give a flavor of the results obtained, we mention just two. First, if probabilistic learners under monotonicity constraints are required to learn with probability p > 2/3 (respectively p > 1/2, depending on the particular type of monotonicity constraint), then the oracle for the halting problem suffices to compensate for the power of probabilistic learning. Second, these bounds are tight, i.e., if p = 2/3 (respectively p = 1/2), then the oracle for the halting problem is too weak.
Certain classes of languages are not inferable from positive data, among them the regular languages and every superclass thereof. This result goes back to Gold's (1967) paper, where he showed that no language class containing at least one infinite language and all finite languages is learnable from positive data. Thus, one might think that there are no interesting language classes at all that can be inferred from positive data. However, this impression is misleading. The most prominent counterexample is the class of pattern languages introduced by Angluin in 1980. Patterns are a very natural and convenient way to define languages. Take some constant symbols and symbols for variables. Then every finite non-null string over these symbols constitutes a pattern, e.g., ax1bbx1. The language generated by a pattern p is the set of all strings that can be obtained by substituting constant strings for the variables occurring in p. A pattern is said to be regular if every variable symbol occurs at most once in it.
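For illustration, membership in a pattern language (with nonempty constant strings substituted for variables, as in Angluin's non-erasing patterns) can be tested by compiling the pattern into a regular expression with backreferences. The helper below and its token format are our own illustrative choices, not part of any paper discussed here.

```python
import re

# Compile a pattern, given as a list of tokens (one-character
# constants and variables 'x1', 'x2', ...), into a regex in which
# each variable's first occurrence captures a nonempty string and
# every later occurrence is a backreference to that capture.

def pattern_to_regex(tokens):
    groups = {}
    parts = []
    for t in tokens:
        if t.startswith("x"):
            if t in groups:
                parts.append(r"\%d" % groups[t])  # repeated variable
            else:
                groups[t] = len(groups) + 1
                parts.append("(.+)")              # first occurrence
        else:
            parts.append(re.escape(t))
    return re.compile("".join(parts) + "$")

# the pattern ax1bbx1 from the text
p = pattern_to_regex(["a", "x1", "b", "b", "x1"])
print(bool(p.match("acbbc")))  # → True  (x1 -> "c")
print(bool(p.match("acbbd")))  # → False (both occurrences of x1 must agree)
```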
Sato et al. study the learnability of the language class RPk obtained by taking unions of at most k regular pattern languages, where k is a priori fixed. This class has been shown to be learnable from positive data by Wright in 1989. Thus, there are characteristic samples for all these languages, i.e., for every language L in RPk there is a finite subset S of L such that, for every L' in RPk, if S is a subset of L' then L is a subset of L'. These characteristic samples play an important role in the design of learning algorithms. The present paper shows that there is a simple way to obtain such characteristic samples by taking all substitutions of size 1 (or 2), provided there are at least 2k+1 (respectively 2k-1) constant symbols.
Sakamoto studies versions of the consistency problem for one-variable pattern languages that may be interesting when learning from noisy data is required. The problem is: given sets of positive and negative examples, does there exist a one-variable pattern generating all positive examples and none of the negative ones? The new idea introduced is that the given strings may contain wild cards. In particular, he shows the consistency problem to be NP-complete provided the pattern must separate the sets of positive and negative examples for all possible replacements of the wild cards by constant symbols.
In all the papers mentioned above, the learner has been required to behave appropriately when getting data for the languages to be learned. This also implied that some prior knowledge is provided in the form of a suitable hypothesis space. However, looking at applications such as data mining, it is well conceivable that we do not have such prior knowledge. Thus, one may just guess a hypothesis space. But then the question arises of how the learner behaves when getting data for a target that has no representation within the guessed hypothesis space. This question is meaningful as long as the hypothesis space guessed is not an acceptable programming system but rather, let's say, an indexed family. The idea that the learner must be able to refute the hypothesis space was introduced by Mukouchi and Arikawa in their 1993 ALT paper. Subsequently, Lange and Watson (ALT'94) modified this approach by requesting that the learner must be able to refute initial segments of data sequences that do not correspond to any language in the hypothesis space. This year, Jain continues along this line of research. Instead of looking at indexed families as in previous work, he considers general classes of recursively enumerable languages. Now, allowing the class of all computer programs as hypothesis space, one can still insist on refuting all initial segments of texts (informants) that do not correspond to any language in the class under consideration. Alternatively, one may allow the learner to either refute or identify them. Finally, one may require the learner to refute only those initial segments of texts (informants) that it cannot learn. Surprisingly, the latter approach is the most powerful one, and, for learning from text, it also achieves the whole learning power of learning in the limit from positive data.
There is one more paper dealing with language learning, but it is different from all the papers mentioned above in that it uses queries to gain information about the target objects to be learned. Therefore, we discuss it in the next section.
Learning via Queries
We can imagine the learning via queries scenario as an interaction between a teacher and a learner that communicate in some prespecified query language. For example, when learning the concept of a chair, the learner may ask ``is a sofa an example of a chair?,'' or, when learning a language, a query may be of the form ``is w a string of the target language?'' This type of question is referred to as a membership query. Alternatively, one can allow the learner to ask ``is G a grammar for the target language?'' The latter type of question is called an equivalence query, and it is easy to see how to generalize it to any learning domain. Clearly, a positive answer to an equivalence query directly yields the information that the target has been learned correctly, and the learner can (and has to) stop. If the answer is negative, usually a counterexample is returned, too, i.e., an element from the symmetric difference of the target and the object described in the equivalence query. In general, whatever the query language is, the learner is now required to halt and to output a correct description of the target. Moreover, it is also easy to see that every indexed family can be learned from equivalence queries alone. However, this may require a huge number of queries, and may thus be beyond feasibility. Therefore, within the learning via queries scenario, the query complexity is usually of major concern. What one usually wants is that the overall number of queries asked is polynomially bounded in the length of the target and the longest counterexample returned.
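As a concrete instance of polynomial query complexity, the classical algorithm for exactly learning monotone conjunctions from equivalence queries alone can be sketched as follows. The oracle protocol shown is a simplified illustrative stand-in for a teacher, not taken from any of the papers discussed.

```python
# Exact learning of monotone conjunctions over n Boolean variables
# from equivalence queries: start with the conjunction of all
# variables and, on each (necessarily positive) counterexample,
# drop the variables it sets to 0.  At most n+1 queries are needed.

def learn_monotone_conjunction(n, equivalence_query):
    hyp = set(range(n))  # conjunction of all n variables
    while True:
        answer = equivalence_query(hyp)
        if answer is None:          # "yes": hypothesis is correct
            return hyp
        x = answer                  # counterexample: a 0/1 vector that
        # satisfies the target but is rejected by the larger hypothesis
        hyp = {i for i in hyp if x[i] == 1}

target = {0, 2}                     # the conjunction v0 AND v2

def oracle(hyp):
    """Toy teacher: answers an equivalence query on hyp."""
    if hyp == target:
        return None
    # return an assignment satisfying the target but falsified by hyp
    return [1 if i in target else 0 for i in range(4)]

print(learn_monotone_conjunction(4, oracle))  # → {0, 2}
```

Since the hypothesis only ever shrinks and always remains a superset of the target, every counterexample removes at least one variable, which gives the polynomial query bound.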
Melideo and Varricchio consider the learnability of unary output two-tape automata from equivalence and multiplicity queries. In order to understand what a multiplicity query is, we first have to explain what an automaton with multiplicity is. Fix any field K, and consider a non-deterministic automaton that assigns weights (i.e., elements of K) to each initial state, each final state, and each edge. Such an automaton can be viewed as computing a function that maps strings to elements of K, and a multiplicity query returns the value of this function for a given string. Now, the authors show that the behavior of a unary output two-tape automaton can be identified with the behavior of a suitably defined automaton with multiplicity. Thus, the original learning problem is reduced to that of learning the resulting automaton with multiplicity. The learner provided has a query complexity that is polynomially bounded in the size of the automaton with multiplicity.
Fischlin asks whether or not learning from membership queries can be sped up by parallelizing it. Defining the depth of a query q to be the number of other queries on which q depends, and the query depth of a learning algorithm to be the maximum query depth taken over all queries made, the problem of whether or not a query learner can be parallelized is then equivalent to asking whether or not the query depth can be reduced. Assuming the existence of cryptographic one-way functions, Fischlin proves the following strong result: for any fixed polynomial d, there is a concept class Cn that is efficiently query learnable from membership queries alone in query depth d(n)+1, but Cn cannot be weakly predicted from membership and equivalence queries in depth d(n).
Damaschke provides a positive result concerning the parallelizability of learning Boolean functions that depend on only a few variables. Moreover, he is not only looking at the overall number of queries but also at the resulting overall complexity of learning.
Another lower bound for the overall number of membership queries is given by Shevchenko and Zolotykh. They consider the problem of learning half-spaces over finite subsets of the n-dimensional Euclidean vector space. This is done by carefully elaborating the structure of so-called teaching sets for half-spaces, i.e., sets such that only one half-space agrees with all points in them. The lower bound obtained is close to the best known upper bound.
Ben-David and Lindenbaum (cf. EuroCOLT'95, LNAI 1208) proposed several models to adapt the idea of learning from positive data only (that has attracted so much attention in language learning) to the PAC model. Denis is continuing along this line of research. He defines an appropriate PAC model and shows that extra information concerning the underlying distribution must be provided. This information is obtained via statistical queries. A couple of concept classes are shown to be learnable within the model defined, e.g. k-DNF and k-decision lists.
Learning Recursive Functions
The area of learning recursive functions is traditionally well represented in the ALT series. The information given to the learner usually consists of growing initial segments of the sequence f(0), f(1), f(2), ... of values of the function f to be learned. Admissible hypothesis spaces are recursive enumerations of partial recursive functions that comprise the target class. Again, the learner has to output a sequence of hypotheses about the target function, i.e., indices or encodings of computer programs. The learning goal consists in identifying in the limit an index in the relevant enumeration such that the enumerated function correctly computes the target with respect to the correctness criterion introduced. Starting with Gold's and Putnam's pioneering work, this learning model has been widely studied. Many variations of this basic setting have been considered, and the present volume provides further specifications.
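The classical technique behind many positive results in this model is Gold's identification by enumeration: conjecture the least index in the enumeration whose function is consistent with all values seen so far. Below is a toy sketch over a finite hypothesis space; this is an illustrative simplification, since real hypothesis spaces are infinite recursive enumerations.

```python
# Identification by enumeration: at each step, output the least index
# whose function agrees with every value of the target seen so far.
# If the enumeration consists of total functions and contains the
# target, the conjecture sequence converges to a correct index.

def identify_by_enumeration(space, values):
    """space: list of total functions (a toy, finite hypothesis space).
    values: f(0), f(1), ... for the target f.  Yields conjectured indices."""
    for t in range(1, len(values) + 1):
        data = values[:t]
        for i, g in enumerate(space):
            if all(g(x) == data[x] for x in range(t)):
                yield i
                break

space = [lambda x: 0, lambda x: x, lambda x: x * x, lambda x: 2 * x]
target = lambda x: x * x
values = [target(x) for x in range(6)]  # 0, 1, 4, 9, 16, 25
print(list(identify_by_enumeration(space, values)))  # → [0, 1, 2, 2, 2, 2]
```

After f(2) = 4 rules out the earlier candidates, the learner converges to index 2 and never changes its mind again.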
Suppose you have a learner for a class U1 and another for a class U2. Now, it would be nice to have a more powerful learner that can identify U1 and U2 simultaneously. However, learning in the limit is not closed under union, and this fact led Apsitis et al. to investigate the following refined version of closedness under union. Assume you have classes U1, ..., Un, each of which is learnable in the limit. What can be said about the learnability of the union of all these classes, provided that every union of at most n-1 of them is learnable in the limit? Clearly, the answer may depend on n, since for n = 2 the answer is no, as mentioned above. Therefore, more precisely, one has to ask whether or not there exists an n such that the union of all such classes is always learnable. The minimal such n is referred to as the closedness degree, and the authors determine this degree for a large number of learning types.
A natural variation of the basic scenario described above is prediction. Instead of outputting hypotheses about the target function, the learner keeps its guesses to itself; learning success is instead measured by its ability to predict the function's values on inputs it has not seen before. That is, beyond some point the learner must predict correctly. Again, what is meant by correctly depends on the correctness criterion considered. Examples of correctness criteria include: always correct, correct on all but a finite number of inputs (prespecified or not), correct on a certain fraction of the inputs, and so on. Case et al. consider this model for the case that the targets may drift over time. While similar questions have been addressed within the PAC model, this is the first paper that studies concept drift in the more general, recursion-theoretic setting. Different versions are proposed and related to one another. Moreover, the authors also analyze the learnability of some natural concept classes within their models. This is a nice combination of abstract and concrete examples.
Whenever learning in the limit is concerned, one usually cannot decide whether or not the learner has already converged to its final hypothesis for the actual target. Thus, it seems only natural to demand that the learner correctly reflect all the data already seen. Learners behaving this way are said to be consistent. At first glance, it may even seem useless to allow a learner to output inconsistent hypotheses. Nevertheless, consistency is a severe restriction, as has been shown elsewhere. In his paper, Stein investigates how the demand to learn consistently may affect the complexity of learning. That is, he shows that for every recursive bound there is a class that can be learned inconsistently with polynomial update time, but every consistent learner for it needs an update time above the given recursive bound.
Case, Ott et al. generalize the scenario described above by looking at learning an infinite branch of a computable tree. Thus the concept class considered is the set of all computable trees that contain at least one infinite computable branch. This work derives its motivation from process-control games.
Hirowatari and Arikawa extend the classical model to learning recursive real-valued functions. These functions are regarded as computable interval mappings. Both coincidences with and surprising differences from the learnability of recursive functions over the natural numbers are shown. In particular, the differences are established with respect to recursively enumerable classes and consistent identification. This work considerably extends their results presented at ALT'97.
Barzdins and Sarkans continue their work (with varying coauthors) presented at previous ALT meetings to design practically feasible inference algorithms. The new feature consists in using attribute grammars to express several kinds of additional knowledge about the objects to be learned.
We finish this section with the paper by Grieser et al., which looks at the learnability of recursive functions from quite a different perspective. The main problem studied is the validation of inductive learning systems. The authors propose a model for the validation task, and relate the amount of expertise necessary to validate a learning system to the amount of expertise needed for solving the learning problems considered. Within the model introduced, the ability to validate a learning system implies the ability to solve the underlying learning problem.
Arimura et al. study the data mining problem of discovering two-word association rules in large collections of unstructured texts. They present very efficient algorithms solving this task. Nice applications are outlined, and the algorithms have been implemented and tested on a huge database of amino acid sequences.
Schmitt investigates the sample complexity of learning neural trees. A neural tree is a feedforward neural network with at most one edge outgoing from each node. The paper relates the sample complexity to the VC dimension of classes of neural trees. The main result is a lower bound of order n log n for the sample complexity of neural trees with n inputs.