Learning on the Web
(invited lecture for ALT and DS 2009)
Author: Fernando C. N. Pereira
Affiliation: Google Inc. (Mountain View, USA)
Abstract.
It is commonplace to say that the Web has changed everything. Machine
learning researchers often say that their projects and results respond
to that change with better methods for finding and organizing Web
information. However, not much of the theory, or even the current
practice, of machine learning takes the Web seriously. We continue to
devote much effort to refining supervised learning, but the Web
reality is that labeled data is hard to obtain, while unlabeled data
is inexhaustible. We cling to the iid assumption, while all the Web
data generation processes drift rapidly and involve many hidden
correlations. Much of our theory and many of our algorithms assume data
representations of fixed dimension, while in fact the dimensionality
of data, for example the number of distinct words in text, grows with
data size. While there has been much recent work on learning with
sparse representations, the actual patterns of sparsity on the Web have
received little attention. Those patterns might be very relevant to the
communication costs of distributed learning algorithms, which are
necessary at Web scale, but little work has been done on this.
Nevertheless, practical machine learning is thriving on the
Web. Statistical machine translation has developed non-parametric
algorithms that learn how to translate by mining the ever-growing
volume of source documents and their translations that are created on
the Web. Unsupervised learning methods infer useful latent semantic
structure from the statistics of term co-occurrences in Web
documents. Image search achieves improved ranking by learning from
user responses to search results. In all those cases, Web scale
demanded distributed algorithms.
I will review some of those practical successes to try to convince you
that they are not just engineering feats, but also rich sources of new
fundamental questions that we should be investigating.
©Copyright 2009 Author