The Two Faces of Active Learning
(invited lecture for ALT 2009)
Author: Sanjoy Dasgupta
Affiliation:
Department of Computer Science and Engineering,
University of California, San Diego, USA
Abstract.
The active learning model is motivated by scenarios in which it is
easy to amass vast quantities of unlabeled data (images and videos
off the web, speech signals from microphone recordings, and so on)
but costly to obtain their labels. As in supervised learning, the
goal is ultimately to learn a classifier; but as in unsupervised
learning, the data come unlabeled. More precisely, the labels are
hidden, and each of them can be revealed only at a cost. The idea
is to query the labels of just a few points that are especially
informative about the decision boundary, and thereby to obtain an
accurate classifier at significantly lower cost than regular
supervised learning.
There are two distinct narratives for explaining when active
learning is helpful. The first has to do with efficient search
through the hypothesis space: perhaps one can always explicitly
select query points whose labels will significantly shrink the set
of plausible classifiers (those roughly consistent with the labels
seen so far)? The second argument for active learning has to do
with exploiting cluster structure in data. Suppose, for instance,
that the unlabeled points form five nice clusters; with luck, these
clusters will be "pure" and only five labels will be necessary!
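As a toy illustration of this cluster intuition (my sketch, not an algorithm from the lecture; all names are hypothetical), suppose we already have a clustering of the unlabeled points and the clusters happen to be pure. Then one label query per cluster, propagated to every point in that cluster, labels the whole data set:

```python
# Hypothetical sketch of cluster-based active learning: if the
# unlabeled data falls into pure clusters, one label query per
# cluster suffices to label every point.

def label_by_cluster(points, cluster_of, oracle):
    """Query one representative per cluster, propagate its label."""
    rep_label = {}   # cluster id -> label of its queried representative
    labels = []
    queries = 0
    for x in points:
        c = cluster_of(x)
        if c not in rep_label:
            rep_label[c] = oracle(x)   # one query per new cluster
            queries += 1
        labels.append(rep_label[c])
    return labels, queries

# Five well-separated 1-D clusters centered at 0, 10, 20, 30, 40.
pts = [c * 10 + d for c in range(5) for d in (-0.5, 0.0, 0.5)]
cluster = lambda x: round(x / 10)
oracle = lambda x: +1 if x >= 15 else -1   # constant within each cluster
labels, q = label_by_cluster(pts, cluster, oracle)
# q == 5: five queries label all fifteen points correctly.
```

Of course, real clusters are rarely pure, which is exactly the optimism the next sentence tempers.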
Both these scenarios are hopelessly optimistic. But I will show
that they each motivate realistic models that can effectively be
exploited by active learning algorithms. These algorithms have
provable label complexity bounds that are in some cases
exponentially lower than for supervised learning. I will also
present experiments with these algorithms, to illustrate their
behavior and get a sense of the gulf that still exists between the
theory and practice of active learning.
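The exponential gap in label complexity is easiest to see in a toy one-dimensional setting (my illustration, not taken from the lecture): for threshold classifiers on the line, each query can halve the set of plausible thresholds, so active learning reduces to binary search and needs roughly log2(n) labels where passive supervised learning would label all n points:

```python
# Toy illustration: learning a 1-D threshold classifier actively.
# Labels are -1 left of an unknown threshold and +1 at or right of it.
# Binary search pins down the boundary among n sorted candidate
# points with about log2(n) label queries instead of n.

def active_threshold(points, oracle):
    """Binary-search for the leftmost +1 point; points must be sorted."""
    lo, hi = 0, len(points) - 1
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(points[mid]) < 0:   # still left of the threshold
            lo = mid + 1
        else:                          # at or right of the threshold
            hi = mid
    return points[lo], queries

# Hidden threshold at 0.73 over 1024 evenly spaced points.
pts = [i / 1024 for i in range(1024)]
oracle = lambda x: +1 if x >= 0.73 else -1
boundary, used = active_threshold(pts, oracle)
# used is 10 (= log2(1024)) queries, versus 1024 labels passively.
```

This is precisely the "efficient search through the hypothesis space" narrative in its cleanest form; the hard part, addressed in the lecture, is recovering some of this advantage for realistic hypothesis classes and noisy labels.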
This is joint work with Alina Beygelzimer, Daniel Hsu, John
Langford, and Claire Monteleoni.
Biography:
Sanjoy Dasgupta is an Associate Professor in the Department of
Computer Science and Engineering at UC San Diego. He received his
PhD from Berkeley in 2000, and spent two years at AT&T Research
Labs before joining UCSD.
His area of research is algorithmic statistics, with a focus on
unsupervised and minimally supervised learning. He recently
completed a textbook, "Algorithms" (with Christos Papadimitriou and
Umesh Vazirani).
©Copyright 2009 Author