Building an Automated Statistician
(joint invited lecture for ALT/DS 2014)
Author: Zoubin Ghahramani
Affiliation:
Department of Engineering,
University of Cambridge, UK
Abstract.
We live in an era of abundant data and there is an increasing need
for methods to automate data analysis and statistics. I will describe
the "Automated Statistician", a project which aims to automate the
exploratory analysis and modelling of data. Our approach starts by
defining a large space of related probabilistic models via a grammar
over models, and then uses Bayesian marginal likelihood computations
to search over this space for one or a few good models of the
data. The aim is to find models which both have good predictive
performance and are somewhat interpretable. Our initial work has
focused on the learning of unknown nonparametric regression functions,
and on learning models of time series data, both using Gaussian
processes. Once a good model has been found, the Automated
Statistician generates a natural language summary of the analysis,
producing a 10-15 page report with plots and tables describing the
analysis. I will discuss challenges such as: how to trade off
predictive performance and interpretability, how to translate complex
statistical concepts into natural language text that is understandable
by a numerate non-statistician, and how to integrate model
checking. This is joint work with James Lloyd and David Duvenaud
(Cambridge) and Roger Grosse and Josh Tenenbaum (MIT).
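As a rough illustration of the search described above, the following
Python sketch builds a tiny grammar over Gaussian process kernels and
greedily keeps the expression with the highest log marginal likelihood.
It is not the Automated Statistician's implementation: the three base
kernels, the fixed hyperparameters, the noise level, and the names BASE,
expand and greedy_search are all assumptions made for this example. A
real system would, among other things, optimise the hyperparameters of
each candidate model.

```python
import numpy as np

# Base kernels with fixed, hand-picked hyperparameters (an assumption made
# to keep the sketch short; they are not fitted to the data here).
def se(x1, x2, ell=1.0):
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def per(x1, x2, period=1.0, ell=1.0):
    d = x1[:, None] - x2[None, :]
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / ell ** 2)

def lin(x1, x2):
    return x1[:, None] * x2[None, :]

BASE = {"SE": se, "PER": per, "LIN": lin}

def evaluate(expr, x):
    """Build the kernel matrix for an expression.

    An expression is either a base kernel name or a tuple (op, left, right)
    with op equal to "+" or "*".
    """
    if isinstance(expr, str):
        return BASE[expr](x, x)
    op, left, right = expr
    k_left, k_right = evaluate(left, x), evaluate(right, x)
    return k_left + k_right if op == "+" else k_left * k_right

def log_marginal_likelihood(K, y, noise=0.1):
    """Log marginal likelihood of a zero-mean GP with Gaussian noise."""
    n = len(y)
    L = np.linalg.cholesky(K + noise ** 2 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * n * np.log(2.0 * np.pi))

def expand(expr):
    """Grammar rules S -> S + B and S -> S * B for every base kernel B."""
    return [(op, expr, b) for op in "+*" for b in BASE]

def greedy_search(x, y, depth=2):
    """Greedy search over kernel expressions, scored by marginal likelihood."""
    def score(expr):
        return log_marginal_likelihood(evaluate(expr, x), y)

    best = max(BASE, key=score)
    for _ in range(depth):
        candidate = max(expand(best), key=score)
        if score(candidate) <= score(best):
            break  # no expansion improves the score
        best = candidate
    return best, score(best)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 4.0, 40)
    y = np.sin(2.0 * np.pi * x) + 0.5 * x + 0.1 * rng.standard_normal(40)
    print(greedy_search(x, y))
```

Nested tuples are used for the expressions so that the chosen model
remains a readable composition of named parts, which is the kind of
structure a natural language report could be generated from.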
Bio.
Zoubin Ghahramani is Professor of Information Engineering at the
University of Cambridge, where he leads a group of about 30
researchers. He studied computer science and cognitive science at the
University of Pennsylvania, obtained his PhD from MIT in 1995, and
was a postdoctoral fellow at the University of Toronto. His academic
career includes concurrent appointments as one of the founding
members of the Gatsby Computational Neuroscience Unit in London, and
as a faculty member of CMU's Machine Learning Department for over 10
years. His current research focuses on nonparametric Bayesian
modelling and statistical machine learning. He has also worked on
applications to bioinformatics, econometrics, and a variety of
large-scale data modelling problems. He has published over 200
papers, receiving 25,000 citations (an h-index of 68). His work has
been funded by grants and donations from EPSRC, DARPA, Microsoft,
Google, Infosys, Facebook, Amazon, FX Concepts and a number of other
industrial partners. In 2013, he received a $750,000 Google Award for
research on building the Automatic Statistician. He serves on the
advisory boards of Opera Solutions and Microsoft Research Cambridge,
on the Steering Committee of the Cambridge Big Data Initiative, and
in a number of leadership roles as programme and general chair of the
leading international conferences in machine learning: AISTATS
(2005), ICML (2007, 2011), and NIPS (2013, 2014).
More information can be found
at http://mlg.eng.cam.ac.uk.
©Copyright Author