Authors: Doina Caragea, Jun Zhang, Jie Bao, Jyotishman Pathak, and
Artificial Intelligence Research Laboratory,
Center for Computational Intelligence, Learning, and Discovery,
Department of Computer Science,
Iowa State University, Ames, Iowa, U.S.A.
Source: Algorithmic Learning Theory, 16th International Conference,
ALT 2005, Singapore, October 2005, Proceedings,
(Sanjay Jain, Hans Ulrich Simon and Etsuji Tomita, Eds.),
Lecture Notes in Artificial Intelligence 3734, pp. 13-44, Springer 2005.
The development of high-throughput data acquisition technologies, together with
advances in computing and communications, has resulted in explosive growth in the number,
size, and diversity of potentially useful information sources. This growth has created
unprecedented opportunities in data-driven knowledge acquisition and decision-making
in a number of emerging, increasingly data-rich application domains such as
bioinformatics, environmental informatics, enterprise informatics, and social
informatics (among others). However, the massive size, semantic heterogeneity,
autonomy, and distributed nature of the data repositories present significant
hurdles in acquiring useful knowledge from the available data. This paper
introduces some of the algorithmic and statistical problems that arise in such
a setting and describes algorithms for learning classifiers from distributed data
that offer rigorous performance guarantees (relative to their centralized or
batch counterparts). It also describes how this approach can be extended to
work with autonomous, and hence inevitably semantically heterogeneous, data
sources by making explicit the ontologies (attributes and relationships
between attributes) associated with the data sources and reconciling the
semantic differences among the data sources from a user's point of view.
This allows user- or context-dependent exploration of semantically heterogeneous
data sources. The resulting algorithms have been implemented in INDUS, an
open source software package for collaborative discovery from autonomous,
semantically heterogeneous, distributed data sources.
Much of this work has been carried out in
collaboration with members of the ISU Artificial Intelligence Research
Laboratory and has been supported in part by Iowa State University and
grants from the National Science Foundation (IIS 0219699) and the
National Institutes of Health (GM 0066387).
© Copyright 2005 Springer