Clustering the Google Distance with Eigenvectors and Semidefinite Programming

Authors: Jan Poland and Thomas Zeugmann

Source: Knowledge Media Technologies, First International Core-to-Core Workshop, Dagstuhl, July 23-27, 2006, Germany, (Klaus P. Jantke & Gunther Kreuzberger, Eds., Diskussionsbeiträge, Institut fü Medien und Kommunikationswisschaft, Technische Universität Ilmenau No. 21, pp. 61 - 69, July 2006.
ISSN 1617-9048

Abstract. Web mining techniques are becoming increasingly popular and more accurate, as the information body of the World Wide Web grows and reflects a more and more comprehensive picture of the humans' view of the world. One simple web mining tool is called the Google distance and has been recently suggested by Cilibrasi and Vitányi. It is an information distance between two terms in natural language, and can be derived from the “similarity metric”, which is defined in the context of Kolmogorov complexity. The Google distance can be calculated from just counting how often the terms occur in the web (page counts), e.g. using the Google search engine. In this work, we compare two clustering methods for quickly and fully automatically decomposing a list of terms into semantically related groups: Spectral clustering and clustering by semidefinite programming.

©Copyright 2006, Authors