Clustering the Normalized Compression Distance for Virus Data

Authors: Kimihito Ito, Thomas Zeugmann¹, and Yu Zhu

Source: Proceedings of the Sixth Workshop on Learning with Logics and Logics for Learning (LLLL 2009), Kyodai Kaikan, Kyoto, Japan, June 6-7, 2009, pp. 56-67.

Abstract. The present paper analyzes how useful the normalized compression distance is for clustering virus HA gene sequences, and how the results depend on the compressor used. Using the CompLearn Toolkit, the built-in compressors zlib and bzip are compared.
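
To make the distance concrete, the following is a minimal sketch of the normalized compression distance in its usual form, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. It uses Python's standard zlib and bz2 modules as stand-ins for the CompLearn built-in compressors; it is not the CompLearn implementation, and the helper names and toy sequences are assumptions made for illustration only.

```python
import bz2
import random
import zlib


def compressed_size(data: bytes, compressor: str = "zlib") -> int:
    """Return the length in bytes of data after compression."""
    if compressor == "zlib":
        return len(zlib.compress(data, 9))
    if compressor == "bzip2":
        return len(bz2.compress(data, 9))
    raise ValueError(f"unknown compressor: {compressor}")


def ncd(x: bytes, y: bytes, compressor: str = "zlib") -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x, compressor)
    cy = compressed_size(y, compressor)
    cxy = compressed_size(x + y, compressor)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Made-up toy data, NOT real HA sequences.
seq_a = b"ATGAAGGCAATACTAGTAGTTCTGCTATATACATTTGCAACCGCAAATGCA" * 20
rng = random.Random(0)  # fixed seed so the example is reproducible
seq_c = "".join(rng.choice("ACGT") for _ in range(len(seq_a))).encode()

for name in ("zlib", "bzip2"):
    same = ncd(seq_a, seq_a, name)
    different = ncd(seq_a, seq_c, name)
    print(f"{name}: NCD(a, a) = {same:.3f}, NCD(a, c) = {different:.3f}")
```

A sequence compared with itself should come out much closer to 0 than a sequence compared with unrelated data. The exact values depend on the compressor, which is the dependence the paper studies.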

Moreover, hierarchical and spectral clustering are compared. For the hierarchical clustering, hclust from the R package is used, and the spectral clustering is done via the kLines algorithm proposed by Fischer and Poland (2004).
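
As a rough illustration of the hierarchical step, the sketch below performs naive bottom-up clustering on a precomputed distance matrix. It uses complete linkage, which is the default method of R's hclust, but it is a simplified stand-in and not hclust itself. The function name and the toy distance matrix are assumptions invented for this example; in practice the matrix would hold pairwise NCD values.

```python
def complete_linkage(dist, k):
    """Merge clusters bottom-up until k clusters remain.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: the distance between two clusters is
                # the largest distance between any pair of their members.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair of clusters.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Made-up distances for four items: 0 and 1 are close, 2 and 3 are close.
toy = [
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.95],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.95, 0.20, 0.00],
]
groups = complete_linkage(toy, 2)
print(groups)  # [[0, 1], [2, 3]]
```

This version recomputes every linkage on each pass, which is fine for a handful of items but slow for large data sets.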

Our results are very promising and show that one can obtain an (almost) perfect clustering. The zlib compressor gave better results than the bzip compressor, and when all data are considered, hierarchical clustering is slightly better than spectral clustering via kLines.

¹ Supported by MEXT Grant-in-Aid for Scientific Research on Priority Areas under Grant No. 21013001.
©Copyright 2009, Authors