TCS-TR-B-10-8

Date: Wed Dec 22 17:07:43 2010

Title: Master's Thesis: Virus Data Clustering based on Kolmogorov Complexity

Authors: Yu Zhu

Contact:

  • First name: Yu
  • Last name: Zhu
  • Address: Graduate School of Information Science and Technology, Division of Computer Science, Hokkaido University
  • Email: zhuyu07@ist.hokudai.ac.jp

Abstract. In this paper, we focus on a simple data mining method called the Normalized Compression Distance (NCD), which was suggested by Cilibrasi and Vitányi. Using this method, we analyzed virus data, namely sequences of the HA gene, with the available compressors. The built-in compressors zlib and bzip are compared using the CompLearn Toolkit, and hierarchical clustering is compared with spectral clustering. Our results show that one can obtain an (almost) perfect clustering. The zlib compressor yielded better results than the bzip compressor, and hierarchical clustering was slightly better than spectral clustering when all data are considered.
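For readers unfamiliar with the NCD, the following is a minimal sketch of how it can be computed with a real-world compressor. It uses the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed length of s. It relies on Python's standard zlib and bz2 modules and does not use the CompLearn Toolkit. The function names, the helper random_dna, and the toy sequences are assumptions made for illustration only; they are not HA data and not the thesis's implementation.

```python
import bz2
import random
import zlib


def compressed_size(data: bytes, compressor: str = "zlib") -> int:
    """Return the length in bytes of `data` after compression."""
    if compressor == "zlib":
        return len(zlib.compress(data, 9))
    if compressor == "bz2":
        return len(bz2.compress(data, 9))
    raise ValueError(f"unknown compressor: {compressor}")


def ncd(x: bytes, y: bytes, compressor: str = "zlib") -> float:
    """Normalized Compression Distance between two byte strings."""
    cx = compressed_size(x, compressor)
    cy = compressed_size(y, compressor)
    cxy = compressed_size(x + y, compressor)  # C(xy): compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def random_dna(length: int, rng: random.Random) -> bytes:
    """Hypothetical helper: a synthetic nucleotide string (not real virus data)."""
    return "".join(rng.choice("ACGT") for _ in range(length)).encode("ascii")


rng = random.Random(0)
seq_a = random_dna(1000, rng)

# seq_b: a copy of seq_a with ten point changes, standing in for a close relative.
mutated = bytearray(seq_a)
for pos in rng.sample(range(len(mutated)), 10):
    mutated[pos] = ord(rng.choice("ACGT"))
seq_b = bytes(mutated)

# seq_c: an independent synthetic sequence, standing in for an unrelated strain.
seq_c = random_dna(1000, rng)

print("zlib  NCD(a, b):", ncd(seq_a, seq_b, "zlib"))
print("zlib  NCD(a, c):", ncd(seq_a, seq_c, "zlib"))
print("bzip2 NCD(a, c):", ncd(seq_a, seq_c, "bz2"))
```

In this sketch the near-identical pair should receive a smaller distance than the unrelated pair. A matrix of such pairwise distances is the kind of input that hierarchical or spectral clustering would then operate on.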


©Copyright 2010 Authors