TCS-TR-B-10-8
Date: Wed Dec 22 17:07:43 2010
Title: Master's Thesis: Virus Data Clustering based on Kolmogorov Complexity
Authors: Yu Zhu
Contact:
Abstract. In this paper, we focus on a simple data mining method called Normalized Compression Distance (NCD), which was suggested by Cilibrasi and Vitányi. Using this method, we analyzed virus data, namely sequences of the HA gene, with the available compressors. The built-in compressors zlib and bzip are compared using the CompLearn Toolkit, and hierarchical clustering is compared with spectral clustering. Our results show that one can obtain an (almost) perfect clustering. The zlib compressor gave better results than the bzip compressor, and hierarchical clustering is slightly better than spectral clustering when all data are considered.
©Copyright 2010 Authors
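For readers unfamiliar with the distance named in the abstract, the following is a minimal sketch of NCD as defined by Cilibrasi and Vitányi, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length. It is not the thesis's implementation: it assumes Python's standard `zlib` and `bz2` modules as stand-ins for the CompLearn built-in compressors, and the function names and toy sequences are hypothetical.

```python
import bz2
import zlib


def compressed_size(data: bytes, compressor: str = "zlib") -> int:
    """Return the length in bytes of `data` after compression."""
    if compressor == "zlib":
        return len(zlib.compress(data, 9))
    if compressor == "bz2":
        return len(bz2.compress(data, 9))
    raise ValueError(f"unknown compressor: {compressor!r}")


def ncd(x: bytes, y: bytes, compressor: str = "zlib") -> float:
    """Normalized Compression Distance between two byte strings."""
    cx = compressed_size(x, compressor)
    cy = compressed_size(y, compressor)
    cxy = compressed_size(x + y, compressor)  # C(xy): compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(seqs: list[bytes], compressor: str = "zlib") -> list[list[float]]:
    """Pairwise NCD matrix, the usual input to a clustering step."""
    return [[ncd(a, b, compressor) for b in seqs] for a in seqs]


if __name__ == "__main__":
    # Hypothetical toy sequences, NOT real HA gene data.
    seqs = [
        b"ATGAAGGCAATACTAGTAGTTCTGCTATATACATTTGCAACC" * 5,
        b"ATGAAGGCAATACTAGTAGTTCTGCTGTATACATTTGCAACC" * 5,
        b"GGCTTACGATCCGATAGCTAGGCTTAACGGATCCATTAGGCA" * 5,
    ]
    for comp in ("zlib", "bz2"):
        print(comp)
        for row in distance_matrix(seqs, comp):
            print("  ", [round(d, 3) for d in row])
```

The resulting matrix can be passed to any hierarchical or spectral clustering routine. Small values indicate that one sequence helps compress the other, which is the sense in which NCD approximates similarity.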