TCS-TR-A-06-20

Date: Wed Jul 19 01:25:12 2006

Title: Effcient Algorithms for Mining Maximal Flexible Patterns in Texts and Sequences

Authors: Hiroki Arimura and Takeaki Uno

Contact:

  • First name: Hiroki
  • Last name: Arimura
  • Address: Graduate School of Information Science and Technology, Hokkaido University Kita 14 Nishi 9, Sapporo 060-0814, Japan
  • Email: arim@ist.hokudai.ac.jp

Abstract. In this paper, we study the maximal pattern discovery problem in a given sequence for the class ERP of flexible patterns with applications to text mining, where a flexible pattern is a sequence of constant and wildcards for possibly empty strings such as AB*B*ABC, and also known as erasing regular patterns. We first discuss the framework of optimal pattern discovery for predictive mining and text classification, and then show its connection to maximal pattern discovery. Then, we introduce a new notion of maximality of patterns based on the position occurrences of patterns, called position-maximality. We present an efficient algorithm PosMaxFlexMotif that, given an input string of length n, enumerates all maximal patterns of ERP without duplicates in O(skmn^2) time per maximal pattern using O(mn) space, where s is the size of alphabet, m = |P| is the size of the pattern P to be enumerated, and k = O(m) is the number of variables in P. This implies as corollary that the position-maximal enumeration problem for flexible patterns is output-polynomial time solvable. Then, we apply the above result to maximal pattern discovery in terms of the maximality based on document occurrence as a sound pruning technique.


©Copyright 2006 Authors