TCS-TR-A-05-2

Date: Thu Feb 17 18:41:45 2005

Title: Reputation Extraction Using Both Structural and Content Information

Authors: H. Hasegawa, M. Kudo and A. Nakamura

Contact:

First name: Atsuyoshi
Last name: Nakamura
Address: Graduate School of Information Science and Technology, Hokkaido University, Sapporo 060-0814, Japan
Email: atsu@main.eng.hokudai.ac.jp

Abstract. We propose a new method for extracting texts related to a given keyword from Web pages collected by a search engine. By combining structural pattern matching with text classification, the method automatically extracts keyword-related texts, such as reputations of a given restaurant, from Web pages on sites that are not fixed in advance, which conventional wrappers cannot do. In cross-validation experiments on extracting reputations of a given Ramen shop from Web pages collected by a search engine, our method achieved 79.3% precision and 56.6% recall when acceptable errors were allowed.