IMPROVED FOCUSED CRAWLING USING BAYESIAN OBJECT BASED APPROACH

Document Type : Original Article

Authors

1 Dept. of Computer Science and Eng., Faculty of Electronic Engineering, Menoufya University, EGYPT

2 Dept. of Computer Science and Eng., Faculty of Electronic Engineering, Menoufya University, EGYPT

3 Eng. Dept., NCRRT, Atomic Energy Authority, EGYPT

Abstract

The rapid growth of the World Wide Web has made it difficult for general-purpose search engines, e.g. Google and Yahoo, to retrieve most of the relevant results in response to user queries. Vertical search engines, each specialized in a specific topic, have therefore become vital. A vertical search engine is built with the help of a focused crawler, which traverses the Web, selecting pages relevant to a predefined topic and ignoring those outside it. The focused crawler is guided toward relevant pages by a crawling strategy. In this paper, we present a new crawling strategy that helps in building a vertical search engine and keeps the crawler focused on the user's interests within the topic. We build a model of the Web page features that distinguish relevant Web documents from irrelevant ones. The model is obtained through supervised learning: each Web page is treated as an object with a set of features, and the values of those features determine the relevance of the page through a Bayesian model. Results from practical experiments demonstrate the efficiency of the proposed crawling strategy.
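The idea of treating a Web page as an object whose feature values determine its relevance can be illustrated with a minimal sketch. The abstract does not list the features or the exact form of the Bayesian model, so everything below is an assumption made for illustration: a naive Bayes classifier with Laplace smoothing, three hypothetical boolean features (`topic_in_title`, `topic_in_anchor`, `parent_relevant`), and a tiny hand-made training set.

```python
import math
from collections import defaultdict


class PageRelevanceModel:
    """Naive Bayes over categorical page features.

    Assumption: the feature-independence (naive) form is chosen here for
    illustration only; it is not taken from the paper.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # Laplace smoothing constant
        self.class_counts = defaultdict(int)      # label -> number of pages
        self.value_counts = defaultdict(int)      # (label, feature, value) -> count
        self.feature_values = defaultdict(set)    # feature -> values seen in training

    def fit(self, pages, labels):
        """Count feature values per class from labeled example pages."""
        for page, label in zip(pages, labels):
            self.class_counts[label] += 1
            for feature, value in page.items():
                self.value_counts[(label, feature, value)] += 1
                self.feature_values[feature].add(value)
        return self

    def posterior(self, page):
        """Return P(label | page features) for every label seen in training."""
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n in self.class_counts.items():
            score = math.log(n / total)           # log prior
            for feature, value in page.items():
                k = len(self.feature_values[feature] | {value})
                count = self.value_counts.get((label, feature, value), 0)
                score += math.log((count + self.alpha) / (n + self.alpha * k))
            log_scores[label] = score
        # Normalize in log space to avoid underflow.
        top = max(log_scores.values())
        weights = {lab: math.exp(s - top) for lab, s in log_scores.items()}
        z = sum(weights.values())
        return {lab: w / z for lab, w in weights.items()}


# Hypothetical training set: each page is an object with boolean features.
TRAIN_PAGES = [
    {"topic_in_title": True,  "topic_in_anchor": True,  "parent_relevant": True},
    {"topic_in_title": True,  "topic_in_anchor": False, "parent_relevant": True},
    {"topic_in_title": False, "topic_in_anchor": True,  "parent_relevant": True},
    {"topic_in_title": False, "topic_in_anchor": False, "parent_relevant": False},
    {"topic_in_title": False, "topic_in_anchor": False, "parent_relevant": True},
    {"topic_in_title": False, "topic_in_anchor": True,  "parent_relevant": False},
]
TRAIN_LABELS = ["relevant"] * 3 + ["irrelevant"] * 3

model = PageRelevanceModel().fit(TRAIN_PAGES, TRAIN_LABELS)

# Candidate pages the crawler could visit next (hypothetical URLs).
candidates = {
    "http://example.org/a": {"topic_in_title": True,  "topic_in_anchor": True,
                             "parent_relevant": True},
    "http://example.org/b": {"topic_in_title": False, "topic_in_anchor": False,
                             "parent_relevant": False},
}

# Visit the candidates with the highest estimated relevance first.
ranked = sorted(candidates,
                key=lambda url: model.posterior(candidates[url])["relevant"],
                reverse=True)
```

In this sketch the posterior probability of the `relevant` class serves as the crawl priority, so `ranked` lists the page whose features all point toward the topic ahead of the one whose features all point away from it. A real crawler would extract the feature values from fetched pages and their in-links rather than from a hand-written table.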
