Topic Crawling Strategy Based on Wikipedia and Web-Page Similarity Analysis


龙源期刊网 (Longyuan Journal Network) http://www.qikan.com.cn


Authors: LUAN Xia, ZHAO Xiao-nan

Source: 《现代电子技术》 (Modern Electronics Technique), 2014, Issue 20

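Abstract: To address the shortcomings of commonly used crawler strategies, a topic crawling strategy that combines Wikipedia with web-page similarity analysis is proposed. The structure of the Wikipedia category tree is used to describe the topic; after a web page is downloaded, it is processed accordingly, and the priority of each candidate link is computed by combining text relevance with Web link analysis. Experiments show that the relevance of this crawler's results to the topic is markedly higher than that of a traditional crawler, and that crawl recall improves to some extent. The topic description method and crawling strategy have some value for wider application, and the crawler is somewhat innovative, particularly in the field of genetically modified organisms.

Keywords: Wikipedia; text relevance; link analysis; similarity calculation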

CLC number: TN911-34; TP391.4 Document code: A Article ID: 1004-373X(2014)20-0035-03

Topic crawling strategies based on Wikipedia and analysis of web-page similarity

LUAN Xia¹, ZHAO Xiao-nan²

(1. The Network Center, 323rd Hospital of Chinese People’s Liberation Army, Xi’an 710054, China; 2. Unit 68303 of PLA, Wuwei 733000, China)

Abstract: To overcome the weaknesses of current topic crawling strategies, this paper proposes a topic crawling strategy based on Wikipedia and web-page similarity analysis. The Wikipedia category tree structure is used to describe the topics, and downloaded web pages are then processed accordingly. Finally, the priorities of the candidate links are calculated by combining text relevance with Web link analysis. The experimental results indicate that the new crawler returns results that are markedly more relevant to the topic than those of a traditional crawler, and that its crawl recall is somewhat improved. The topic description method and the crawling strategy have some value for wider application, and the crawler is somewhat innovative, especially in the field of genetically modified organisms.
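
The abstract states only that candidate-link priority combines text relevance with Web link analysis; it does not give the formula. The following is a minimal sketch under stated assumptions, not the authors' method: text relevance is taken to be the cosine similarity between topic terms and a link's anchor and parent-page text, the link-analysis score is supplied by the caller as a number in [0, 1], and the weight `alpha`, the 50/50 anchor/parent split, and the names `link_priority` and `Frontier` are all hypothetical.

```python
import heapq
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_priority(topic_terms, anchor_text, parent_text, link_score, alpha=0.7):
    """Hypothetical priority: weighted mix of text relevance and a link score.

    alpha and the equal anchor/parent weighting are illustrative
    assumptions, not values taken from the paper.
    """
    topic = Counter(topic_terms)
    anchor_sim = cosine_similarity(topic, Counter(anchor_text.lower().split()))
    parent_sim = cosine_similarity(topic, Counter(parent_text.lower().split()))
    text_relevance = 0.5 * anchor_sim + 0.5 * parent_sim
    return alpha * text_relevance + (1 - alpha) * link_score


class Frontier:
    """Priority queue of candidate URLs; the highest priority is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, priority):
        # Skip URLs that were already queued once.
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated priority.
            heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        neg_priority, url = heapq.heappop(self._heap)
        return url, -neg_priority


# Usage with made-up URLs and texts: the on-topic link is crawled first
# even though the off-topic link has a higher link-analysis score.
topic_terms = ["transgenic", "gene", "crop"]
frontier = Frontier()
frontier.push("http://example.org/gm-crops",
              link_priority(topic_terms, "transgenic crop research",
                            "gene editing in crop plants", 0.4))
frontier.push("http://example.org/sports",
              link_priority(topic_terms, "football results",
                            "weekend sports news", 0.9))
next_url, next_priority = frontier.pop()
```

In a sketch like this, the topic terms would come from the Wikipedia category tree description, and `alpha` controls how much link structure can outweigh text relevance.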

Keywords: topic crawling; Wikipedia; text relevance; link analysis; similarity calculation

0 Introduction

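In recent years, with the development and spread of Internet technology, the amount of information on the web has kept growing, and obtaining useful resources from it efficiently has become crucial. The topic crawler is one of the techniques for addressing this problem; it is, under the guidance of a predefined topic,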