2.5.2 Word-Rank on Web Pages


Xiaojun Wang applied the word-rank algorithm to lexical signature (LS) extraction from web pages, with the goal of finding lost or related web pages ^[11]. He also compared the results of this graph-based ranking algorithm with traditional term-extraction methods such as TF, DF, TFIDF, PW, TF3DF2, TF4DF1, TFIDF3DF2 and TFIDF4DF1. He pointed out that the word-rank algorithm takes the semantic relations between words into account and chooses the most representative and salient words as lexical signatures, so highly relevant web pages can still be found when the desired web page cannot be retrieved ^[11].

In the experiments, Wang used not only the basic word-rank algorithm but also combinations of it with DF to select terms. In [11], Wang constructed only an undirected weighted graph model G=(V, E), where V is the vertex set containing all words except stop words and E is a set of undirected, weighted edges. Each vertex’s initial score is a normalized value, and the damping factor is set to d=0.85. Instead of a window size, Wang used WordNet ^[14] to recognize semantically related words, and he did not state the values of the edge weights. These two implementation details certainly involve a large amount of work, but they are not described in the paper “WordRank-Based Lexical Signatures for Finding Lost or Related Web Pages” ^[11].

^[11] 2-8

^[11] 2-9

As with 2-7, the convergence threshold is set to 0.0001, and 2-8 is run iteratively on the graph until it converges according to 2-9.
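The iteration described above can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: 2-8 and 2-9 are not reproduced in this post, so the update rule below is assumed to be the standard weighted graph-ranking form used by TextRank, and convergence is assumed to mean that no score changes by more than the threshold between two iterations. The function name `word_rank`, the edge-dictionary input and the example edge weights are hypothetical, since [11] does not state how the edge weights are chosen.

```python
from collections import defaultdict


def word_rank(edges, d=0.85, tol=1e-4, max_iter=200):
    """Rank words on an undirected weighted graph.

    edges: dict mapping (word_a, word_b) -> weight (hypothetical input format).
    Assumed update rule (TextRank-style, not taken verbatim from [11]):
        score(v) = (1 - d) + d * sum_u( w(u, v) / strength(u) * score(u) )
    """
    # Build a symmetric adjacency map because the graph is undirected.
    nbrs = defaultdict(dict)
    for (a, b), w in edges.items():
        nbrs[a][b] = w
        nbrs[b][a] = w

    n = len(nbrs)
    # Normalized initial score: every word starts at 1 / |V|.
    score = {v: 1.0 / n for v in nbrs}
    # Total edge weight attached to each word.
    strength = {v: sum(nbrs[v].values()) for v in nbrs}

    for _ in range(max_iter):
        new = {}
        for v in nbrs:
            incoming = sum(w / strength[u] * score[u] for u, w in nbrs[v].items())
            new[v] = (1 - d) + d * incoming
        # Assumed convergence test: largest per-word change below the threshold.
        delta = max(abs(new[v] - score[v]) for v in nbrs)
        score = new
        if delta < tol:
            break
    return score


# Toy graph with made-up weights: "page" is related to every other word.
example_edges = {
    ("web", "page"): 1.0,
    ("page", "search"): 1.0,
    ("page", "signature"): 1.0,
}
scores = word_rank(example_edges)
top_word = max(scores, key=scores.get)
```

In this toy graph the most connected word ends up with the highest score, which is the behaviour the ranking relies on when picking salient words for a lexical signature.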

In his experiments, Wang used Google and MSN Search as the search engines, randomly crawled 2000 URLs from DMOZ.org, and kept 1337 pages after filtering out unqualified ones, such as pages with too little content or in non-HTML formats like .pdf, .ps and .doc. For each page, Wang constructed a 5-term query with each of TF, DF, TFIDF, PW, TF3DF2, TF4DF1, TFIDF3DF2, TFIDF4DF1, WordRank, WordRank3DF2 and WordRank4DF1. Counting the cases where the desired page was the unique result returned by the search engine (SE), ranked top 1, or ranked in the top 10, the average success rates over the 1337 pages on Google generally fall between 40% and 60%, except for WordRank3DF2, which is slightly above 60%. On MSN, TF performs poorly at below 30% and TFIDF is below 40%, while the other methods fall between 40% and 60%.
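To make the hybrid names concrete, here is a small Python sketch of how a 5-term query such as WordRank3DF2 might be assembled. It rests on an assumption that is not spelled out in this section: that "3DF2" means two terms are chosen by document frequency (rarest first) and three by the ranking score. The function name `hybrid_signature`, the tie-breaking rule and all numbers in the example are hypothetical.

```python
def hybrid_signature(rank_scores, doc_freq, n_rank=3, n_df=2):
    """Pick n_df rare terms by DF, then n_rank terms by ranking score.

    rank_scores: dict word -> score (e.g. from a word-rank run).
    doc_freq:    dict word -> document frequency in some corpus.
    Assumption: lower DF is preferred; ties go to the higher-scored word.
    """
    by_df = sorted(rank_scores, key=lambda t: (doc_freq[t], -rank_scores[t]))
    df_terms = by_df[:n_df]

    # Fill the remaining slots with the highest-scored words not yet chosen.
    remaining = [t for t in rank_scores if t not in df_terms]
    by_rank = sorted(remaining, key=lambda t: -rank_scores[t])
    return df_terms + by_rank[:n_rank]


# Made-up scores and document frequencies, for illustration only.
example_scores = {
    "lexical": 0.9, "signature": 0.8, "page": 0.7,
    "web": 0.6, "wordrank": 0.5, "dmoz": 0.4,
}
example_df = {
    "lexical": 50, "signature": 40, "page": 900,
    "web": 950, "wordrank": 2, "dmoz": 5,
}
query_terms = hybrid_signature(example_scores, example_df)
```

Calling the same function with `n_rank=4, n_df=1` would give the corresponding 4DF1 variant under the same assumption.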

Figure 2.15 Retrieval performance of LS from Google search ^[11]

Figure 2.16 Retrieval performance of LS from MSN live search ^[11]

In summary, Wang concluded that DF was the best method for uniquely identifying the desired documents; TF was easy to compute and did not need to be updated unless documents were modified; and TFIDF and the hybrid methods combining TFIDF with DF were good candidates for extracting the desired documents ^[11]. Measured by the average cosine similarity between the top 10 returned pages and the desired page, WordRank-based methods such as WordRank3DF2 were the best for retrieving highly relevant documents ^[11].
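The relevance measure mentioned above, the average cosine similarity between the desired page and the top 10 returned pages, can be sketched in Python as follows. This is an illustration under assumptions rather than the procedure of [11]: pages are represented as raw term-frequency vectors over whitespace-separated tokens, and the function names `cosine_similarity` and `average_similarity` are hypothetical. The term weighting actually used in [11] is not stated in this section.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts using raw term-frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def average_similarity(desired_page, returned_pages, k=10):
    """Mean cosine similarity between the desired page and the top-k results."""
    top = returned_pages[:k]
    if not top:
        return 0.0
    return sum(cosine_similarity(desired_page, page) for page in top) / len(top)


# Tiny made-up example: one identical result and one unrelated result.
avg = average_similarity("lexical signature", ["lexical signature", "other text"])
```

A higher average means the returned pages are closer in content to the desired page, even when the desired page itself is not among the results.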

Posted on 2009-06-18 03:08 by JosephQuinn. Category: My Master-degree Project
