Avenue U

posts(42) comments(0) trackbacks(0)
  • BlogJava
  • 联系
  • RSS 2.0 Feed 聚合
  • 管理

常用链接

  • 我的随笔
  • 我的评论
  • 我的参与

留言簿

  • 给我留言
  • 查看公开留言
  • 查看私人留言

随笔分类

  • C++(1)
  • Core Java(2)
  • My Master-degree Project(33)
  • SSH(4)
  • struts2(1)

随笔档案

  • 2009年7月 (1)
  • 2009年6月 (41)

Core Java

最新随笔

  • 1. String Stream in C++
  • 2. Validators in Struts2
  • 3. An Interceptor Example in Strut2-Spring-Hibernate Application
  • 4. 3 Validators in Struts2-Spring-Hibernate
  • 5. Strut2-Spring-Hibernate under Lomboz Eclipse3.3
  • 6. Run Spring by Maven2 in Vista
  • 7. Appendix B
  • 8. 5 Conclusion
  • 9. 4.7 Sentence Rank on Yahoo News Page
  • 10. 4.6 Sentence Rankv

搜索

  •  

最新评论

阅读排行榜

评论排行榜

View Post

2.5.2 Word-Rank on Web Pages

Xiaojun Wang applied the word-rank algorithm to the LS extraction on web pages for finding lost or related web pages [11]. He also compared the results from this graph-based ranking algorithm with the traditional ways in extracting the terms from documents, such as TF, DF, TFIDF, PW, TF3DF2, TF4DF1, TFIDF3DF2 and TFIDF4DF1. He pointed out, with word rank algorithm, which takes the semantic relation between words into account and chooses the most representative and salient words as Lexical Signatures, the highly relevant web pages can be found when the desired web pages can not be retrieved [11].

In their experiments, Wang not only used the basic word-rank algorithm, but also combined it with DF to select the terms. In [11], Wang only constructed undirected weighted graph model G=(V, E), V is the vertex set containing all words except stop words. E is a set of undirected and weighted edges. Each vertex’s initial score is the normalized  value and set the damping factor d=0.85. Wang did not use window size but WordNet [14] to recognize the semantically related words and Wang did not mention the value of weight on edges. These 2 detailed implementations are definitely related to a large amount of work but they were not listed in their paper “WordRank-Based Lexical Signatures for Finding Lost or Related Web Pages” [11].

   [11] 2-8

   [11] 2-9

Similar to 2-7 and set the convergence threshold to 0.0001, 2-8 is run on the graph until it converges based on 2-9.

In Wang’s experiments, he selected Google and MSN Search, randomly crawled 2000 URLs from domain DMOZ.org and kept 1337 pages after filtering out the unqualified pages, such as too short in content, non-HTML format like .pdf, .ps and .doc. Wang constructed each query with 5 terms by implementing TF, DF, TFIDF, PW, TF3DF2, TF4DF1, TFIDF3DF2, TFIDF4DF1, WordRank, WordRank3DF2 and WordRank4DF1. Including the unique returned by SE, top 1 and top 10, the average success rate among 1337 pages are generally from 40% - 60%, for Google, except WordRank3Df2 which is a little higher than 60%. Meanwhile, the results from MSN show poor performance in TF, which is lower than 30%, TFIDF is lower than 40%, the others are between 40% and 60%.

Figure2.15 Retrieval performance of LS from Google search [11]

Figure2.16 Retrieval performance of LS from MSN live search [11]

In the summarization, Wang concluded that DF was the best method for uniquely identifying the desired documents; TF was easy to compute and did not need to be updated unless documents were modified; TFIDF and the hybrid method combining TFIDF and DF were good candidates for extracting the desired documents [11]. By computing the average cosine similarity values of top 10 returned pages with the desired page, WordRank based methods such as WordRank3DF2 are best for retrieving highly relevant documents [11].


posted on 2009-06-18 03:08 JosephQuinn 阅读(238) 评论(0)  编辑  收藏 所属分类: My Master-degree Project

新用户注册  刷新评论列表  

只有注册用户登录后才能发表评论。


网站导航:
博客园   IT新闻   Chat2DB   C++博客   博问   管理
相关文章:
  • Appendix B
  • 5 Conclusion
  • 4.7 Sentence Rank on Yahoo News Page
  • 4.6 Sentence Rankv
  • 4.5 Random pick sentence
  • 4.4 Word Rank
  • 4.3 Google search tips: meta keys and meta description
  • 4.2 Title
  • 4.1 The basics
  • 3.5 Deep Web Search Engine
 
 
Powered by:
BlogJava
Copyright © JosephQuinn