Researchers have spent considerable effort exploring how many terms in a lexical signature (LS) give the best results. After extensive experiments, Martin Klein and Michael L. Nelson concluded that 5 to 7 terms per LS are sufficient for robust hyperlinks [2]. They not only defined an LS as a small set of terms derived from a document that captures the "aboutness" of that document [3], but also showed that an LS derived from a web page can rediscover the page at a different URL as well as find relevant pages on the Internet [3]. Through experiments on a large number of web pages from 1996 to 2007, downloaded from the Internet Archive (http://www.archive.org/index.php), they claimed that 5-, 6- and 7-term LSs performed best at returning the URLs of interest among the top 10 results from Google, Yahoo, MSN Live, the Internet Archive, the European Archive, CiteSeer and NSDL [3]. By applying Equations 2-1 and 2-2, the LS score versus the number of terms per query was derived, as shown in Figure 2.4.
Figure 2.4: LS Performance by Number of Terms [3]
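To illustrate how a curve like the one in Figure 2.4 is produced, the following Python sketch loops over LS lengths and records the rank at which each page's own URL is returned. The make_ls and search parameters are hypothetical stand-ins for an LS generator and a search engine API (the authors queried Google, Yahoo, MSN Live and others); this is a sketch of the experimental loop under those assumptions, not the authors' actual code.

```python
def rank_of(target_url, results):
    """Return the 1-based rank of target_url in a result list, or None if absent."""
    for i, url in enumerate(results, start=1):
        if url == target_url:
            return i
    return None

def evaluate_ls_lengths(pages, make_ls, search, max_terms=15):
    """For each LS length k, collect the rank of each page's own URL.

    pages   -- list of (url, text) pairs sampled from the archive
    make_ls -- hypothetical LS generator: (text, k) -> list of k terms
    search  -- hypothetical search engine API: query string -> list of URLs
    """
    ranks_by_k = {}
    for k in range(2, max_terms + 1):
        ranks = []
        for url, text in pages:
            query = " ".join(make_ls(text, k))
            ranks.append(rank_of(url, search(query)))
        ranks_by_k[k] = ranks
    return ranks_by_k
```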
When reviewing Phelps and Wilensky's research, their experiments also showed that, with LS terms chosen in decreasing TF-IDF order, 50% of URLs were returned as the top-1 result, while 30% of URLs failed to be re-located [3]. Meanwhile, they carefully studied techniques for estimating IDF values, a non-trivial issue in generating LSs for web pages. In their 2008 paper, "A comparison of techniques for estimating IDF values to generate lexical signatures for the web" [19], they introduced three quite different ways to estimate term IDF values and carefully examined their performance:
1. A local universe: a set of pages downloaded monthly from 98 websites, from 1996 to September 2007 [19].
2. Screen scraping of the Google web interface, generated in January 2008 [19].
3. The Google N-Gram (NG) corpus, distributed in 2006 [19].
They compared these three IDF estimation techniques and claimed that both the local-universe-based data and the screen-scraping-based data yield results similar to their baseline, the Google N-Gram-based data.
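To make the term selection concrete, here is a minimal Python sketch that builds an LS by ranking a document's terms in decreasing TF-IDF order, with the IDF values supplied through a pluggable document-frequency table that could be populated from any of the three sources above. The tokenizer, the sample df_table, and the corpus size are illustrative assumptions, not the authors' implementation.

```python
import math
import re
from collections import Counter

def lexical_signature(text, df_table, corpus_size, k=7):
    """Pick the k highest TF-IDF terms of `text` as its lexical signature.

    df_table    -- term -> document frequency, from any IDF source
                   (local universe, screen scraping, or an n-gram corpus)
    corpus_size -- number of documents behind df_table
    """
    terms = re.findall(r"[a-z0-9]+", text.lower())  # naive tokenizer (assumption)
    tf = Counter(terms)
    scores = {}
    for term, freq in tf.items():
        df = df_table.get(term, 1)  # unseen terms get df = 1 to keep IDF finite
        idf = math.log(corpus_size / df)
        scores[term] = (freq / len(terms)) * idf
    # Terms in decreasing TF-IDF order, as in Phelps and Wilensky's scheme.
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical usage: df_table could be built from the local universe,
# scraped hit counts, or Google N-Gram frequencies.
df_table = {"lexical": 120, "signature": 450, "the": 98000, "web": 54000}
print(lexical_signature("The lexical signature of a web page ...", df_table, 100000))
```

With the document-frequency table bound to one of the three sources, a function like this could serve as the make_ls generator in the earlier sketch.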
Besides listing the detailed percentages of success and failure in retrieving a URL, they used the following two pairs of equations from paper [3] to evaluate the scores of LSs: the fair score and the optimistic score.
$$S_{fair}(i) = \begin{cases} \dfrac{R_{max} - R(i) + 1}{R_{max}} & \text{if } R(i) \le R_{max} \\ 0 & \text{otherwise} \end{cases} \quad [3] \qquad (2\text{-}1)$$

$$\bar{S}_{fair} = \frac{1}{N} \sum_{i=1}^{N} S_{fair}(i) \quad [3] \qquad (2\text{-}2)$$
Here $R(i)$ is the rank of the $i$-th page returned by the search engine after sending the query; the larger its value, the lower the fair score. $N$ is the total number of sample pages in their experiments, which is 98, and $\bar{S}_{fair}$ is the average fair score value.
$$S_{opt}(i) = \begin{cases} 1 & \text{if } R(i) \le R_{max} \\ 0 & \text{otherwise} \end{cases} \quad [3] \qquad (2\text{-}3)$$

$$\bar{S}_{opt} = \frac{1}{N} \sum_{i=1}^{N} S_{opt}(i) \quad [3] \qquad (2\text{-}4)$$
In the optimistic score equations, $S_{opt}$ differs from $S_{fair}$: while $S_{fair}$ is determined by the page's exact rank, $S_{opt}$ depends only on whether the page appears within the top $R_{max}$ results. $\bar{S}_{opt}$ is the average optimistic score value.
They set $R_{max} = 100$, which makes $S_{fair}$ always positive if the desired page appears in the first 100 results from the search engine. If $R(i) > R_{max}$, i.e., the desired page does not appear in the first 100 results, then $S_{fair} = 0$ and $S_{opt} = 0$. The final score results covered queries of 2 to 15 terms, with scores ranging from 0.2 to 0.8. They also concluded that the scores for a single page over the years 1996 to 2007 ranged from 0.1 to 0.6 [3].
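As a minimal sketch of this scoring scheme, assuming the reconstructed forms of Equations 2-1 to 2-4 above, the following Python code computes the fair and optimistic scores from a list of result ranks, where None marks a page not found within the top $R_{max}$ results; the sample ranks are invented for illustration.

```python
R_MAX = 100  # cutoff used by Klein and Nelson

def fair_score(rank, r_max=R_MAX):
    """Equation 2-1: rank-discounted score, 0 if outside the top r_max."""
    if rank is None or rank > r_max:
        return 0.0
    return (r_max - rank + 1) / r_max

def optimistic_score(rank, r_max=R_MAX):
    """Equation 2-3: full credit for any appearance in the top r_max."""
    if rank is None or rank > r_max:
        return 0.0
    return 1.0

def average(scores):
    """Equations 2-2 and 2-4: mean over the N sample pages."""
    return sum(scores) / len(scores)

# Hypothetical ranks for a handful of sample pages (None = not found).
ranks = [1, 3, 42, None, 87]
print(average([fair_score(r) for r in ranks]))        # average S_fair
print(average([optimistic_score(r) for r in ranks]))  # average S_opt
```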
Further details and score curves from their paper are not included in this project report.