Xiaojun Wang
applied the WordRank algorithm to the extraction of Lexical Signatures (LS) from
web pages in order to find lost or related web pages [11]. He also compared the
results of this graph-based ranking algorithm with traditional methods of
extracting terms from documents, such as TF, DF, TFIDF, PW, TF3DF2, TF4DF1,
TFIDF3DF2 and TFIDF4DF1. He pointed out that because WordRank takes the semantic
relations between words into account and chooses the most representative and
salient words as Lexical Signatures, highly relevant web pages can be found
even when the desired web pages cannot be retrieved [11].
In his
experiments, Wang not only used the basic WordRank algorithm but also
combined it with DF to select the terms. In [11], Wang constructed only an
undirected weighted graph model G=(V, E), where V is the vertex set containing
all words except stop words and E is a set of undirected, weighted edges. Each
vertex’s initial score is set to a normalized value, and the damping factor is
set to d=0.85. Instead of a window size, Wang used WordNet [14] to recognize
semantically related words, but he did not state the values of the weights on
the edges. Both of these implementation details involve a considerable amount
of work, yet neither is described in the paper “WordRank-Based Lexical
Signatures for Finding Lost or Related Web Pages” [11].
Equation 2-8 [11]
Equation 2-9 [11]
As with Equation 2-7, the convergence threshold is set to 0.0001, and
Equation 2-8 is iterated on the graph until it converges according to
Equation 2-9.
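The iteration described above can be illustrated with a minimal sketch. It is not Wang's implementation: it assumes the standard weighted TextRank-style update, an initial score of 1/|V| as the "normalized value", and made-up edge weights, since [11] does not report them. The function name word_rank and its parameters are hypothetical.

```python
def word_rank(edges, d=0.85, tol=1e-4, max_iter=200):
    """Rank vertices of an undirected weighted graph (TextRank-style sketch).

    edges: dict mapping (u, v) pairs to a positive weight. The weights are
    illustrative assumptions; the source does not state how they were set.
    """
    # Build a symmetric adjacency map because the graph is undirected.
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w

    n = len(adj)
    # Assumption: the "normalized" initial score is taken to be 1/|V|.
    scores = {v: 1.0 / n for v in adj}
    # Total edge weight attached to each vertex, used to normalize its votes.
    strength = {v: sum(nbrs.values()) for v, nbrs in adj.items()}

    for _ in range(max_iter):
        new_scores = {}
        for v, nbrs in adj.items():
            # Each neighbour u passes on a share of its score proportional
            # to the weight of the edge (u, v).
            incoming = sum(w / strength[u] * scores[u] for u, w in nbrs.items())
            new_scores[v] = (1 - d) + d * incoming
        # Stop once the largest per-vertex change falls below the threshold.
        delta = max(abs(new_scores[v] - scores[v]) for v in adj)
        scores = new_scores
        if delta < tol:
            break
    return scores


if __name__ == "__main__":
    demo = {("lexical", "signature"): 1.0, ("signature", "web"): 1.0}
    for word, score in sorted(word_rank(demo).items(), key=lambda kv: -kv[1]):
        print(word, round(score, 3))
```

In this toy graph the word connected to both of the others receives the highest score, which is the behaviour a graph-based ranking is expected to show.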
In his
experiments, Wang used Google and MSN Search. He randomly crawled 2000 URLs
from DMOZ.org and kept 1337 pages after filtering out unqualified pages, such
as those with too little content or in non-HTML formats like .pdf, .ps and
.doc. Wang constructed each query from 5 terms selected by TF, DF, TFIDF, PW,
TF3DF2, TF4DF1, TFIDF3DF2, TFIDF4DF1, WordRank, WordRank3DF2 and WordRank4DF1.
Counting the cases in which the desired page was the unique result returned by
the search engine, the top-1 result, or among the top 10 results, the average
success rate over the 1337 pages was generally between 40% and 60% for Google,
except for WordRank3DF2, which was slightly above 60%. The results from MSN, by
contrast, show poor performance for TF (below 30%) and TFIDF (below 40%), while
the other methods fall between 40% and 60%.
Figure 2.15 Retrieval performance of LS from Google search [11]
Figure 2.16 Retrieval performance of LS from MSN live search [11]
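The hybrid names used in the experiments can be illustrated with a short sketch. It assumes that a name such as TF3DF2 or WordRank3DF2 means two terms chosen by lowest DF plus three terms chosen by the primary score (TF, TFIDF or WordRank); that reading is not defined in this section. The function hybrid_signature, its tie-breaking rule and the sample numbers are hypothetical.

```python
def hybrid_signature(primary_scores, doc_freq, n_primary=3, n_df=2):
    """Build a 5-term signature from a primary score and document frequency.

    primary_scores: dict term -> score (e.g. TF, TFIDF or a WordRank score).
    doc_freq: dict term -> number of documents containing the term.
    Assumption: the rarest terms (lowest DF) are picked first, then the
    remaining slots are filled by the highest primary score.
    """
    terms = list(primary_scores)
    # Rarest terms first; ties go to the higher primary score, then A-Z.
    by_df = sorted(terms, key=lambda t: (doc_freq.get(t, 0), -primary_scores[t], t))
    df_terms = by_df[:n_df]
    # Fill the remaining slots from the terms not already chosen.
    remaining = [t for t in terms if t not in df_terms]
    by_primary = sorted(remaining, key=lambda t: (-primary_scores[t], t))
    return df_terms + by_primary[:n_primary]


if __name__ == "__main__":
    tf = {"web": 5, "page": 4, "lexical": 3, "signature": 3,
          "search": 2, "wordrank": 1, "dmoz": 1}
    df = {"web": 900, "page": 800, "search": 700, "signature": 60,
          "lexical": 50, "dmoz": 5, "wordrank": 2}
    print(" ".join(hybrid_signature(tf, df)))  # a TF3DF2-style query
```

Passing WordRank scores instead of term frequencies as primary_scores would give a WordRank3DF2-style query under the same assumption.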
In
summary, Wang concluded that DF was the best method for uniquely
identifying the desired documents; that TF was easy to compute and did not need
to be updated unless documents were modified; and that TFIDF and the hybrid
methods combining TFIDF and DF were good candidates for extracting the desired
documents [11]. Measured by the average cosine similarity between the
top 10 returned pages and the desired page, WordRank-based methods such as
WordRank3DF2 were the best for retrieving highly relevant documents [11].
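The relevance measure mentioned above, the average cosine similarity between the top 10 returned pages and the desired page, can be sketched as follows. The sketch assumes each page is represented by raw term counts; [11] may have used a different term weighting. The function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a_terms, b_terms):
    """Cosine similarity of two token lists using raw term counts."""
    a, b = Counter(a_terms), Counter(b_terms)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as unrelated
    return dot / (norm_a * norm_b)


def mean_top_k_similarity(desired_terms, returned_pages, k=10):
    """Average similarity between the desired page and the top-k results."""
    top = returned_pages[:k]
    if not top:
        return 0.0
    return sum(cosine_similarity(desired_terms, p) for p in top) / len(top)


if __name__ == "__main__":
    desired = "lexical signature web page".split()
    results = ["lexical signature web page".split(),
               "cooking recipes for dinner".split()]
    print(mean_top_k_similarity(desired, results))
```

A higher average indicates that the pages returned for a signature are, on the whole, closer in content to the desired page, even when the desired page itself is not among them.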