5 Conclusion
5.1 Summaries
With Google web search, the overall average successful retrieval rates of DF,
TF3DF2, TF5DF5, TFIDF3DF2, TFIDF5DF5, WordRank3DF2 and WordRank5DF5 all exceed
70%. Directed-forward sentence rank, by contrast, achieves better results with
Yahoo search.
The sentence-based methods (random sentence pick, undirected-graph sentence
rank, directed-forward sentence rank and directed-backward sentence rank) reach
maximum successful retrieval rates above 74%, higher than the term-based
algorithms. The average success rate is higher with Google than with Yahoo for
every method except directed-forward sentence rank, which scores 68.44% with
Google and 68.99% with Yahoo, a difference of less than 0.6 percentage points.
When a query contains fewer than 5 terms, the algorithms with a high DF ratio,
such as DF, TF3DF2, TF5DF5, TFIDF3DF2, TFIDF5DF5, WordRank3DF2 and WordRank5DF5,
have a higher successful retrieval rate than the others.
| # | Method | Google (avg.) | Google (rate) | Yahoo (avg.) | Yahoo (rate) |
|---|---|---|---|---|---|
| 1 | Title | 94.67 | 42.07% | 84.00 | 37.33% |
| 2 | TF | 88.33 | 39.26% | 80.67 | 35.85% |
| 3 | DF | 157.33 | 69.93% | 138.67 | 61.63% |
| 4 | TFIDF | 106.33 | 47.26% | 96.00 | 42.67% |
| 5 | PW | 89.00 | 39.56% | 78.67 | 34.96% |
| 6 | TF3DF2 | 139.67 | 62.07% | 130.00 | 57.78% |
| 7 | TF4DF1 | 95.33 | 42.37% | 89.67 | 39.85% |
| 8 | TF5DF5 | 143.67 | 63.85% | 133.00 | 59.11% |
| 9 | TFIDF3DF2 | 142.67 | 63.41% | 134.00 | 59.56% |
| 10 | TFIDF4DF1 | 112.00 | 49.78% | 104.00 | 46.22% |
| 11 | TFIDF5DF5 | 147.67 | 65.63% | 137.67 | 61.19% |
| 12 | WordRank | 71.00 | 31.56% | 67.33 | 29.93% |
| 13 | NounVerbsRank | 51.00 | 22.67% | 47.33 | 21.04% |
| 14 | WordRank3DF2 | 148.00 | 65.78% | 136.33 | 60.59% |
| 15 | WordRank4DF1 | 90.00 | 40.00% | 84.67 | 37.63% |
| 16 | WordRank5DF5 | 145.33 | 64.59% | 134.00 | 59.56% |
| 17 | WordRank3TFIDF2 | 59.33 | 26.37% | 57.00 | 25.33% |
| 18 | WordRank4TFIDF1 | 66.33 | 29.48% | 61.33 | 27.26% |
| 19 | WordRank5TFIDF5 | 57.00 | 25.33% | 50.67 | 22.51% |
| 20 | RandomSentence | 112.00 | 49.78% | 105.33 | 46.81% |
| 21 | SentenceRank | 111.00 | 49.33% | 117.00 | 52% |
| 22 | ForwardSentence | 109.67 | 48.74% | 121.33 | 53.93% |
| 23 | BackwardSentence | 105.33 | 46.81% | 103.00 | 45.78% |

Table 5.1 Average success rate for queries of 3, 4 and 5 terms
Figure 5.1
| # | Method | Google (avg.) | Google (rate) | Yahoo (avg.) | Yahoo (rate) |
|---|---|---|---|---|---|
| 1 | Title | 120.60 | 53.60% | 106.00 | 47.11% |
| 2 | TF | 157.60 | 70.04% | 129.20 | 57.42% |
| 3 | DF | 160.80 | 71.47% | 127.00 | 56.44% |
| 4 | TFIDF | 168.00 | 74.67% | 134.00 | 59.56% |
| 5 | PW | 159.00 | 70.67% | 123.60 | 54.93% |
| 6 | TF3DF2 | 166.40 | 73.96% | 138.80 | 61.69% |
| 7 | TF4DF1 | 168.20 | 74.76% | 141.80 | 63.08% |
| 8 | TF5DF5 | 163.60 | 72.71% | 134.40 | 59.73% |
| 9 | TFIDF3DF2 | 165.60 | 73.60% | 144.00 | 64.00% |
| 10 | TFIDF4DF2 | 169.20 | 75.20% | 144.20 | 64.09% |
| 11 | TFIDF5DF5 | 163.80 | 72.80% | 138.00 | 61.33% |
| 12 | WordRank | 154.20 | 68.53% | 130.40 | 57.96% |
| 13 | NounVerbsRank | 136.40 | 60.62% | 125.00 | 55.56% |
| 14 | WR3DF2 | 162.80 | 72.36% | 130.60 | 58.04% |
| 15 | WR4DF1 | 170.40 | 75.73% | 148.60 | 66.04% |
| 16 | WR5DF5 | 163.40 | 72.62% | 134.40 | 59.73% |
| 17 | WR3TFIDF2 | 154.20 | 68.53% | 127.60 | 56.71% |
| 18 | WR4TFIDF1 | 158.80 | 70.58% | 133.40 | 59.29% |
| 19 | WR5TFIDF5 | 148.80 | 66.13% | 122.40 | 54.40% |
| 20 | RandomSentence | 171.20 | 76.09% | 166.00 | 73.78% |
| 21 | SentenceRank | 165.40 | 73.51% | 162.40 | 72.17% |
| 22 | ForwardSentence | 171.20 | 76.09% | 169.80 | 75.47% |
| 23 | BackwardSentence | 166.20 | 73.87% | 159.20 | 70.76% |

Table 5.2 Average success rate for queries of 11, 12, 13, 14 and 15 terms
Figure 5.2
In Table 5.2, the gap between Google and Yahoo is narrower for the
sentence-based methods: for random sentence, undirected-graph sentence rank,
forward sentence rank and backward sentence rank it is below 5 percentage
points, while for every other method it exceeds 5 percentage points.
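The comparison can be checked with a short calculation. The sketch below copies a few (Google, Yahoo) rate pairs from Table 5.2 and computes the gap in percentage points; the names `TABLE_5_2`, `SENTENCE_METHODS` and `engine_gap` are illustrative and not part of the project code.

```python
# Success rates (%) copied from Table 5.2 as (Google, Yahoo) pairs.
# Only a subset of the 23 rows is listed to keep the sketch short.
TABLE_5_2 = {
    "Title": (53.60, 47.11),
    "TFIDF": (74.67, 59.56),
    "NounVerbsRank": (60.62, 55.56),
    "RandomSentence": (76.09, 73.78),
    "SentenceRank": (73.51, 72.17),
    "ForwardSentence": (76.09, 75.47),
    "BackwardSentence": (73.87, 70.76),
}

SENTENCE_METHODS = {
    "RandomSentence",
    "SentenceRank",
    "ForwardSentence",
    "BackwardSentence",
}


def engine_gap(rates):
    """Return the Google-minus-Yahoo gap in percentage points."""
    google, yahoo = rates
    return round(google - yahoo, 2)


gaps = {method: engine_gap(rates) for method, rates in TABLE_5_2.items()}

for method, gap in gaps.items():
    group = "sentence" if method in SENTENCE_METHODS else "term"
    print(f"{method:18s} {group:8s} gap = {gap:5.2f}")
```
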
5.2 Limitations
In this project, LS performance depends largely on the search engines and on
web page extraction. Google is well known for its PageRank strategy, rather
than a pure text-ranking strategy such as the one adopted by AltaVista [12].
Other link-based ranking strategies, such as hubs and authorities, also treat
link information as a key element in ranking URLs. During ranking, term
information such as TF or DF probably cannot lift a page to a high rank when a
large portion of pages all contain the terms in the search query. In this
project, a URL counts as matched only if it appears among the first 10 URLs
returned. Those top 10 results satisfy not only the term matching but also the
link criteria: after ranking, they are pages referenced by many other pages
with similar content. Therefore, a summary of a web page in fewer than 15 terms
may overlap heavily with other pages, and successful retrieval then depends on
ranking algorithms that weigh link structure heavily rather than relying on
pure text analysis. The chance that the original page appears in the top 10
URLs returned by the SE is therefore limited.
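The top-10 matching rule can be sketched as follows. This is a minimal illustration, not the project's code: the function names and the URL normalization (ignoring the scheme, a leading "www." and a trailing slash) are assumptions made for the example.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Hypothetical normalization: ignore scheme, 'www.' prefix,
    fragment and trailing slash when comparing two URLs."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def is_retrieved(original_url, result_urls, top_k=10):
    """True if the original page appears among the first top_k results."""
    target = normalize(original_url)
    return any(normalize(url) == target for url in result_urls[:top_k])
```

A page ranked eleventh or lower counts as a failure under this rule, even though the query did find it.
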
5.2.1 HTML Parsing and Text Extraction
The other challenge arises at the very beginning of the experiment: HTML
parsing and text extraction. More accurate text extraction yields a higher
successful retrieval rate, so the success rate depends largely on how well the
text of a given web page can be extracted. Isolating the main content or topic
of a web page is a non-trivial problem. In the rest of this report, I use some
of today's most viewed news pages as examples to show how the topic or main
content can be extracted from a web page.
Reported experiments show that many news web pages use <div> to divide their
content. This matters because the text in each division can then be extracted
in line with human visual cues. Below are 2 pages, from Yahoo and ABC News.
Figure 5.3 (a), (b)
In Figure 5.3, only the text in the red box is the main content; the text in
the boxes of other colors is either navigation links or advertisement links.
Fortunately, Yahoo and ABC are not alone: many other news websites, such as
CNN, BBC and Google News, also use <div> to separate web page content.
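One way to exploit the <div> structure is sketched below with Python's standard `html.parser` module. The heuristic is an assumption made for illustration, not the method used in the experiments: text is collected per <div>, text inside anchors, scripts and styles is ignored, and the division with the most remaining text is taken as the candidate main content.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = ("a", "script", "style")


class DivTextCollector(HTMLParser):
    """Collect the non-link text that sits directly in each <div>."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # one list of text chunks per <div>
        self.open_divs = []   # stack of indices into self.blocks
        self.skip_depth = 0   # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.blocks.append([])
            self.open_divs.append(len(self.blocks) - 1)
        elif tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.open_divs:
            self.open_divs.pop()
        elif tag in SKIPPED_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self.open_divs and not self.skip_depth:
            # Credit the text to the innermost open <div>.
            self.blocks[self.open_divs[-1]].append(text)


def largest_text_block(html):
    """Return the text of the <div> holding the most non-link text."""
    parser = DivTextCollector()
    parser.feed(html)
    parser.close()
    texts = [" ".join(chunks) for chunks in parser.blocks]
    return max(texts, key=len, default="")
```

On real pages the main content is often spread over several nested divisions, so this heuristic would need refinement before it could replace manual inspection.
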
The next step is to determine whether a given piece of text is the main
content, among phrases, passages, blocks and links. I propose classifying the
web page first, into 2 categories: text-based and link-based. The first
category corresponds to Figure 5.3 (a) and (b) above; the second covers home
pages or index pages, such as Figure 5.4.
Figure 5.4
Figure 5.4 contains many titles with links but hardly any paragraphs or
passages. A simple measure can be used to classify these 2 kinds of pages: the
ratio between the text within links and the text outside links. In Figure 5.4,
almost 90% of the text is enclosed in <A>, the anchor tags. In Figure 5.3 (a)
and (b), by contrast, a large part of the text lies outside anchor tags, and
that text is precisely the main content. If a page is content-based, the main
content, excluding navigation and advertisements, should be identified and
extracted properly. If it is link-based, then not only the phrases in the
anchor text but also the URL addresses in the anchors are significant and
cannot simply be removed during extraction.
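The ratio test can be sketched as follows. The code measures the share of visible characters that lie inside anchor tags, which corresponds to the "almost 90%" figure quoted for the index page. The class and function names and the 0.5 threshold are assumptions for illustration; the threshold has not been tuned on the experimental pages.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Count visible characters inside and outside <a> tags."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.linked_chars = 0
        self.plain_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth:
            self.anchor_depth -= 1
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        count = len("".join(data.split()))  # ignore whitespace
        if self.anchor_depth:
            self.linked_chars += count
        else:
            self.plain_chars += count


def link_text_ratio(html):
    """Share of visible text that is enclosed in anchor tags (0.0 to 1.0)."""
    counter = LinkTextCounter()
    counter.feed(html)
    counter.close()
    total = counter.linked_chars + counter.plain_chars
    return counter.linked_chars / total if total else 0.0


def classify_page(html, threshold=0.5):
    """Label a page; the 0.5 threshold is a placeholder assumption."""
    if link_text_ratio(html) >= threshold:
        return "link-based"
    return "content-based"
```
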
Figure 5.5
Noise such as the copyright notice in Figure 5.5 can also cause distraction.
For a more accurate query generation method, the text of the main content or
topic must be extracted and retained according to the type of the web page and
the position of the text within it.
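A simple filter for this kind of noise might look like the sketch below. The pattern list is hypothetical and deliberately short; it would also drop a legitimate sentence that happens to mention copyright, so it is only a starting point.

```python
import re

# Hypothetical markers of copyright boilerplate; a real page set
# would need its own, longer list.
NOISE_PATTERN = re.compile(
    r"(©|\(c\)|copyright|all rights reserved)", re.IGNORECASE
)


def drop_copyright_lines(lines):
    """Return the extracted text lines that contain no copyright marker."""
    return [line for line in lines if not NOISE_PATTERN.search(line)]
```
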
5.2.2 Solution
Figure 5.6 (a), (b)
Figure 5.6 is a demo of the software “Crunch”, which was developed at Columbia
University [20]. It is a good model for extracting a web page and converting it
into a pure text version according to the page's own structure. In Figure 5.6
(b), the page retains only font styles, font sizes and HREFs, which is
sufficient for further processing, such as with Michal Cutler’s theory. Due to
time limitations, the experiments on HTML extraction have not gone as deep as
“Crunch”. Meanwhile, because ranking takes link information into account, the
domain a page belongs to can be made part of the query. This helps locate the
page more accurately; however, the number of query terms grows when an HREF
address is combined into the query. The performance of queries that include a
domain or link has not yet been widely tested.
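The idea of adding the domain to a query can be sketched as follows. The function name and the choice to append the host name as one extra term are assumptions for illustration; as noted above, this variant has not been evaluated.

```python
from urllib.parse import urlsplit


def query_with_domain(terms, page_url):
    """Append the page's host name to the query terms (illustrative only)."""
    host = urlsplit(page_url).hostname  # lowercased, without port
    if not host:
        return list(terms)
    return list(terms) + [host]


query = query_with_domain(
    ["budget", "council", "vote"],
    "http://news.example.com/2009/story.html",
)
print(" ".join(query))
```

The query length grows by one term per appended host, which is the trade-off described above.
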