As stated in Chapter 3, sentence rank can improve linguistic summarization significantly compared with the traditional TF or DF methods. To avoid the iterative cost of the graph-based ranking algorithm, a simpler scheme is also tested: pick a sentence at random and take its first 3 to 15 words, in their original order, as the search query. The results below show that even with randomly picked sentences, once the query reaches about 10 terms the success rate rises sharply; some rates exceed 75%, a level not easily reached by the carefully designed retrieval methods discussed earlier.
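The random-sentence query construction described above can be sketched as follows. The sentence splitter is a naive simplification, and `pick_random_query` is a hypothetical name for illustration, not code from the project:

```python
import random
import re

def pick_random_query(text, n_words=10):
    """Pick a random sentence from `text` and return its first
    `n_words` words, in original order, as a search query."""
    # Naive sentence split on ., ! and ? -- a stand-in for whatever
    # splitter the project actually used.
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    sentence = random.choice(sentences)
    # The experiments vary the query length from 3 to 15 words.
    n_words = max(3, min(n_words, 15))
    return " ".join(sentence.split()[:n_words])

page_text = ("The storm dumped two feet of snow on the East Coast. "
             "Airports closed.")
print(pick_random_query(page_text, n_words=5))
```

Because the sentence is chosen at random, repeated runs produce different queries; the experiments below average over many pages to smooth this out.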
| Words in query | Google hits | Google success rate | Yahoo hits | Yahoo success rate |
| --- | --- | --- | --- | --- |
| 3 | 88.00 | 39.11% | 90.00 | 40.00% |
| 4 | 114.00 | 50.67% | 102.00 | 45.33% |
| 5 | 134.00 | 59.56% | 124.00 | 55.11% |
| 6 | 150.00 | 66.67% | 144.00 | 64.00% |
| 7 | 155.00 | 68.89% | 137.00 | 60.89% |
| 8 | 162.00 | 72.00% | 154.00 | 68.44% |
| 9 | 161.00 | 71.56% | 144.00 | 64.00% |
| 10 | 168.00 | 74.67% | 151.00 | 67.11% |
| 11 | 168.00 | 74.67% | 151.00 | 67.11% |
| 12 | 170.00 | 75.56% | 168.00 | 74.67% |
| 13 | 172.00 | 76.44% | 168.00 | 74.67% |
| 14 | 171.00 | 76.00% | 169.00 | 75.11% |
| 15 | 175.00 | 77.78% | 174.00 | 77.33% |
| Average | 152.92 | 67.97% | 144.31 | 64.14% |
Table 4.24 Random sentence pick: successful retrievals and success rates on Google and Yahoo
(a) (b)
Figure 4.26 Random sentence pick from (a) Google and (b) Yahoo results
The text in HTML's title tag plays a vital role in web page retrieval. At the beginning of this project, an extensive set of experiments was conducted using the title method, in the belief that a carefully and properly composed title-text query could reach a 90% success rate. Figure 4.12 shows that the title method is also stable across query lengths. It is important to note that, from Figure 4.1 to Figure 4.10, the better results of the classic methods only indicate that the HTML extraction performs well: it filters out the structural HTML tags and functional scripts that would otherwise be major distractions for the downstream processing of the target page, because the basic retrieval process is designed only for pure text without structural markup. For example, unless they are filtered out in the pre-processing step, HTML tags such as 'td' and 'tr' would have very high term frequencies, while function and variable names in JavaScript would have very low document frequencies. With the title method, by contrast, it is much easier to extract only the text between <title> and </title>.
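Extracting only the title text is indeed straightforward. A minimal sketch, using a case-insensitive regular expression rather than whatever parser the project actually used:

```python
import re

def extract_title(html):
    """Return the text between <title> and </title>, or '' if absent.
    A regex sketch; the project's real extractor is not shown here."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""

html = "<html><head><title>Winter storm hits East Coast</title></head></html>"
print(extract_title(html))  # -> Winter storm hits East Coast
```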
| Words in query | Google hits | Google success rate | Yahoo hits | Yahoo success rate |
| --- | --- | --- | --- | --- |
| 3 | 82.00 | 36.44% | 72.00 | 32.00% |
| 4 | 91.00 | 40.44% | 86.00 | 38.22% |
| 5 | 111.00 | 49.33% | 94.00 | 41.78% |
| 6 | 116.00 | 51.56% | 99.00 | 44.00% |
| 7 | 116.00 | 51.56% | 102.00 | 45.33% |
| 8 | 115.00 | 51.11% | 102.00 | 45.33% |
| 9 | 115.00 | 51.11% | 101.00 | 44.89% |
| 10 | 115.00 | 51.11% | 101.00 | 44.89% |
| 11 | 115.00 | 51.11% | 102.00 | 45.33% |
| 12 | 117.00 | 52.00% | 102.00 | 45.33% |
| 13 | 118.00 | 52.44% | 103.00 | 45.78% |
| 14 | 126.00 | 56.00% | 111.00 | 49.33% |
| 15 | 127.00 | 56.44% | 112.00 | 49.78% |
| Average | 112.62 | 50.05% | 99.00 | 44.00% |
Table 4.11 Title tag method: successful retrievals and success rates on Google and Yahoo
(a) (b)
Figure 4.12 Use title terms as the search query
A practical implication of this project is whether the methodology for testing general search engines can be applied to testing deep web search engines. General search engines such as Google and Yahoo are widely trusted for the quality of their results and links. However, many sites do not allow their documents to be indexed, and instead make them accessible only through their own search engines; such sites are part of the so-called Deep Web [1][17]. A deep web search engine covers only its own database: pages, data, and documents that are kept private and cannot be found by general search engines. Take www.taobao.com as an example. It is an online commercial trading site like www.ebay.com; after its negotiations with the big search engine companies broke down, Taobao blocked general search engines such as www.baidu.com and www.google.com from accessing its commodity results. As a consequence, people who want commodity and price information have to go directly to Taobao's own search interface and browse the result items on Taobao's website. Such search engines are probably developed in-house or by contracted software teams, so their performance is an interesting topic in its own right, distinct from the engines generally accepted by the public such as Google and Yahoo. A detailed introduction to the deep web and the implementation of deep web search engines is beyond the scope of this project, but the practical value of this project is that it offers a feasible way to test local, small search engines embedded in individual web sites.
If there is a URL match or a content match, the retrieval is counted as a success. If the URLs do not match, which happens because URLs change over time [2][3], a comparison between the original page and the retrieved pages is indispensable, and it is carried out in two ways: manually and automatically. Manually checking all the content of the original and retrieved pages is time consuming but guarantees precise results; in this project, around 200 pages from the data source were checked manually. The automatic comparison between each test page and the result pages from the search engine, rather than brute force, also requires the HTML page preprocessing of step 2.
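The success check just described can be sketched as follows. Here `difflib`'s ratio is only a stand-in for the report's cosine measure (equation 3-2), and the 0.9 threshold is an assumption for illustration, not a value taken from the report:

```python
import difflib

def is_success(orig_url, result_url, orig_text, result_text, threshold=0.9):
    """Count a retrieval as successful on a URL match, or otherwise on
    a content match between the preprocessed pages. difflib's ratio
    stands in for the report's TF cosine similarity (equation 3-2);
    the 0.9 threshold is an assumed value."""
    if orig_url == result_url:
        return True
    ratio = difflib.SequenceMatcher(None, orig_text, result_text).ratio()
    return ratio >= threshold

# URLs differ (the site appended a suffix), but the content still matches.
print(is_success(
    "http://news.yahoo.com/s/ap/20090302/ap_on_re_us/winter_storm",
    "http://news.yahoo.com/s/ap/20090303/ap_on_re_us/winter_storm_43",
    "A winter storm dumped heavy snow on the East Coast on Monday.",
    "A winter storm dumped heavy snow on the East Coast on Monday morning.",
))
```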
(3-1)

sim(d1, d2) = Σw TFw(d1) · TFw(d2) / ( √(Σw TFw(d1)²) · √(Σw TFw(d2)²) )    (3-2)
In 3-2, TFw is the term frequency of word w in document 1 or document 2. In this project some necessary removal is applied to the pages first, so the comparison between two pages focuses only on the main content: all advertisements, copyright notices, and sponsor links and information are removed. The task can thus be summarized as finding a similar topic in two different pages. Three pairs of example pages are shown in Figure 3.22 to Figure 3.24. Using the undirected weighted sentence rank algorithm, the highest-ranking sentence is picked, input to the search engine as a query, and the result page is then compared to the original.
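The comparison step can be sketched as a term-frequency cosine similarity. This is one plausible reading of equation 3-2 (TFw as the raw term frequency of word w in each document), not the project's exact code; the tokenizer here is an assumption:

```python
import math
import re
from collections import Counter

def tf_cosine(doc1, doc2):
    """Cosine similarity between two documents, using raw term
    frequencies as weights -- one reading of equation 3-2."""
    tf1 = Counter(re.findall(r"\w+", doc1.lower()))
    tf2 = Counter(re.findall(r"\w+", doc2.lower()))
    dot = sum(tf1[w] * tf2[w] for w in tf1)
    norm1 = math.sqrt(sum(c * c for c in tf1.values()))
    norm2 = math.sqrt(sum(c * c for c in tf2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

a = "missing NFL players lost in the Gulf"
b = "NFL players missing in the Gulf of Mexico"
print(round(tf_cosine(a, b), 4))
```

Two pages about the same story score close to 1.0 even when their wording differs, which is what makes this usable as the content-match criterion.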
Figure 3.22 (a) and (b) is an example proving the validity of the cosine comparison. The post time is marked with a red circle: Figure 3.22 (a) shows no date, only "34 mins ago", while Figure 3.22 (b) shows "Mon Mar 2, 11:57pm ET". In fact, Figure 3.22 (a) was downloaded on the morning of March 2, 2009 and Figure 3.22 (b) at noon the same day. Yahoo news editors evidently keep updating and modifying the same story, so the later version differs somewhat in content, but both pages discuss the same issue.
Figure 3.22 shows the downloaded HTML file images. (a)'s URL is
http://news.yahoo.com/s/ap/20090302/ap_on_re_us/winter_storm
The retrieved URL is
http://news.yahoo.com/s/ap/20090303/ap_on_re_us/winter_storm_43
Comparing the two URLs, it is obvious that even for the same content, Yahoo news changes the URL by appending "_43" at the end.
(a) (b)
Figure 3.22
(a) (b)
Figure 3.23
Figure 3.23 (a) and (b) is an example of finding a web page with content similar to a downloaded local page, Figure 3.23 (a). Both pages report on the missing NFL players in Florida's Gulf, one of the most popular news stories at the time of this experiment.
Figure 3.23 (a)'s URL is
http://news.yahoo.com/s/ap/20090302/ap_on_re_us/missing_boaters_nfl
Figure 3.23 (b)'s URL is
http://www.npr.org/templates/story/story.php?storyId=101375823&ft=1&f=1003
The documents' similarity is 98.38% by equation 3-2.
Figure 3.24 (a) and (b) is another example of finding a similar-content web page from a downloaded local page. Both pages discuss children's blood lead levels.
Figure 3.24 (a)'s URL is:
http://news.yahoo.com/s/ap/20090302/ap_on_bi_go_ec_fi/economy
Figure 3.24 (b)'s URL is:
http://www.ajc.com/services/content/health/stories/2009/03/02/children_lead_level.html?cxtype=rss&cxsvc=7&cxcat=9
The documents' similarity is 94% by equation 3-2.
(a) (b)
Figure 3.24
In Chapter 2, Section 2.1, S. T. Park adopted 5 terms per query. This project adopts a wider range of term numbers per query: LS lengths from 3 to 15 are compared against the success rate. With sentence rank, the first N words of the top-ranked sentence (N from 3 to 15, stop words included) are taken in order as the search query, and the remaining words of the sentence are discarded. This procedure does not follow the traditional practice in text retrieval; however, the experiments in Chapter 5 show even better results when the number of terms exceeds 10, compared with the same number of terms selected in the traditional way.
At this point it is necessary to clarify why LS extraction cannot be applied directly to raw web pages downloaded in full without any parsing. First, unlike pure-text information retrieval, web pages have a distinctive feature, HTML tags, which build the page template, set the font format and size, insert images, and supply the other components of an attractive appearance. These good-looking gadgets, however, become sources of distraction and interference when an application tries to analyze the page, because in common sense only the displayed text of a web page is useful. How to convert an HTML page to pure text by removing all kinds of hidden tags is therefore a key issue for the following steps and determines the final results. All the text in the page must be extracted first; meanwhile, the tag information behind the text cannot simply be discarded. For example, in Michal's research, the text is classified and saved into 6 different categories, each with its own weight. Second, link information is also a powerful hint for identifying the distinctive features of a particular web page: commercial search engines depend heavily on algorithms such as PageRank and authorities-and-hubs, and even for searching and retrieval studies of academic papers, citation ranking is widely accepted. Unlike academic papers, however, which list their citations at the end in a references chapter, a web page hides its link information in anchor tags, which makes the data-source preprocessing before LS extraction more complicated. Constructing a query by combining extracted link information, such as the domain the page belongs to, with LS could be another study, but it is not included in this report.
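The kind of HTML-to-text conversion described above can be sketched with Python's standard HTMLParser. This is a minimal illustration; a real extractor, like the 6-category weighting in Michal's research, is considerably more involved:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect displayed text, skipping <script> and <style> bodies,
    whose contents would otherwise pollute the term statistics."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # inside how many script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

page = ("<html><body><script>var x = 1;</script>"
        "<table><tr><td>Storm closes airports</td></tr></table></body></html>")
print(html_to_text(page))  # tags and script code are gone
```

Note how the 'td' and 'tr' tags and the JavaScript variable never reach the output, which is exactly the distraction problem raised above.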