If there is a URL
match or a content match, the retrieval is counted as successful. If the URL
does not match, which is common because URLs change all the time [2][3], the
original page must be compared with the retrieved pages. This comparison can be
done in two ways: manually or automatically. Manually checking all the content
of the original page against the retrieved pages is time-consuming, but it
guarantees precise results. In this project, we picked around 200 pages from
the data source for manual checking. Automatic comparison between the result
pages from the search engine and each test page, rather than working by brute
force, also requires the HTML page preprocessing described in step 2.
[Equation 3-1]
[Equation 3-2]
In 3-2, TFw is
the term frequency of word w in document 1 or document 2. In this project, some
necessary removals are applied to the pages, so the comparison between two pages
focuses only on the main content: all advertisements, copyright information, and
sponsors’ links and information are removed. The comparison can therefore be
summarized as finding a similar topic in two different pages. Three pairs of
example pages are shown in Figure3.22 to Figure3.24. Using the undirected
weighted sentence rank algorithm, the highest-ranking sentence is picked, submitted
as a query to the search engine (SE), and the result page is then compared with
the test page.
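As an illustration of the comparison in 3-2, the following Python sketch computes a cosine similarity over raw term frequencies. It assumes that 3-2 is the standard cosine formula applied to TFw values, which the text suggests but does not show. The tokenizer and the names term_frequencies and cosine_similarity are illustrative only and are not the project's implementation.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count how often each word occurs (assumed, deliberately simple tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text1, text2):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    tf1 = term_frequencies(text1)
    tf2 = term_frequencies(text2)
    # Only words present in both documents contribute to the dot product.
    dot = sum(tf1[w] * tf2[w] for w in tf1.keys() & tf2.keys())
    norm1 = math.sqrt(sum(c * c for c in tf1.values()))
    norm2 = math.sqrt(sum(c * c for c in tf2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty document is treated as sharing nothing
    return dot / (norm1 * norm2)
```

The inputs are expected to be the main content of each page after the removals described above, so that advertisements and sponsor links do not inflate or deflate the score.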
Figure3.22 (a)
and (b) is an example that demonstrates the validity of the
cosine comparison. The post time is shown in the red circle. Figure3.22 (a) does not show the date but “34 mins ago”, whereas Figure3.22 (b)
shows “Mon Mar2, 11:57pm ET”. Figure3.22 (a)
was downloaded in the morning on March 2, 2009 and Figure3.22 (b)
was downloaded at noon on the same day. Apparently, Yahoo News editors keep
updating and modifying the same story, so the later version differs somewhat
in content, but both versions cover the same issue.
Figure3.22 shows
images of the downloaded HTML files. The URL of (a) is
http://news.yahoo.com/s/ap/20090302/ap_on_re_us/winter_storm
The retrieved URL
is http://news.yahoo.com/s/ap/20090303/ap_on_re_us/winter_storm_43
Comparing the
two URLs, it is obvious that even for the same content, Yahoo News
changes the URL: here the date segment changes from 20090302 to 20090303 and
“_43” is appended at the end.
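The difference between the two URLs can be made explicit by comparing their path segments. The following is a minimal sketch using Python's standard urllib.parse module. The helper name differing_path_segments is hypothetical and is not part of the project.

```python
from itertools import zip_longest
from urllib.parse import urlsplit


def differing_path_segments(url_a, url_b):
    """Return the (segment_a, segment_b) pairs where the two URL paths differ."""
    segments_a = urlsplit(url_a).path.strip("/").split("/")
    segments_b = urlsplit(url_b).path.strip("/").split("/")
    # zip_longest pads the shorter path with "" so extra segments are reported too.
    return [(a, b)
            for a, b in zip_longest(segments_a, segments_b, fillvalue="")
            if a != b]


original = "http://news.yahoo.com/s/ap/20090302/ap_on_re_us/winter_storm"
retrieved = "http://news.yahoo.com/s/ap/20090303/ap_on_re_us/winter_storm_43"
print(differing_path_segments(original, retrieved))
# [('20090302', '20090303'), ('winter_storm', 'winter_storm_43')]
```

An exact string comparison of the two URLs fails, which is why a content comparison is needed in this case.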
(a) (b)
Figure3.22
(a) (b)
Figure3.23
Figure3.23 (a) and (b) is an example of finding a web page with similar content,
starting from the downloaded local page in Figure3.23 (a). Both pages
are about the missing NFL player in Florida’s Gulf, which was one of the most popular
news stories at the time of this experiment.
Figure3.23 (a)’s URL is
http://news.yahoo.com/s/ap/20090302/ap_on_re_us/missing_boaters_nfl
Figure3.23 (b)’s URL is
http://www.npr.org/templates/story/story.php?storyId=101375823&ft=1&f=1003
The documents’ similarity is 98.38% by 3-2.
Figure3.24 (a)
and (b) is another example of
finding a web page with similar content, starting from a downloaded local page. Both
pages are about children’s blood lead levels.
Figure3.24 (a)’s URL is
http://news.yahoo.com/s/ap/20090302/ap_on_bi_go_ec_fi/economy
Figure3.24 (b)’s URL is
http://www.ajc.com/services/content/health/stories/2009/03/02/children_lead_level.html?cxtype=rss&cxsvc=7&cxcat=9
The documents’ similarity is 94% by 3-2.
(a) (b)
Figure3.24