The project is designed to find the best query (the LS query) that can represent a web page and be easily found by a search engine. The web pages are taken from the surface web, which can be accessed without any authorization restrictions. The experiments are arranged in the following general steps:
(1) Find a large number of web pages to be used for LS extraction and re-finding/relocation, download them to local disk as test pages, and save their URLs at the same time. The source of these web pages requires careful consideration: of the many kinds of web sites online today, some are ill-formatted or carry little information, and such sites are clearly not ideal sources for the experiments. The details of web page selection and crawling are non-trivial and will be introduced specifically in section 3.3.1.
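A minimal sketch of this step in Python follows; the file names, output directory, and the use of an MD5 hash as a local file name are illustrative assumptions, not the project's actual code.

```python
import hashlib
import os
import urllib.request

URL_LIST = "urls.txt"    # assumed input file: one test-page URL per line
OUT_DIR = "test_pages"   # assumed local directory for downloaded pages

os.makedirs(OUT_DIR, exist_ok=True)

with open(URL_LIST) as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    try:
        html = urllib.request.urlopen(url, timeout=10).read()
    except OSError:
        continue  # skip pages that cannot be reached
    # Hash the URL into a stable local file name, and keep the URL
    # itself alongside the page for the later re-finding step.
    name = hashlib.md5(url.encode("utf-8")).hexdigest()
    with open(os.path.join(OUT_DIR, name + ".html"), "wb") as out:
        out.write(html)
    with open(os.path.join(OUT_DIR, name + ".url"), "w") as out:
        out.write(url)
```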
(2) HTML parsing and text extraction are needed before extracting LSs. The reasons for this preprocessing and its steps are given in section 3.3.2.
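A minimal sketch of such preprocessing, assuming the BeautifulSoup library (the parser actually used is described in section 3.3.2; this library choice is an assumption for illustration):

```python
from bs4 import BeautifulSoup

def extract_text(html: str) -> str:
    """Strip tags, scripts, and styles; return plain text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # drop non-content elements
    # Collapse whitespace so term counting sees clean tokens
    return " ".join(soup.get_text(separator=" ").split())
```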
(3) Apply the algorithms in chapter 2, which are designed to extract lexical signatures from the downloaded pages. In chapter 4, TF, DF, TFIDF, TF3DF2, TF4DF1, TFIDF3DF2, and TFIDF4DF1, as well as word rank and sentence rank, will be applied to the pages produced by step 2. Unlike S. T. Park's experiments, this project accepts a varying number of terms per query; hybrid methods such as TF3DF2 and TFIDF3DF2 therefore scale their term counts with the query length, while the ratio between DF-ordered terms and TF- or TFIDF-ordered terms is kept unchanged. For convenience, the selection methods based on TF, DF, and their mixed forms are listed from (a) to (h) below, as precisely described in S. T. Park's paper; a sketch of the selection logic follows the list. Martin and Michael studied how many terms per query help to obtain better search results, and they claim that 5 terms per query is the most efficient number for getting the desired result to appear in a search engine's top 10 [3]. More terms per query is also clearly more practical for sentence rank. Meanwhile, Google raised its web search limit to 32 words back in 2005. In this project, up to 15 words are allowed per query, and succeeding words are ignored. The performance achieved with different numbers of terms per query is an interesting topic and will be explored in chapter 4.
a) TF: Select terms in decreasing term frequency (TF) order. If there is a tie, pick terms based on increasing document frequency (DF). If there is another tie, randomly select the terms [5].
b) DF: Select terms in increasing DF order. If there is a tie, pick terms based on decreasing TF order. If there is another tie, randomly select the terms [5].
c) TFIDF: Select terms in decreasing term-frequency inverse-document-frequency (TFIDF) order. If there is a tie, pick terms based on increasing DF order. If there is another tie, randomly select the terms [5].
d) PW: Select terms based on Phelps and Wilensky's method [3]: first list terms in TF order, then pick the LS in decreasing TFIDF order (i.e., decreasing TFIDF order where the TF value is capped at five). If there is a tie, pick terms based on increasing DF order. If there is another tie, randomly select the terms [5].
e) TF3DF2: Select two terms in increasing DF order. If there is a tie, pick terms based on decreasing TF order. If there is another tie, randomly select the terms. Then filter out all terms which have a DF value of 1. Select three terms maximizing TF. If there is a tie, it is resolved the same way as in the TF method [5].
f) TF4DF1: Select one term based on increasing DF order first. Then filter out all terms which have a DF value of 1. Select four terms maximizing TF [5].
g) TFIDF3DF2: Select two terms based on increasing DF order first. Then filter out all terms which have a DF value of 1. Select three terms maximizing TFIDF [5].
h) TFIDF4DF1: Select one term based on increasing DF order first. Then filter out all terms which have a DF value of 1. Select four terms maximizing TFIDF [5].
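To make the flavor of these rules concrete, the following Python sketch implements the plain TF method and the shared pattern behind the hybrid methods (e) to (h). The function names are illustrative assumptions, and the final random tie-breaks are omitted for brevity.

```python
def select_tf(tf: dict, df: dict, k: int) -> list:
    """Method (a): decreasing TF, ties broken by increasing DF."""
    return sorted(tf, key=lambda t: (-tf[t], df.get(t, 0)))[:k]

def select_hybrid(tf: dict, df: dict, score: dict, n_df: int, n_score: int) -> list:
    """Methods (e)-(h): take n_df terms by increasing DF (ties broken by
    decreasing TF), filter out terms with DF == 1, then take n_score
    terms maximizing `score` (TF or TFIDF)."""
    by_df = sorted(tf, key=lambda t: (df.get(t, 0), -tf[t]))
    picked = by_df[:n_df]
    rest = [t for t in by_df[n_df:] if df.get(t, 0) != 1]
    return picked + sorted(rest, key=lambda t: -score[t])[:n_score]

# Example: TF3DF2 for a 5-term query is select_hybrid(tf, df, tf, 2, 3);
# TFIDF4DF1 is select_hybrid(tf, df, tfidf, 1, 4).
```

With varying query lengths, only n_df and n_score change while their ratio stays fixed, matching the convention described above.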
(4) Select several search engines that provide a developer interface. Google web search and news search and Yahoo web search and news search are picked as the general search engines in the test. More details are given in section 3.2.
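Schematically, querying such an engine from code looks like the fragment below; the endpoint URL, parameters, and response schema are placeholders, since each engine's real developer API requires registered keys and defines its own formats.

```python
import json
import urllib.parse
import urllib.request

SEARCH_ENDPOINT = "https://api.example.com/search"  # hypothetical endpoint

def top10_urls(query: str) -> list:
    """Send an LS query and return the first page (top 10) of result URLs."""
    params = urllib.parse.urlencode({"q": query, "count": 10})
    with urllib.request.urlopen(f"{SEARCH_ENDPOINT}?{params}", timeout=10) as resp:
        data = json.load(resp)
    return [hit["url"] for hit in data.get("results", [])]  # hypothetical schema
```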
(5) Download the result links on the first page returned by the search engine, together with their corresponding HTML pages. If the test URL matches a result URL, a successful retrieval is recorded. If there is no match between the test URL and the first 10 result URLs, the test page is compared with the result pages from the search engine to see whether they share the same topic or content. This resembles step 2, since both require parsing the HTML pages and extracting understandable plain text free of noise from HTML tags. The criterion for deciding whether two pages are on the same topic or have similar content is another important issue in this project; it was introduced at the beginning of section 2.5, and the details will be discussed in section 3.4.
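One common way to realize such a comparison is cosine similarity over term-frequency vectors; this is only an illustrative stand-in, since the project's actual criterion is the one defined in sections 2.5 and 3.4.

```python
import math
from collections import Counter

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the term-frequency vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

A result page might then be judged to share the test page's topic when the similarity exceeds a chosen threshold.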
(6) Repeat steps 2 to 5 on all test pages downloaded in step 1, count all successful retrievals, and compute the success retrieval rate defined in Equation 3.1. This is the most straightforward measurement, and it is also adopted in both S. T. Park's and Xiaojun Wang's research. Martin and Michael's LS score is also a good way to evaluate query generation; however, their score is not as straightforward as Equation 3.1.

success retrieval rate = (number of successful retrievals) / (total number of test pages)    (3.1)
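Putting the pieces together, the whole loop of steps 2 to 6 reduces to the sketch below, reusing the fragments above; make_ls() and fetch() are hypothetical placeholders for the LS extraction of step 3 and for page download, and the 0.8 threshold is illustrative.

```python
def run_experiment(pages: list) -> float:
    """Compute the success retrieval rate of Equation 3.1 over all test pages."""
    successes = 0
    for url, html in pages:                # (test URL, downloaded HTML) pairs
        text = extract_text(html)          # step 2
        query = " ".join(make_ls(text))    # step 3 (hypothetical helper)
        results = top10_urls(query)        # steps 4-5
        same_topic = any(
            cosine_similarity(text, extract_text(fetch(r))) > 0.8
            for r in results               # fetch() is a hypothetical downloader
        )
        if url in results or same_topic:
            successes += 1
    return successes / len(pages)          # Equation 3.1
```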
(7) As the ultimate requirement, a stable and reliable LS extraction algorithm should raise the success retrieval rate above 90%. Traditional methods that do not preserve linguistic meaning, such as TF, show quite different results from the graph-based ranking algorithms that do preserve it. For example, after step 3 all terms can be listed in alphabetical order and the query's terms picked from a to z. In chapter 4, the results from Google and Yahoo show that both search engines are much better at returning the desired pages for a meaningful query than for discrete terms simply ordered by TF, IDF, and so on, even when some stop words are included. The results and analysis are presented in chapter 4.
[Figure 3.1: flow chart of the overall experimental procedure]
Figure 3.1 is the flow chart showing the overall experimental procedure from step 1 to step 7. The URL in the top-left rectangle of Figure 3.1 represents a directory built from a large number of valid URLs that can be accessed online; the corresponding pages are downloaded and saved to local disk.
The algorithms for extracting lexical signatures from a page appear in the three rectangles at the top right of Figure 3.1; they are the core of this project. Another key programming issue is represented by the two rectangles at the bottom of Figure 3.1: page content comparison and result recording. The middle rectangles on the left and right of Figure 3.1 form a pipeline showing all the operations.