In the previous
sections, all the methods of extracting LSs disregard the natural language’s
consistency in the web pages by only considering the discrete terms. The
document’s terms are arranged in alphabetic order before applying TF or DF, which
totally destroys the linguistic information in web pages. Take an example from Thomas
A. Phelps and Robert Wilensk’s paper [3],
http://www.cs.berkeley.edu/˜wilensky/NLP.html cannot be located using this query “texttiling
wilensky disambiguation subtopic iago” in Google at current time, as shown in Figure2.8, Google only return 4 results with highly
related content but none of them have the same URL as required, however it was
claimed as successful in January, 2008. Meanwhile it can be returned by Yahoo
search with the same query but a different address like this in Figure2.9:
http://www.eecs.berkeley.edu/Faculty/Homepages/wilensky.html/NLP.html?lexical-signature=texttiling+wilensky+disambiguation+subtopic+iago
This shows the
fact that 2 different URLs can open the same page. About different URLs binding
on the same web page, it is not discussed in this project. It is supposed to be
a successful retrieval if document similarity is taken as a measurement,
however, the URL matching measurement will clearly put it into a false
retrieval.
Figure2.8
Figure2.9
From Figure2.8 and Figure2.9, we can see the un-stable performance from
traditional LS generation techniques even when they were as typical samples as in
papers years before. The studies on URL and its page content change have been
researched before, such as Martin Klein and Michael L. Nelson studied the pages
ranging from 1996 to 2007 [3], but they are not included in this
project. One single page bound by different URL actually happens quite often,
taking Binghamton
University’s home page as
an example, http://www.binghamton.edu/index.html
and http://www2.binghamton.edu actually
connect to the same page. In chapter 3, section 3.4 shows typical examples in
Yahoo news pages that Yahoo changes the same news page’s URL all the time.
One approach of solving
such kind of difficulty, which mainly focuses on finding relative or similar
web pages rather than locating web pages only by URLs matching, is referenced
by automatically generating key words and summarizations for academic papers. They
are introduced as having capabilities by reserving both the underlying language
information and relatively stable performance, because there are fewer chances to
have two documents with the same content but different URLs, and even this
happens, Martin and Michael studied the graph-based algorithms and concluded
that they actually had the ability to improve the relative/similar web page
re-finding/re-location when the original copy is lost [3].
Graph-based
ranking algorithm is a way of deciding the importance of a vertex within a graph,
by taking all text as global information and recursively computing from the
entire graph rather than relying only on local vertex-specific information [9][10].The
basic idea implemented by a graph-based ranking model is a vertex can receive
and cast “voting” or “recommendation” to the others [12][13]. When
one vertex links to another one, it is basically casting a vote to the other in
the graph. The higher the number of votes is received by a vertex, the higher
the importance of the vertex is taken [9]. Figure2.10 (a) is an example showing how it works
when a vertex casts all its weight to the other vertices. Figure2.10 (b) is an example a vertex casts 90% of
its weight to the others while it keeps 10%. Page-Rank is a typical
implementation of this graph-based ranking algorithm. The score of a vertex Vi
is defined as:
[11] 2-5
Where d is a
damping factor that can be set between 0 and 1. In 2-5, j is the vertex points to i and Out(Vj)
is the score delivered from j to i.
(a) (b)
Figure2.10
Meanwhile, graph-based
ranking algorithm can also be split into 2 groups as Figure2.11 shows the combinations: weighted (on
edges) graph and un-weighted (on edges) graph, undirected-graph
and directed-graph. One group’s condition can be combined with the other group’s
condition.
Figure2.11
Because undirected
un-weighted (on both edges and vertices) graph does not have actual meaning in
this project, it is not discussed. Figure2.10 (a) and (b) are 2 examples of directed graph,
with weights on vertex, but without value on the edges. Figure2.12 is an example of undirected weighted (on
edges) graph. It is the case that assuming out-degree of a vertex is equal to
the in-degree of the vertex, take undirected edges like bi-directions edges [9][10]..
The weight from i to j and from j to i are same. The weight’s recursive
computation formula:
[9] 2-6
and show Vi, Vj
and Vk are connected but cannot show any direction among Vi,
Vj and Vk. Figure2.13 is an example of directed weighted graph.
It is the case that a weight is added according to the direction from one
vertex to another. The weight from i to j is wij, but the weight
from j to i is 0.
Figure2.12
Figure2.13
[9] 2-7
Compared to the
un-directed weighted graph’s formula 2-6, and in formula 2-7 show the direction between Vj,
Vi and Vk, Vj.