Before reviewing the related studies and works, the term "Lexical Signature" (LS) needs to be introduced. In Chapter 1, LS was treated simply as an equivalent of "key words/terms/phrases", but the related works have given LS various descriptions. Thomas A. Phelps and Robert Wilensky defined it as a relatively small set of terms that can effectively discriminate a given document from all the others in a large collection [2]. They also proposed a way to create an LS that meets this criterion: select the few terms of the document with the highest "term frequency-inverse document frequency" (TF-IDF) values [2].
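As an illustration of this construction, the following Java sketch scores every term of a document by TF-IDF against a small collection and keeps the top-k terms as the LS. The class and method names are purely illustrative and not taken from [2]; the raw-count TF and log(N/df) IDF used here are common conventions and may differ in detail from the weighting used by Phelps and Wilensky.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TfIdfSignature {

    // Score each term of docTerms by TF-IDF against the collection and
    // return the k highest-scoring terms as the lexical signature.
    public static List<String> lexicalSignature(List<String> docTerms,
                                                List<List<String>> collection,
                                                int k) {
        // Term frequency inside the target document.
        Map<String, Integer> tf = new HashMap<>();
        for (String t : docTerms) tf.merge(t, 1, Integer::sum);

        // Document frequency across the collection.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : collection) {
            for (String t : new HashSet<>(doc)) df.merge(t, 1, Integer::sum);
        }

        int n = collection.size();
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            // Terms never seen in the collection are treated as df = 1 (an assumption of this sketch).
            int d = df.getOrDefault(e.getKey(), 1);
            double tfidf = e.getValue() * Math.log((double) n / d);
            scored.add(Map.entry(e.getKey(), tfidf));
        }
        scored.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));

        List<String> signature = new ArrayList<>();
        for (int i = 0; i < Math.min(k, scored.size()); i++) {
            signature.add(scored.get(i).getKey());
        }
        return signature;
    }
}

In the LS literature the signature is typically a handful of terms, so k would be small, for example around five.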
Martin Klein and Michael L. Nelson described the LS as a small set of terms derived from a document that captures the "aboutness" of that document [3]. S. T. Park studied and analyzed Phelps and Wilensky's theory and, drawing on their paper, concluded that LSs have the following characteristics [4][5]:
(1) LSs should extract the desired document and
only that document [5].
(2) LSs should be robust enough to find
documents that have been slightly modified [5].
(3) New LSs should have minimal overlap with
existing LSs [5].
(4) LSs should have minimal search engine
dependency [5].
S. T. Park also raised his own perspective on LSs, aimed at helping the user find similar or relevant documents:
(1) LSs should easily extract the desired document. When a search engine returns more than one document, the desired document should be among the top-ranked results [5].
(2) LSs should be useful enough to find
relevant information when the precise documents being searched for are lost [5].
Overall, S. T. Park's studies on LSs are very insightful and helpful for this project. Typing "Lexical Signature" as a search query into Google, the first 10 results are most likely to include both of his papers, "Analysis of lexical signatures for finding lost or related documents" [4] and "Analysis of lexical signatures for improving information persistence on the WWW" [5].
S. T. Park conducted a large number of experiments with TF, DF, TFIDF, PW, TF3DF2, TF4DF1, TFIDF3DF2 and TFIDF4DF1, applied separately and in combination [4][5], and compared the results from Yahoo, MSN and AltaVista in histograms covering unique results, first-ranked results and top-10 results [5]. When a re-finding attempt counts as successful either because the two URLs match or because the cosine similarity of the two documents exceeds 0.95, the success rate lies between 60% and 70%. If only a URL match were accepted as success, the re-finding/re-locating rate would probably be lower.
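The cosine part of this success criterion can be made concrete with a short sketch. Assuming the two documents are represented as simple term-frequency vectors (the exact weighting used in [5] may differ), the following Java fragment computes their cosine similarity and applies the 0.95 threshold; the class and method names are illustrative only.

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CosineCheck {

    // Build a term-frequency vector from a list of terms.
    static Map<String, Integer> tf(List<String> terms) {
        Map<String, Integer> v = new HashMap<>();
        for (String t : terms) v.merge(t, 1, Integer::sum);
        return v;
    }

    // Cosine of the angle between two term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> vocab = new HashSet<>(a.keySet());
        vocab.addAll(b.keySet());
        double dot = 0, na = 0, nb = 0;
        for (String t : vocab) {
            double x = a.getOrDefault(t, 0);
            double y = b.getOrDefault(t, 0);
            dot += x * y;
            na += x * x;
            nb += y * y;
        }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // A retrieved document counts as a successful re-find of the original
    // when the cosine similarity exceeds 0.95, as in the criterion above.
    static boolean isSuccessfulRefind(List<String> original, List<String> retrieved) {
        return cosine(tf(original), tf(retrieved)) > 0.95;
    }
}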
Figure 2.1, Figure 2.2 and Figure 2.3: re-finding results reported in [5].
In this project, the definition of LS follows S. T. Park's theory: LSs are the key terms of a web page that both identify the page uniquely among others and allow search engines to retrieve the most relevant page effectively. In the experiments, however, an LS cannot simply be taken as unchanged terms (words) from the documents. Some necessary preprocessing and transformations must be applied before the web pages/documents are processed in the information-retrieval sense, such as removing stop words and conflating words with different forms but close meanings into a single term, for example reducing "lexica" and "lexical" to a common stem such as "lex". In addition, selecting only nouns and verbs, or nouns and adjectives, from the text is also feasible with the help of a word-form database. These steps are implemented in Chapter 4 using Lucene and WordNet, two open-source Java projects that are well established in industry.
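As a rough indication of how this preprocessing can look in practice, the following Java sketch uses Lucene's EnglishAnalyzer, which tokenizes, lower-cases, removes English stop words and applies Porter stemming so that inflected forms are conflated to a common stem. It is only a sketch under the assumption that a recent Lucene release is on the classpath; the exact analyzer chain and the WordNet-based part-of-speech filtering used in Chapter 4 are not shown here.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class Preprocess {

    // Tokenize, lower-case, drop English stop words and stem the input text.
    public static List<String> normalize(String text) throws IOException {
        List<String> terms = new ArrayList<>();
        // The field name "body" is arbitrary; it only labels the token stream.
        try (Analyzer analyzer = new EnglishAnalyzer();
             TokenStream stream = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                terms.add(term.toString());
            }
            stream.end();
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        // Inflected forms such as "lexical" are reduced to a common stem by the stemmer.
        System.out.println(normalize("Lexical signatures identify web pages uniquely."));
    }
}

Part-of-speech filtering with WordNet (keeping only nouns and verbs, or nouns and adjectives) would then be applied to the resulting terms as a separate step, as discussed in Chapter 4.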