The WWW has shown unprecedented expansion and diversity for over a decade, ushering in an “information technology age”. Accordingly, the question of how to retrieve web documents efficiently and effectively according to specific needs has driven huge developments in web search technology. Indeed, searching is the second most popular activity on the web, behind e-mail, and about 550 million web searches are performed every day [1]. Google and Yahoo now lead the web search industry with the most advanced technology: given a few words, phrases, sentences, names, addresses, e-mail addresses, telephone numbers or dates, the related web pages can be retrieved efficiently, which has significantly changed the traditional ways of obtaining information.
People are becoming proficient at web information retrieval even though most of them have no information retrieval theory in mind, and they still struggle to find pages close to what they actually want. One key challenge concerns the input query: generally speaking, how to obtain better results by choosing suitable key words/phrases as the search query in the first place. Another concerns the order of words in the query: the same words in a different order can lead to quite different result URLs. Normally the word appearing first carries more weight than the others, so the results do not attempt to match the separate query words equally. Beyond this primitive query form, many search engines also provide query grammars, such as double quotation marks and the logical ‘and’, ‘or’ and ‘not’ operators, which ensure that equal weight is placed on the words these operators connect. Meanwhile, given the same key words as a search query, different search engines return different results, and measuring how ‘good’ and how ‘comprehensive’ those results are has drawn attention in related web research; since result quality is judged by users, it is not completely objective. The measurements must therefore be designed to be as objective and persuasive as possible.
This leads to the initiative of this project: an appropriate test system is built on top of web search engines by analyzing their results. The analysis focuses on two aspects, URL quality and content quality. URL quality can only be tested by comparing the user’s target URL with the result URLs returned by the search engine, and good URL quality is determined solely by whether the two URLs match. Content quality is looser and less restricted: it compares the target URL’s content with a result URL’s content, and good content quality means high similarity between the two. Critically, before these two methods can be applied, the main task of this project is to obtain result URLs from the search engine and maximize the quality measurements, which is equivalent to finding a query that best summarizes the page itself. In the following report, SE stands for Search Engine and LS stands for Lexical Signature.
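The two quality measures can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the function names are hypothetical, the URL comparison is a simple normalized exact match, and cosine similarity over term-frequency vectors is assumed as one plausible choice of content-similarity measure.

```python
import re
from collections import Counter
from math import sqrt

def url_quality(target_url, result_urls):
    """Binary URL quality: 1 if the target URL appears among the result URLs."""
    norm = lambda u: u.rstrip('/').lower()          # crude normalization (assumption)
    return int(norm(target_url) in {norm(u) for u in result_urls})

def content_quality(target_text, result_text):
    """Content quality as cosine similarity between term-frequency vectors."""
    tf = lambda text: Counter(re.findall(r'[a-z0-9]+', text.lower()))
    a, b = tf(target_text), tf(result_text)
    dot = sum(a[t] * b[t] for t in a)
    denom = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / denom if denom else 0.0            # 0.0 when either text is empty
```

Under this sketch, URL quality is all-or-nothing, while content quality degrades gracefully as the two pages share fewer terms, matching the looser notion described above.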
There are various ways in information retrieval theory to extract key terms from plain text, and they can be applied to web pages. In this project, taking key terms from a web page as a search query to an SE and then comparing the SE’s results with the original page is treated as a way to measure the SE. Developing such a measurement, which needs to be convincing and reliable, is the key part of this project, and it is equivalent to studying how to re-find/re-locate a given HTML page’s URL. Because a valid web page online has a URL associated with it, there is a chance that the URL can be located by the SE after key terms extracted from its text content are formed into a query. This does not always succeed, because an SE also takes link information into account in its ranking, for example through the PageRank algorithm. Nevertheless, a high success rate in re-finding/re-locating the target URL can be achieved by focusing on the page itself alone, disregarding the global link structure. This process of re-finding/re-locating a uniquely designated URL excludes subjective interference and offers a good practice for SE measurement.
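A minimal sketch of this idea is shown below, assuming raw term frequency as the term ordering; the tiny stop-word list, the choice of five terms, and the class and function names are all illustrative assumptions, and the project also considers document-frequency and graph-based orderings.

```python
import re
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text of a page, skipping script and style content."""
    def __init__(self):
        super().__init__()
        self.chunks, self._skip = [], False
    def handle_starttag(self, tag, attrs):
        self._skip = tag in ('script', 'style')
    def handle_endtag(self, tag):
        self._skip = False
    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'is', 'for'}  # illustrative

def lexical_signature(html, k=5):
    """Top-k most frequent content terms of a page, usable as an SE query."""
    parser = TextExtractor()
    parser.feed(html)
    terms = re.findall(r'[a-z]+', ' '.join(parser.chunks).lower())
    tf = Counter(t for t in terms if t not in STOPWORDS and len(t) > 2)
    return [t for t, _ in tf.most_common(k)]
```

In the re-finding experiment, the returned terms would be joined into a query string, sent to the SE, and the result URLs checked against the page’s own URL.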
In this report, the related research and experiments are practiced and tested: removing structural HTML tags, extracting text from HTML, downloading the document frequency of each term from Google, and counting term frequencies; queries are then constructed, and the URL results returned by the search engine under the various methods are compared. Chapter 2 lists related work and studies on text processing methods for a single web page/document. Chapter 3 introduces the data set, search engine selection, HTML parsing and text extraction, the ordering of terms by term frequency, document frequency and graph-based rank, query sending, result page comparison and the evaluation of successful retrievals. Chapter 4 describes the detailed experimental setup, the algorithms for the different term orderings and the URL comparisons introduced in Chapter 3. All related results are recorded and shown in histograms, followed by comparisons and analysis of the differences. Chapter 5 presents the conclusion, limitations and some potential improvements to the theories and experiments in this project.