The WWW has shown unprecedented expansion and diversity for over a decade, ushering in an “information technology age”. Accordingly, the question of how to retrieve web documents efficiently and effectively according to specific needs has driven huge developments in web search technology. Indeed, searching is the second most popular activity on the web, behind e-mail, and about 550 million web searches are performed every day [1]. Google and Yahoo now lead the web search industry with the most advanced technology: given a few words, phrases, sentences, names, addresses, e-mail addresses, telephone numbers or dates, the related web pages can be retrieved efficiently, which has significantly changed the traditional ways of obtaining information.
People are becoming proficient at web information retrieval even though most of them have no information retrieval theory in mind, and they still struggle to find pages close to what they actually want. One key challenge concerns the input query: generally speaking, how to obtain better results by choosing suitable key words/phrases as the search query in the first place. Another concerns the order of words in the query: the same words in a different order can lead to quite different result URLs. Normally the word appearing first carries more weight than the others, so the results do not attempt to match the separate query words equally. Beyond this primitive query form, many search engines also provide query grammars, such as double quotation marks and the logical ‘and’, ‘or’ and ‘not’ operators, which ensure that equal weight is placed on the words these operators connect. Meanwhile, given the same key words as a search query, different search engines return different results, and measuring how ‘good’ and how ‘comprehensive’ those results are has drawn attention in related web research; since result quality is judged by users, it is not completely objective. The measurements must therefore be designed to be as objective and persuasive as possible.
This leads to the initiative of this project: an appropriate test system is built on top of web search engines by analyzing their results. The analysis focuses on two aspects, URL quality and content quality. URL quality can only be tested by comparing the user’s target URL with the result URLs returned by the search engine, and good URL quality is determined solely by whether the two URLs match. Content quality is looser and less restricted: it compares the target URL’s content with a result URL’s content, and good content quality means high similarity between the two. Critically, before these two methods can be applied, the main task of this project is to obtain result URLs from the search engine and maximize the quality measurements, which is equivalent to finding a query that best summarizes the page itself. In the following report, SE stands for Search Engine and LS stands for Lexical Signature.
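The two quality measures can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the function names are hypothetical, the URL comparison is a simple normalized exact match, and cosine similarity over term-frequency vectors is assumed as one plausible choice of content-similarity measure.

```python
import re
from collections import Counter
from math import sqrt

def url_quality(target_url, result_urls):
    """Binary URL quality: 1 if the target URL appears among the result URLs."""
    norm = lambda u: u.rstrip('/').lower()          # crude normalization (assumption)
    return int(norm(target_url) in {norm(u) for u in result_urls})

def content_quality(target_text, result_text):
    """Content quality as cosine similarity between term-frequency vectors."""
    tf = lambda text: Counter(re.findall(r'[a-z0-9]+', text.lower()))
    a, b = tf(target_text), tf(result_text)
    dot = sum(a[t] * b[t] for t in a)
    denom = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / denom if denom else 0.0            # 0.0 when either text is empty
```

Under this sketch, URL quality is all-or-nothing, while content quality degrades gracefully as the two pages share fewer terms, matching the looser notion described above.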
There are various ways in information retrieval theory to extract key terms from plain text, and they can be applied to web pages. In this project, taking key terms from a web page as a search query to an SE and then comparing the SE’s results with the original page is treated as a way to measure the SE. Developing such a measurement, which needs to be convincing and reliable, is the key part of this project, and it is equivalent to studying how to re-find/re-locate a given HTML page’s URL. Because a valid web page online has a URL associated with it, there is a chance that the URL can be located by the SE after key terms extracted from its text content are formed into a query. This does not always succeed, because an SE also takes link information into account in its ranking, for example through the PageRank algorithm. Nevertheless, a high success rate in re-finding/re-locating the target URL can be achieved by focusing on the page itself alone, disregarding the global link structure. This process of re-finding/re-locating a uniquely designated URL excludes subjective interference and offers a good practice for SE measurement.
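A minimal sketch of this idea is shown below, assuming raw term frequency as the term ordering; the tiny stop-word list, the choice of five terms, and the class and function names are all illustrative assumptions, and the project also considers document-frequency and graph-based orderings.

```python
import re
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text of a page, skipping script and style content."""
    def __init__(self):
        super().__init__()
        self.chunks, self._skip = [], False
    def handle_starttag(self, tag, attrs):
        self._skip = tag in ('script', 'style')
    def handle_endtag(self, tag):
        self._skip = False
    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

STOPWORDS = {'the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'is', 'for'}  # illustrative

def lexical_signature(html, k=5):
    """Top-k most frequent content terms of a page, usable as an SE query."""
    parser = TextExtractor()
    parser.feed(html)
    terms = re.findall(r'[a-z]+', ' '.join(parser.chunks).lower())
    tf = Counter(t for t in terms if t not in STOPWORDS and len(t) > 2)
    return [t for t, _ in tf.most_common(k)]
```

In the re-finding experiment, the returned terms would be joined into a query string, sent to the SE, and the result URLs checked against the page’s own URL.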
In this report, the related research and experiments are practiced and tested: removing structural HTML tags, extracting text from HTML, downloading the document frequency of each term from Google, and counting term frequencies; queries are then constructed, and the URL results returned by the search engine under the various methods are compared. Chapter 2 lists related work and studies on text processing methods for a single web page/document. Chapter 3 introduces the data set, search engine selection, HTML parsing and text extraction, the ordering of terms by term frequency, document frequency and graph-based rank, query sending, result page comparison and the evaluation of successful retrievals. Chapter 4 describes the detailed experimental setup, the algorithms for the different term orderings and the URL comparisons introduced in Chapter 3. All related results are recorded and shown in histograms, followed by comparisons and analysis of the differences. Chapter 5 presents the conclusion, limitations and some potential improvements to the theories and experiments in this project.