It is necessary at
this point to clarify why LS extraction cannot be applied directly to raw
web pages that are downloaded in full without any parsing. The first reason is
that, unlike plain-text information retrieval, web pages carry a distinctive
feature: HTML tags, which construct the page template, font formats and sizes,
image insertions, and other components of a polished appearance. These
presentational elements, however, become sources of distraction and
interference when an application tries to analyze a page, because in most
cases only the visible text of the page is useful. How to convert an HTML page
to plain text by removing all of the hidden tags is therefore a key issue that
affects every subsequent step and determines the final results. All of the
text in the page must be extracted first; at the same time, the tag
information behind that text cannot simply be discarded. In Michal's research,
for example, she classified and saved the extracted text in six different
categories, each carrying a unique weight.
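
As a rough illustration of this kind of tag-aware extraction, the Python
sketch below strips invisible markup with BeautifulSoup and groups the
remaining visible text by the tag it came from. The categories and weights
here are hypothetical placeholders, not the actual six categories or weights
used in Michal's research.

```python
# A minimal sketch of weighted text extraction from HTML.
# TAG_WEIGHTS is a hypothetical mapping, for illustration only;
# it does not reproduce the six categories from Michal's research.
from bs4 import BeautifulSoup

TAG_WEIGHTS = {
    "title": 5.0,    # page <title>
    "heading": 3.0,  # h1-h6
    "anchor": 2.0,   # text inside <a>
    "emphasis": 1.5, # b, strong, em, i
    "plain": 1.0,    # everything else
}

def categorize(tag_name):
    """Map an HTML tag name to one of the illustrative categories."""
    if tag_name == "title":
        return "title"
    if tag_name in {"h1", "h2", "h3", "h4", "h5", "h6"}:
        return "heading"
    if tag_name == "a":
        return "anchor"
    if tag_name in {"b", "strong", "em", "i"}:
        return "emphasis"
    return "plain"

def extract_weighted_text(html):
    """Return (text, category, weight) triples for the visible text of a page."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose content is never rendered as visible text.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    triples = []
    for node in soup.find_all(string=True):
        text = node.strip()
        if text:
            category = categorize(node.parent.name)
            triples.append((text, category, TAG_WEIGHTS[category]))
    return triples
```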
The second reason is that link information is also a powerful hint for
characterizing a particular web page. Commercial search engines, for example,
depend heavily on link-based algorithms such as PageRank and HITS (hubs and
authorities), and even in search and retrieval studies over academic papers,
citation-based ranking is widely accepted.
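
To make the role of link structure concrete, the short sketch below runs the
basic PageRank power iteration, PR(p) = (1 - d)/N + d * Σ PR(q)/outdeg(q)
over the pages q that link to p, on a toy graph. The graph, damping factor,
and iteration count are illustrative assumptions, and dangling pages are
deliberately ignored for brevity.

```python
# A toy power-iteration PageRank over a hypothetical three-page link graph;
# illustrative only, and simplified (dangling pages simply leak rank mass).
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += damping * share
        rank = new_rank
    return rank

# Hypothetical graph: page "a" is linked to by both "b" and "c",
# so it ends up with the highest rank.
print(pagerank({"a": ["b"], "b": ["a"], "c": ["a", "b"]}))
```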
Unlike academic papers, however, which gather their citations into a
references section at the end, a web page hides its link information in its
anchor tags, which calls for more complicated preprocessing of the data source
before LS extraction. Constructing a query from the extracted link
information, such as the domain a page belongs to, and combining it with the
LS could be another study, but it is not included in this report.
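
As a sketch of the kind of preprocessing this implies, the snippet below
resolves every href found in a page's anchor tags against the page's own URL
and tallies the target domains; the function name and the example URL are
hypothetical, not part of the pipeline used in this report.

```python
# A minimal sketch of harvesting the link information hidden in anchor tags;
# names and the example URL are hypothetical.
from collections import Counter
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup

def extract_link_domains(html, base_url):
    """Resolve every <a href="..."> against base_url and count target domains."""
    soup = BeautifulSoup(html, "html.parser")
    domains = Counter()
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(base_url, anchor["href"])  # handle relative links
        host = urlparse(absolute).netloc
        if host:
            domains[host] += 1
    return domains

# The most frequently linked domains can hint at the domain the page itself
# belongs to, which could later be combined with the LS in a query.
# extract_link_domains(page_html, "http://example.com/index.html").most_common(5)
```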