In 1997, Michal Cutler proposed a method that uses the structure and hyperlinks of HTML
documents to improve the effectiveness of retrieving them [6].
She grouped the text of an HTML document into classes according to the enclosing tags, such as Title,
H1, H2, H3, H4, H5 and H6, and argued that terms in different HTML
tags should carry different weights. Building on this idea, a lexical
signature can be extracted from a web page by selecting the terms with the highest weights,
where the weights take the HTML tag structure into account [6].
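The weighting idea can be sketched as follows. This is only a minimal illustration, not Cutler's implementation: the class names, the numeric weights in `CLASS_WEIGHTS`, the function `lexical_signature` and the sample page are all hypothetical, since the text gives no numeric values. Each occurrence of a term contributes the weight of the class it appears in, and the top-scoring terms form the signature.

```python
from collections import Counter

# Hypothetical class weights; the source gives no numeric values.
CLASS_WEIGHTS = {"title": 4.0, "header": 3.0, "anchor": 2.0, "plain": 1.0}


def lexical_signature(class_terms, weights=CLASS_WEIGHTS, k=3):
    """Return the k terms with the highest tag-weighted frequency."""
    scores = Counter()
    for cls, terms in class_terms.items():
        for term in terms:
            # Every occurrence adds the weight of the class it occurs in.
            scores[term] += weights[cls]
    # Sort by descending score, then alphabetically so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical page, already split into classes.
page = {
    "title": ["binghamton", "research"],
    "header": ["research", "news"],
    "plain": ["news", "campus", "campus", "research"],
}
signature = lexical_signature(page)
print(signature)  # ['research', 'binghamton', 'news']
```

Here 'campus' occurs twice but only in plain text, so it scores lower than 'binghamton', which occurs once in the title.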
Cutler’s two papers are worth outlining here: “Using the Structure of HTML
Documents to Improve Retrieval” [6] and “A New Study on Using HTML Structures
to Improve Retrieval” [7].
First, she introduced the idea of assigning different term weights to different HTML
tags. The first paper divides the text of an HTML page into the classes listed in Table2.1. The detailed specification and function
of each tag are not covered in this section. She also ranked the classes by
importance: Anchor > H1-H2 > H3-H6 > Strong > Title >
Plain Text [6].
Class Name | HTML tags
Anchor | <a href=>…</a>
H1-H2 | <h1>…</h1>, <h2>…</h2>
H3-H6 | <h3>…</h3>, <h4>…</h4>, <h5>…</h5>, <h6>…</h6>
Strong | <strong>…</strong>, <b>…</b>, <em>…</em>, <i>…</i>, <u>…</u>, <dl>…</dl>, <ol>…</ol>, <ul>…</ul>
Title | <title>…</title>
Plain Text | None of the above
Table2.1 [6]
The second paper
divides an HTML page into the classes listed in Table2.2. It merges all the header tags
into one class but splits the earlier Strong class into two: List and Strong.
The second paper also treats text in the Title and Header classes as
more important than the rest, whereas Anchor and H1-H2 are the two
most important classes in Table2.1 [6].
The functions of the <dl>, <ol> and
<ul> tags are listed in Appendix A.
Class Name | HTML tags
Title | <title>…</title>
Header | <h1>…</h1>, <h2>…</h2>, <h3>…</h3>, <h4>…</h4>, <h5>…</h5>, <h6>…</h6>
List | <dl>…</dl>, <ol>…</ol>, <ul>…</ul>
Strong | <strong>…</strong>, <b>…</b>, <em>…</em>, <i>…</i>, <u>…</u>
Anchor | <a href=>…</a>
Plain Text | None of the above
Table2.2 [6]
The basic idea
behind both papers’ classifications is the same: split the text into classes
based on the enclosing tags, then associate each class with a different weight. When a term
is enclosed by tags from more than one class, it is counted only in the
highest-ranked class. For example, in <H1><A href="http://www.binghamton.edu">university</A></H1>,
‘university’ falls into the Header class rather than the Anchor class
according to Table2.2 [6], but into the Anchor class according to
Table2.1 [6].
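The nesting rule can be sketched with Python's standard `html.parser`. This is a minimal illustration under stated assumptions rather than the papers' implementation. The priority order for Table2.1 is the one given in the text. Treating the row order of Table2.2 as its priority order is an assumption. The names `ClassTagger` and `classify` are hypothetical.

```python
from html.parser import HTMLParser

# Classes in priority order, highest first. Table2.1's order is stated in the
# text; using Table2.2's row order as its priority is an assumption.
TABLE_2_1 = [
    ("Anchor", {"a"}),
    ("H1-H2", {"h1", "h2"}),
    ("H3-H6", {"h3", "h4", "h5", "h6"}),
    ("Strong", {"strong", "b", "em", "i", "u", "dl", "ol", "ul"}),
    ("Title", {"title"}),
]
TABLE_2_2 = [
    ("Title", {"title"}),
    ("Header", {"h1", "h2", "h3", "h4", "h5", "h6"}),
    ("List", {"dl", "ol", "ul"}),
    ("Strong", {"strong", "b", "em", "i", "u"}),
    ("Anchor", {"a"}),
]


class ClassTagger(HTMLParser):
    """Assign each term to the highest-priority class among its open tags."""

    def __init__(self, scheme):
        super().__init__()
        self.scheme = scheme
        self.open_tags = []  # stack of currently open tag names (lowercase)
        self.terms = []      # (term, class name) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        enclosing = set(self.open_tags)
        # First class in priority order that owns any enclosing tag wins.
        cls = next((name for name, tags in self.scheme if tags & enclosing),
                   "Plain Text")
        self.terms.extend((term.lower(), cls) for term in data.split())


def classify(html, scheme):
    tagger = ClassTagger(scheme)
    tagger.feed(html)
    tagger.close()
    return tagger.terms


snippet = '<h1><a href="http://www.binghamton.edu">university</a></h1>'
print(classify(snippet, TABLE_2_1))  # [('university', 'Anchor')]
print(classify(snippet, TABLE_2_2))  # [('university', 'Header')]
```

The same term lands in a different class under each scheme only because the priority lists differ; the parsing logic is identical.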
Figure2.5 is a snapshot of http://research.binghamton.edu/.
The text in the squares is enclosed in either Strong or Anchor tags and is
highlighted with a larger font size or a color other than
regular black. This is consistent with the page author’s intention:
the highlighted lines are meant to draw the reader’s attention,
so they should carry more weight than the un-highlighted text.
Figure2.5
However,
applying different weights to different HTML tags raises difficulties.
Take the piece of HTML in Figure2.6, which comes from a Yahoo news page, as an example:
Figure2.6
Look closely
at the red and orange squares. The line “Mario left a comment: Obama’s ….”
is split into two parts: the terms in blue are in Anchor tags
with HREF links to other pages, while ‘left a comment’ in the orange square
sits outside the Anchor tags and is clearly rendered in a Strong text style
compared with “to see what your Connections are…”. However, Yahoo placed ‘left a
comment’ in a pre-defined <P> tag and styled that tag as Strong. This
can make conventional HTML parsing inaccurate and destroy
the original order of the text. As Figure2.7 shows, the <P> tags and <A>
tags are interleaved, which can make it hard to tell the text of these
two kinds of tags apart if the program is not designed carefully.
Figure2.7
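One way to keep the original text order when <P> and <A> tags are interleaved is to record every text run in document order together with its innermost enclosing tag. The sketch below uses Python's standard `html.parser`. The markup is hypothetical and only loosely modeled on the description of Figure2.6, not copied from the Yahoo page, and the name `OrderedRuns` is likewise an assumption.

```python
from html.parser import HTMLParser


class OrderedRuns(HTMLParser):
    """Collect text runs in document order with their innermost open tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.runs = []       # (innermost tag or None, text) in source order

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            innermost = self.open_tags[-1] if self.open_tags else None
            self.runs.append((innermost, text))


# Hypothetical markup; the class attribute and href values are placeholders.
html = ('<div><a href="#">Mario</a> <p class="strong">left a comment:</p> '
        '<a href="#">Obama\'s ...</a></div>')
parser = OrderedRuns()
parser.feed(html)
parser.close()
runs = parser.runs
print(runs)
# [('a', 'Mario'), ('p', 'left a comment:'), ('a', "Obama's ...")]
```

Because the runs stay in source order, the sentence can be reassembled even though its parts belong to different tags. Deciding that the <P> run counts as Strong would still require inspecting its styling, which this sketch does not attempt.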
On the other hand,
both papers focused on the test search engine WEBOR [7],
which was developed by Weiyi Meng and Michal Cutler, so Cutler’s theory and research
rested on a clear understanding of how
WEBOR works. Cutler could also control and modify WEBOR itself
whenever the CIV needed to change [6][7].
It is therefore unclear whether the conclusions hold
when this LS extraction method is applied to Google, Yahoo or other commercial SEs,
which keep their search mechanisms secret.