2.4 Michal Cutler’s Study on HTML Structure


In 1997, Michal Cutler proposed a method that uses the structure and hyperlinks of HTML documents to improve retrieval effectiveness ^[6]. She grouped the text of an HTML page into classes according to the enclosing tags, such as Title, H1, H2, H3, H4, H5, and H6, and argued that terms appearing in different tags should carry different weights. Building on this idea, a lexical signature can be extracted from a web page by selecting the terms with the highest weights, where the weights take the HTML tag structure into account ^[6].
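The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the numeric weights in `CLASS_WEIGHTS`, and the function `lexical_signature` are hypothetical and are not taken from Cutler's papers, and the terms are assumed to have been assigned to classes already.

```python
from collections import defaultdict

# Hypothetical class weights, chosen only for illustration;
# they are NOT values reported in the papers.
CLASS_WEIGHTS = {"title": 4.0, "header": 3.0, "plain": 1.0}


def lexical_signature(occurrences, k=3):
    """Return the k terms with the highest summed class weight.

    occurrences: iterable of (term, class_name) pairs, one per occurrence.
    Unknown class names fall back to the plain-text weight of 1.0.
    """
    scores = defaultdict(float)
    for term, cls in occurrences:
        scores[term.lower()] += CLASS_WEIGHTS.get(cls, 1.0)
    # Sort by descending score, then alphabetically so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


occurrences = [
    ("university", "title"),
    ("research", "plain"),
    ("research", "plain"),
    ("campus", "header"),
    ("news", "plain"),
]
print(lexical_signature(occurrences, k=2))
```

With these made-up weights, a single occurrence in a header outranks two occurrences in plain text, which is the effect the weighting is meant to achieve.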

Both of Cutler’s papers are worth outlining: “Using the Structure of HTML Documents to Improve Retrieval” ^[6] and “A New Study on Using HTML Structures to Improve Retrieval” ^[7].

First, she introduced the idea of assigning different term weights to different HTML tags. The first paper divides an HTML page into the classes listed in Table2.1; the detailed specification and function of each tag are not covered in this section. She also gave the order of class importance as Anchor > H1-H2 > H3-H6 > Strong > Title > Plain Text ^[6].

Class Name	HTML tags
Anchor	<a href=>…</a>
H1-H2	<h1>…</h1>, <h2>…</h2>
H3-H6	<h3>…</h3>, <h4>…</h4>, <h5>…</h5>, <h6>…</h6>
Strong	<strong>...</strong>, <b>…</b>, <em>…</em>, <i>…</i>, <u>…</u>, <dl>…</dl>, <ol>…</ol>, <ul>…</ul>
Title	<title>…</title>
Plain Text	None of the above

Table2.1 ^[6]

The second paper divides an HTML page into the classes listed in Table2.2. It merges all the header tags into a single class but splits the former Strong class into two: List and Strong. It also treats text in the Title and Header classes as more important than the rest, whereas in Table2.1 ^[6] Anchor and H1-H2 are the two most important classes.

The functions of the <dl>, <ol>, and <ul> tags are listed in Appendix A.

Class Name	HTML tags
Title	<title>…</title>
Header	<h1>…</h1>, <h2>…</h2>, <h3>…</h3>, <h4>…</h4>, <h5>…</h5>, <h6>…</h6>
List	<dl>…</dl>, <ol>…</ol>, <ul>…</ul>
Strong	<strong>...</strong>, <b>…</b>, <em>…</em>, <i>…</i>, <u>…</u>
Anchor	<a href=>…</a>
Plain Text	None of the above

Table2.2 ^[7]

The idea behind both classifications is the same: split the text into classes according to the enclosing tags, then assign each class a different weight. When a term falls inside tags from more than one class, it is counted only in the most important of those classes. For example, in <H1><A href="http://www.binghamton.edu">university</A></H1>, the term ‘university’ belongs to the Header class rather than the Anchor class according to Table2.2 ^[7], but to the Anchor class according to Table2.1 ^[6].
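The rule for nested tags can be sketched with Python's standard `html.parser`. This is a minimal sketch, not the implementation used in the papers: the names `TermClassifier`, `classify`, `TABLE_2_1`, and `TABLE_2_2` are hypothetical, and the priority order for Table2.2 is assumed to follow the row order of the table (the text above only states that Title and Header rank highest).

```python
from html.parser import HTMLParser


def _scheme(order, groups):
    """Build (tag -> class, class -> rank) lookups; rank 0 is most important."""
    tag_class = {tag: cls for cls, tags in groups.items() for tag in tags}
    rank = {cls: i for i, cls in enumerate(order)}
    return tag_class, rank


# Order stated in the text for the first paper.
TABLE_2_1 = _scheme(
    ["Anchor", "H1-H2", "H3-H6", "Strong", "Title", "Plain Text"],
    {
        "Anchor": ["a"],
        "H1-H2": ["h1", "h2"],
        "H3-H6": ["h3", "h4", "h5", "h6"],
        "Strong": ["strong", "b", "em", "i", "u", "dl", "ol", "ul"],
        "Title": ["title"],
    },
)

# ASSUMPTION: priority follows the row order of Table2.2.
TABLE_2_2 = _scheme(
    ["Title", "Header", "List", "Strong", "Anchor", "Plain Text"],
    {
        "Title": ["title"],
        "Header": ["h1", "h2", "h3", "h4", "h5", "h6"],
        "List": ["dl", "ol", "ul"],
        "Strong": ["strong", "b", "em", "i", "u"],
        "Anchor": ["a"],
    },
)


class TermClassifier(HTMLParser):
    """Assign each term to the most important class among its open tags."""

    def __init__(self, scheme):
        super().__init__()
        self.tag_class, self.rank = scheme
        self.open_tags = []  # currently open tags that belong to some class
        self.terms = []      # (term, class) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.tag_class:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Remove the most recent matching open tag; ignore stray closers.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        classes = [self.tag_class[t] for t in self.open_tags] or ["Plain Text"]
        best = min(classes, key=self.rank.__getitem__)
        for term in data.split():
            self.terms.append((term, best))


def classify(html, scheme):
    parser = TermClassifier(scheme)
    parser.feed(html)
    parser.close()
    return parser.terms


example = '<h1><a href="http://www.binghamton.edu">university</a></h1>'
print(classify(example, TABLE_2_1))
print(classify(example, TABLE_2_2))
```

Running the two calls on the example reproduces the difference described above: the same term lands in the Anchor class under Table2.1 and in the Header class under Table2.2.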

Figure2.5 is a snapshot of http://research.binghamton.edu/. The text inside the squares is in either a Strong tag or an Anchor tag and is highlighted with a larger font size or a color other than regular black. This is consistent with the page author’s intention: the highlighted lines are meant to draw more attention, so they should carry more weight than the un-highlighted text.

Figure2.5

However, applying different weights to different HTML tags brings difficulties. Take the piece of HTML in Figure2.6, from a Yahoo news page, as an example:

Figure2.6

Look closely at the red square and the orange square. The line “Mario left a comment: Obama’s ….” is split into two different parts: the terms in blue are inside Anchor tags whose HREF attributes link to other pages, while ‘left a comment’ in the orange square sits outside the Anchor tags and is clearly displayed in a Strong text style compared with “to see what your Connections are…”. However, Yahoo placed ‘left a comment’ in a pre-defined <P> tag and gave it a Strong style. As a result, conventional ways of parsing HTML can become inaccurate and can destroy the original order of the text. As Figure2.7 shows, the <P> tags and <A> tags are interleaved, which can cause confusion in telling apart the text in those two kinds of tags if the program is not designed carefully.

Figure2.7
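One way to keep the original text order when <P> and <A> tags are interleaved is to record each text run together with its innermost open tag as the parser meets it. The sketch below uses Python's standard `html.parser`; the class `OrderedSegments`, the function `segments`, and the `FRAGMENT` markup are hypothetical. In particular, `FRAGMENT` only imitates the shape of the Yahoo snippet in the figures and is not the actual page source.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag and must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class OrderedSegments(HTMLParser):
    """Collect text runs in document order with their innermost open tag."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.segments = []  # (innermost_tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close the nearest matching open tag and anything left open inside it.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if text:
            innermost = self.stack[-1] if self.stack else None
            self.segments.append((innermost, text))


def segments(html):
    parser = OrderedSegments()
    parser.feed(html)
    parser.close()
    return parser.segments


# HYPOTHETICAL markup imitating the structure discussed above.
FRAGMENT = (
    "<p class='comment'>"
    "<a href='/profile'>Mario</a> left a comment: "
    "<a href='/story'>Obama's ...</a>"
    "</p>"
)

for tag, text in segments(FRAGMENT):
    print(tag, "|", text)
```

Because every run is emitted in the order it appears, the sentence can be reassembled by joining the segments, while the tag stored with each run still allows the <P> text and the <A> text to be weighted separately.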

On the other hand, both papers focused on the test search engine WEBOR ^[7], which was developed by Weiyi Meng and Michal Cutler, so Cutler’s research was carried out with a clear understanding of WEBOR’s working mechanism. Cutler could also control and modify WEBOR itself whenever a change to the class importance vector (CIV) required it ^[6][7].

It is therefore unclear whether the same conclusions hold when this LS extraction method is applied to Google, Yahoo, or other commercial SEs, which keep their search mechanisms secret.

posted on 2009-06-15 09:00 by JosephQuinn, filed under: My Master-degree Project
