The text in the HTML title tag plays a vital role in web page retrieval. At the
beginning of this project, a large number of experiments were conducted using
the title method. It was believed that the success rate would reach 90% when
using title text as a query, provided the query was composed carefully and
properly. Figure 4.12 shows that the title method also remains stable as the
number of words in the query changes. It is important to mention that, although
the classic methods show better results in Figure 4.1 to Figure 4.10, this only
means that the HTML extraction performs well: it filters out the structural HTML
tags and functional scripts, which could be major distractions when the
retrieval process is later applied to the target page, because the basic
retrieval processes are designed only for plain text without structural tags.
For example, HTML tags such as ‘td’ and ‘tr’ would have very high term
frequencies, and function or variable names in JavaScript would have very low
document frequencies, if they were not filtered out or removed in the
pre-processing step. With the title method, by contrast, it is much easier to
extract only the text between <title> and </title>.
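As a minimal sketch of why title extraction is simple, the Python fragment below collects only the text between <title> and </title> and builds a query from it. The names TitleExtractor and title_query, the sample page, and the choice of taking the leading words of the title as query terms are assumptions made for illustration; the text does not specify how the query terms were actually selected.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text that appears between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Script and body text also arrive here; keep title text only.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace so the title is one clean line.
        return " ".join("".join(self._parts).split())


def title_query(html, max_terms):
    """Return a query made of the first `max_terms` words of the title.

    Taking the leading words is an assumption made for illustration.
    """
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title.split()[:max_terms])


# Hypothetical page: the script and table markup never reach the query.
page = """<html><head><title>Web Page Retrieval
  Using Title Terms</title>
<script>var tdCount = 0;</script></head>
<body><table><tr><td>cell</td></tr></table></body></html>"""

print(title_query(page, 3))  # prints "Web Page Retrieval"
```

Because only the title text is kept, tag names such as ‘td’ and ‘tr’ and the JavaScript variable name never enter the term statistics, so no separate tag-filtering step is needed for this method.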
| Title tag (terms) | Google (count) | Google (rate) | Yahoo (count) | Yahoo (rate) |
|-------------------|----------------|---------------|---------------|--------------|
| 3                 | 82.00          | 36.44%        | 72.00         | 32.00%       |
| 4                 | 91.00          | 40.44%        | 86.00         | 38.22%       |
| 5                 | 111.00         | 49.33%        | 94.00         | 41.78%       |
| 6                 | 116.00         | 51.56%        | 99.00         | 44.00%       |
| 7                 | 116.00         | 51.56%        | 102.00        | 45.33%       |
| 8                 | 115.00         | 51.11%        | 102.00        | 45.33%       |
| 9                 | 115.00         | 51.11%        | 101.00        | 44.89%       |
| 10                | 115.00         | 51.11%        | 101.00        | 44.89%       |
| 11                | 115.00         | 51.11%        | 102.00        | 45.33%       |
| 12                | 117.00         | 52.00%        | 102.00        | 45.33%       |
| 13                | 118.00         | 52.44%        | 103.00        | 45.78%       |
| 14                | 126.00         | 56.00%        | 111.00        | 49.33%       |
| 15                | 127.00         | 56.44%        | 112.00        | 49.78%       |
| Average           | 112.62         | 50.05%        | 99.00         | 44.00%       |

Table 4.11
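The averages in Table 4.11 can be rechecked with a few lines of Python. This is only a sketch: the total of 225 test pages is not stated in the text and is inferred here from the table itself (82 successes correspond to 36.44%), and the helper names rate and mean are illustrative.

```python
# Success counts for queries of 3 to 15 title terms, copied from Table 4.11.
google = [82, 91, 111, 116, 116, 115, 115, 115, 115, 117, 118, 126, 127]
yahoo = [72, 86, 94, 99, 102, 102, 101, 101, 102, 102, 103, 111, 112]

# Assumption: 225 test pages, inferred from 82 successes = 36.44%.
TOTAL_PAGES = 225


def rate(count, total=TOTAL_PAGES):
    """Success rate as a percentage, rounded to two decimals."""
    return round(100 * count / total, 2)


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


for name, counts in (("Google", google), ("Yahoo", yahoo)):
    avg = mean(counts)
    print(f"{name}: average count {avg:.2f}, average rate {rate(avg):.2f}%")
# prints:
# Google: average count 112.62, average rate 50.05%
# Yahoo: average count 99.00, average rate 44.00%
```

Under that assumed total, the recomputed values match the Average row of the table.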
Figure 4.12 (a), (b): Title terms used as the search query