4 Experimental Results and Analysis
All of the following experiments were conducted from April 24, 2009 to April 30, 2009. Each term's document frequency was obtained from the Google web search interface on April 24, 2009.
4.1 The basics
The basics cover all of the LS extraction methods from Seung Park's paper: TF, DF, TFIDF, PW, TF3DF2, TF4DF1, TFIDF3DF2 and TFIDF4DF1. In this project, query length ranges from 3 to 15 terms. The digits in the hybrid method names give the ratio between TF and DF terms (or between TFIDF and DF terms): '3' and '2' mean that 60% of the terms in each query are TF terms and 40% are DF terms, while '4' and '1' mean that 80% are TF terms and 20% are DF terms. The details are given in Section 3.1 of Chapter 3. Because the queries here can be longer, two further groups that are not in Seung Park's paper are added: TF5DF5 and TFIDF5DF5, in which the query is composed of 50% TF and 50% DF terms, or 50% TFIDF and 50% DF terms. The detailed TF, DF and IDF selection rules strictly follow Seung Park's paper.
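To make the ratio notation concrete, the following is a minimal sketch of how a hybrid query such as TF3DF2 could be composed. It is not the implementation used in this project: the function name `compose_query`, the rounding rule, the tie-breaking, the order in which DF and TF terms are picked, and the example terms are all assumptions for illustration; the actual selection rules are those of Section 3.1.

```python
def compose_query(terms, n_terms, tf_parts, df_parts):
    """Return n_terms query terms split between TF and DF picks.

    terms maps each candidate term to a (tf, df) pair.
    tf_parts:df_parts is the ratio in the method name, e.g. 3:2 for TF3DF2.
    """
    # Assumption: the TF share is rounded with Python's round(),
    # which rounds exact halves to the nearest even number.
    n_tf = round(n_terms * tf_parts / (tf_parts + df_parts))
    n_df = n_terms - n_tf

    # Assumption: DF picks are the rarest terms (lowest DF, ties by higher TF).
    by_df = sorted(terms, key=lambda t: (terms[t][1], -terms[t][0]))
    df_picks = by_df[:n_df]

    # Assumption: TF picks are the most frequent of the remaining terms.
    rest = [t for t in terms if t not in df_picks]
    by_tf = sorted(rest, key=lambda t: (-terms[t][0], terms[t][1]))
    return by_tf[:n_tf] + df_picks


# Hypothetical candidate terms with made-up (tf, df) values.
candidates = {
    "page": (12, 9000),
    "retrieval": (9, 400),
    "signature": (7, 50),
    "lexical": (6, 30),
    "web": (15, 20000),
    "url": (4, 800),
}

# TF3DF2 with a 5-term query: 3 TF terms and 2 DF terms.
print(compose_query(candidates, 5, 3, 2))
```

With a 5-term query, the 3:2 ratio yields three TF terms and two DF terms, and the 4:1 ratio yields four TF terms and one DF term.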
The following charts show the number of successfully retrieved pages out of 225 URLs and the corresponding percentage. The 225 URLs are all listed in Appendix B. The 'success count' on the Y axis is defined as follows: a query is sent to a search engine, and if at least one of the first 10 result URLs matches the original URL, the success count is increased by 1.
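The success criterion above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, a "match" is taken to be exact string equality between URLs (the text does not say how URLs are compared), and the example.com URLs are placeholders rather than entries from Appendix B.

```python
def is_success(original_url, result_urls, top_k=10):
    """True if the original URL is among the first top_k result URLs."""
    # Assumption: a "match" is exact string equality of URLs.
    return original_url in result_urls[:top_k]


def success_count(runs, top_k=10):
    """Count queries whose top_k results contain the original URL.

    runs is a list of (original_url, result_urls) pairs, one per page.
    """
    return sum(1 for url, results in runs if is_success(url, results, top_k))


# Hypothetical example data (example.com URLs are placeholders).
filler = [f"http://example.com/other{i}" for i in range(10)]
runs = [
    ("http://example.com/a", ["http://example.com/a"] + filler),      # rank 1
    ("http://example.com/b", filler + ["http://example.com/b"]),      # rank 11
    ("http://example.com/c", filler[:9] + ["http://example.com/c"]),  # rank 10
]
print(success_count(runs))  # 2
```

The page found at rank 11 is not counted, because only the first 10 results are considered.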
In all of the following subsections, the blue lines represent the results from Google web search and the red lines represent the results from Yahoo web search. The left charts show the number of successfully retrieved pages among all 225 pages. The right charts show the success rate, which is the number of successfully retrieved pages divided by 225.
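As a small worked example of the percentage shown in the right charts, the sketch below converts a success count into a success rate. The function name `success_rate` is hypothetical; the two counts are the Google TF values for 3 and 9 terms from Table 4.1.

```python
TOTAL_PAGES = 225  # size of the URL set used in this chapter


def success_rate(count, total=TOTAL_PAGES):
    """Success rate in percent: successfully retrieved pages / total pages."""
    return 100.0 * count / total


print(f"{success_rate(61):.2f}%")   # 27.11%
print(f"{success_rate(153):.2f}%")  # 68.00%
```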
TF (number of words) | Google count | Google rate | Yahoo count | Yahoo rate
3 | 61.00 | 27.11% | 58.00 | 25.78%
4 | 91.00 | 40.44% | 86.00 | 38.22%
5 | 113.00 | 50.22% | 98.00 | 43.56%
6 | 122.00 | 54.22% | 110.00 | 48.89%
7 | 137.00 | 60.89% | 128.00 | 56.89%
8 | 144.00 | 64.00% | 127.00 | 56.44%
9 | 153.00 | 68.00% | 126.00 | 56.00%
10 | 153.00 | 68.00% | 128.00 | 56.89%
11 | 154.00 | 68.44% | 127.00 | 56.44%
12 | 157.00 | 69.78% | 132.00 | 58.67%
13 | 160.00 | 71.11% | 131.00 | 58.22%
14 | 159.00 | 70.67% | 126.00 | 56.00%
15 | 158.00 | 70.22% | 130.00 | 57.78%
Average | 135.54 | 60.24% | 115.92 | 51.52%
Table 4.1 TF
Figure 4.1 (a) Number of successfully retrieved pages out of 225 and (b) the corresponding percentage, by TF
DF (number of words) | Google count | Google rate | Yahoo count | Yahoo rate
3 | 155.00 | 68.89% | 136.00 | 60.44%
4 | 156.00 | 69.33% | 140.00 | 62.22%
5 | 161.00 | 71.56% | 140.00 | 62.22%
6 | 162.00 | 72.00% | 134.00 | 59.56%
7 | 164.00 | 72.89% | 129.00 | 57.33%
8 | 163.00 | 72.44% | 129.00 | 57.33%
9 | 163.00 | 72.44% | 134.00 | 59.56%
10 | 159.00 | 70.67% | 130.00 | 57.78%
11 | 162.00 | 72.00% | 131.00 | 58.22%
12 | 160.00 | 71.11% | 126.00 | 56.00%
13 | 159.00 | 70.67% | 130.00 | 57.78%
14 | 162.00 | 72.00% | 123.00 | 54.67%
15 | 161.00 | 71.56% | 125.00 | 55.56%
Average | 160.54 | 71.35% | 131.31 | 58.36%
Table 4.2 DF
Figure 4.2 (a) Number of successfully retrieved pages out of 225 and (b) the corresponding percentage, by DF
Figure 4.1 shows that the success rate grows with the number of terms in a query and then flattens out and becomes stable at about 10 terms per query. By contrast, Figure 4.2 shows that with DF (document frequency) the success rate changes little with query length: it stays around 70% on Google and close to 60% on Yahoo over the whole range from 3 to 15 terms. This suggests that DF performs well at identifying the page itself within the results returned by both Google and Yahoo, even when the number of terms is as low as 3, 4, or 5, where the success rate is very similar to that for more than 10 terms. As the following experiments show, when the share of DF terms in the query increases, as in TF5DF5 where DF terms make up 50% of the query, the success rate for short queries of 3, 4, or 5 terms is also higher than with TF4DF1, where DF terms make up 20%.
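One way to quantify this contrast is to compare the spread (maximum minus minimum) of the Google success rates over query lengths 3 to 15. The sketch below uses the Google percentages from Table 4.1 and Table 4.2; the helper name `spread` is hypothetical, and this check is an illustration rather than part of the original analysis.

```python
# Google success rates (%) for query lengths 3..15, from Tables 4.1 and 4.2.
tf_google = [27.11, 40.44, 50.22, 54.22, 60.89, 64.00, 68.00,
             68.00, 68.44, 69.78, 71.11, 70.67, 70.22]
df_google = [68.89, 69.33, 71.56, 72.00, 72.89, 72.44, 72.44,
             70.67, 72.00, 71.11, 70.67, 72.00, 71.56]


def spread(rates):
    """Difference between the highest and lowest rate, in percentage points."""
    return max(rates) - min(rates)


print(f"TF spread: {spread(tf_google):.2f} points")  # 44.00
print(f"DF spread: {spread(df_google):.2f} points")  # 4.00
```

On Google, the TF success rate varies by 44 percentage points across query lengths, while the DF success rate varies by only 4.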
TFIDF (number of words) | Google count | Google rate | Yahoo count | Yahoo rate
3 | 80.00 | 35.56% | 81.00 | 36.00%
4 | 105.00 | 46.67% | 93.00 | 41.33%
5 | 134.00 | 59.56% | 114.00 | 50.67%
6 | 144.00 | 64.00% | 125.00 | 55.56%
7 | 151.00 | 67.11% | 141.00 | 62.67%
8 | 160.00 | 71.11% | 131.00 | 58.22%
9 | 162.00 | 72.00% | 135.00 | 60.00%
10 | 165.00 | 73.33% | 137.00 | 60.89%
11 | 168.00 | 74.67% | 133.00 | 59.11%
12 | 167.00 | 74.22% | 142.00 | 63.11%
13 | 169.00 | 75.11% | 131.00 | 58.22%
14 | 168.00 | 74.67% | 133.00 | 59.11%
15 | 168.00 | 74.67% | 131.00 | 58.22%
Average | 149.31 | 66.36% | 125.15 | 55.62%
Table 4.3 TFIDF
Figure 4.3 (a) Number of successfully retrieved pages out of 225 and (b) the corresponding percentage, by TFIDF
PW (number of words) | Google count | Google rate | Yahoo count | Yahoo rate
3 | 60.00 | 26.67% | 61.00 | 27.11%
4 | 92.00 | 40.89% | 81.00 | 36.00%
5 | 115.00 | 51.11% | 94.00 | 41.78%
6 | 121.00 | 53.78% | 108.00 | 48.00%
7 | 137.00 | 60.89% | 123.00 | 54.67%
8 | 144.00 | 64.00% | 124.00 | 55.11%
9 | 153.00 | 68.00% | 126.00 | 56.00%
10 | 154.00 | 68.44% | 126.00 | 56.00%
11 | 154.00 | 68.44% | 121.00 | 53.78%
12 | 158.00 | 70.22% | 121.00 | 53.78%
13 | 162.00 | 72.00% | 130.00 | 57.78%
14 | 163.00 | 72.44% | 122.00 | 54.22%
15 | 158.00 | 70.22% | 124.00 | 55.11%
Average | 136.23 | 60.55% | 112.38 | 49.95%
Table 4.4 PW
Figure 4.4 (a) Number of successfully retrieved pages out of 225 and (b) the corresponding percentage, by PW
TF3DF2 (number of words) | Google count | Google rate | Yahoo count | Yahoo rate
3 | 128.00 | 56.89% | 123.00 | 54.67%
4 | 137.00 | 60.89% | 130.00 | 57.78%
5 | 154.00 | 68.44% | 137.00 | 60.89%
6 | 162.00 | 72.00% | 141.00 | 62.67%
7 | 165.00 | 73.33% | 133.00 | 59.11%
8 | 170.00 | 75.56% | 140.00 | 62.22%
9 | 172.00 | 76.44% | 140.00 | 62.22%
10 | 168.00 | 74.67% | 149.00 | 66.22%
11 | 170.00 | 75.56% | 144.00 | 64.00%
12 | 168.00 | 74.67% | 145.00 | 64.44%
13 | 165.00 | 73.33% | 142.00 | 63.11%
14 | 166.00 | 73.78% | 135.00 | 60.00%
15 | 163.00 | 72.44% | 128.00 | 56.89%
Average | 160.62 | 71.38% | 137.46 | 61.09%
Table 4.5 TF3DF2
Figure 4.5 (a) Number of successfully retrieved pages out of 225 and (b) the corresponding percentage, by TF3DF2
TF4DF1 (number of words) | Google count | Google rate | Yahoo count | Yahoo rate
3 | 50.00 | 22.22% | 51.00 | 22.67%
4 | 83.00 | 36.89% | 78.00 | 34.67%
5 | 153.00 | 68.00% | 140.00 | 62.22%
6 | 158.00 | 70.22% | 126.00 | 56.00%
7 | 165.00 | 73.33% | 135.00 | 60.00%
8 | 167.00 | 74.22% | 131.00 | 58.22%
9 | 169.00 | 75.11% | 131.00 | 58.22%
10 | 169.00 | 75.11% | 134.00 | 59.56%
11 | 168.00 | 74.67% | 140.00 | 62.22%
12 | 169.00 | 75.11% | 138.00 | 61.33%
13 | 169.00 | 75.11% | 146.00 | 64.89%
14 | 167.00 | 74.22% | 143.00 | 63.56%
15 | 168.00 | 74.67% | 142.00 | 63.11%
Average | 150.38 | 66.84% | 125.77 | 55.90%
Table 4.6 TF4DF1
Figure 4.6 (a) Number of successfully retrieved pages out of 225 and (b) the corresponding percentage, by TF4DF1
TF5DF5 (number of words) | Google count | Google rate | Yahoo count | Yahoo rate
3 | 128.00 | 56.89% | 122.00 | 54.22%
4 | 149.00 | 66.22% | 137.00 | 60.89%
5 | 154.00 | 68.44% | 140.00 | 62.22%
6 | 165.00 | 73.33% | 147.00 | 65.33%
7 | 167.00 | 74.22% | 147.00 | 65.33%
8 | 170.00 | 75.56% | 143.00 | 63.56%
9 | 168.00 | 74.67% | 144.00 | 64.00%
10 | 165.00 | 73.33% | 139.00 | 61.78%
11 | 166.00 | 73.78% | 138.00 | 61.33%
12 | 163.00 | 72.44% | 133.00 | 59.11%
13 | 163.00 | 72.44% | 138.00 | 61.33%
14 | 163.00 | 72.44% | 131.00 | 58.22%
15 | 163.00 | 72.44% | 132.00 | 58.67%
Average | 160.31 | 71.25% | 137.77 | 61.23%
Table 4.7 TF5DF5
Figure 4.7 (a) Number of successfully retrieved pages out of 225 and (b) the corresponding percentage, by TF5DF5
TFIDF3DF2 (number of words) | Google count | Google rate | Yahoo count | Yahoo rate
3 | 127.00 | 56.44% | 130.00 | 57.78%
4 | 139.00 | 61.78% | 134.00 | 59.56%
5 | 162.00 | 72.00% | 138.00 | 61.33%
6 | 164.00 | 72.89% | 146.00 | 64.89%
7 | 167.00 | 74.22% | 141.00 | 62.67%
8 | 168.00 | 74.67% | 144.00 | 64.00%
9 | 170.00 | 75.56% | 147.00 | 65.33%
10 | 170.00 | 75.56% | 145.00 | 64.44%
11 | 168.00 | 74.67% | 146.00 | 64.89%
12 | 168.00 | 74.67% | 148.00 | 65.78%
13 | 166.00 | 73.78% | 144.00 | 64.00%
14 | 164.00 | 72.89% | 140.00 | 62.22%
15 | 162.00 | 72.00% | 142.00 | 63.11%
Average | 161.15 | 71.62% | 141.92 | 63.08%
Table 4.8 TFIDF3DF2
Figure 4.8 (a) Number of successfully retrieved pages out of 225 and (b) the corresponding percentage, by TFIDF3DF2
TFIDF4DF1 (number of words) | Google count | Google rate | Yahoo count | Yahoo rate
3 | 80.00 | 35.56% | 81.00 | 36.00%
4 | 105.00 | 46.67% | 96.00 | 42.67%
5 | 151.00 | 67.11% | 135.00 | 60.00%
6 | 161.00 | 71.56% | 124.00 | 55.11%
7 | 168.00 | 74.67% | 137.00 | 60.89%
8 | 168.00 | 74.67% | 139.00 | 61.78%
9 | 172.00 | 76.44% | 140.00 | 62.22%
10 | 172.00 | 76.44% | 142.00 | 63.11%
11 | 170.00 | 75.56% | 146.00 | 64.89%
12 | 170.00 | 75.56% | 142.00 | 63.11%
13 | 170.00 | 75.56% | 141.00 | 62.67%
14 | 169.00 | 75.11% | 147.00 | 65.33%
15 | 167.00 | 74.22% | 145.00 | 64.44%
Average | 155.62 | 69.16% | 131.92 | 58.63%
Table 4.9 TFIDF4DF1
Figure 4.9 (a) Number of successfully retrieved pages out of 225 and (b) the corresponding percentage, by TFIDF4DF1
TFIDF5DF5 (number of words) | Google count | Google rate | Yahoo count | Yahoo rate
3 | 127.00 | 56.44% | 130.00 | 57.78%
4 | 154.00 | 68.44% | 142.00 | 63.11%
5 | 162.00 | 72.00% | 141.00 | 62.67%
6 | 164.00 | 72.89% | 145.00 | 64.44%
7 | 167.00 | 74.22% | 144.00 | 64.00%
8 | 169.00 | 75.11% | 139.00 | 61.78%
9 | 169.00 | 75.11% | 144.00 | 64.00%
10 | 166.00 | 73.78% | 142.00 | 63.11%
11 | 167.00 | 74.22% | 145.00 | 64.44%
12 | 164.00 | 72.89% | 131.00 | 58.22%
13 | 163.00 | 72.44% | 146.00 | 64.89%
14 | 162.00 | 72.00% | 135.00 | 60.00%
15 | 163.00 | 72.44% | 133.00 | 59.11%
Average | 161.31 | 71.69% | 139.77 | 62.12%
Table 4.10 TFIDF5DF5
Figure 4.10 (a) Number of successfully retrieved pages out of 225 and (b) the corresponding percentage, by TFIDF5DF5
As shown in Figure 4.1 to Figure 4.10, the basic information retrieval methods work well on the 225 online pages. Once the number of query terms exceeds 10, the success rate is around 70% on Google and around 60% on Yahoo, and all of the methods become stable beyond 10 terms. The average success rate over query lengths 3 to 15 is then computed separately for each method, and the results are shown in Figure 4.11. For a better comparison, I include the Title method from Section 4.2 in this figure ahead of its discussion. DF, TF3DF2, TF5DF5, TFIDF3DF2 and TFIDF5DF5 have higher average success rates than the others: more than 70% on Google and more than 60% on Yahoo. The one exception is DF's Yahoo result, which at 58.36% is still higher than all of the remaining methods except TFIDF4DF1 (58.63%). Again, this shows the importance of DF in retrieval.
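The averaging and comparison can be reproduced from the tables with a short sketch. The variable names are hypothetical, and the 70% and 60% cutoffs are simply the figures quoted in the text, not thresholds defined by the method. The counts and average percentages are copied from Tables 4.1 to 4.10.

```python
# Google success counts for TF at query lengths 3..15 (Table 4.1).
tf_google_counts = [61, 91, 113, 122, 137, 144, 153, 153, 154, 157, 160, 159, 158]
tf_google_mean = sum(tf_google_counts) / len(tf_google_counts)
print(f"{tf_google_mean:.2f}")  # 135.54, the Average row of Table 4.1

# Average success rates (%) per method, from the Average rows of the tables.
google_avg = {"TF": 60.24, "DF": 71.35, "TFIDF": 66.36, "PW": 60.55,
              "TF3DF2": 71.38, "TF4DF1": 66.84, "TF5DF5": 71.25,
              "TFIDF3DF2": 71.62, "TFIDF4DF1": 69.16, "TFIDF5DF5": 71.69}
yahoo_avg = {"TF": 51.52, "DF": 58.36, "TFIDF": 55.62, "PW": 49.95,
             "TF3DF2": 61.09, "TF4DF1": 55.90, "TF5DF5": 61.23,
             "TFIDF3DF2": 63.08, "TFIDF4DF1": 58.63, "TFIDF5DF5": 62.12}

# Methods above 70% on Google, and those also above 60% on Yahoo.
top_google = [m for m, rate in google_avg.items() if rate > 70]
top_both = [m for m in top_google if yahoo_avg[m] > 60]
print(top_google)  # ['DF', 'TF3DF2', 'TF5DF5', 'TFIDF3DF2', 'TFIDF5DF5']
print(top_both)    # ['TF3DF2', 'TF5DF5', 'TFIDF3DF2', 'TFIDF5DF5']
```

The five methods above 70% on Google are exactly the ones singled out in the text, and DF is the only one of them that falls below 60% on Yahoo.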
Figure 4.11 Comparison of all basic TF-, DF- and IDF-related methods