Similar to Xiaojun Wang’s application of word rank to web pages, this project applies
sentence rank to web pages. The passages in a web page can be extracted, as shown in the red square in Figure2.19.
Figure2.19
However, there are conditions under which sentence rank does not work. Some web pages
contain no sentences or passages, only titles or phrases, which makes sentence rank
ineffective on them. Take Figure2.20 as an example: there is no passage on
Yahoo’s home page. Although the three red squares in the center contain several
sentences, each shown in a separate anchor tag, it is difficult to construct
connections among those independent sentences because they come from
completely different topics. Meanwhile, the blue squares on the left contain many single
words and phrases, such as “answer”, “auto” and “finance”, which makes it
challenging to combine the terms and sentences and to apply sentence rank.
Therefore, a page like the one in Figure2.20 is not a typical page to which
sentence rank can be applied.
Figure2.20 A typical example of a link-based page
There is a simple way to exclude the pages that are not suitable for sentence rank.
A threshold p is defined to separate pages into two categories, link-based pages and
content-based pages, after applying formula 2-11.
2-11
A page like the one in Figure2.20 can be classified as a link-based page,
which has a high proportion of its text inside links. Link-based pages are commonly
found among websites’ home pages and index pages. In contrast, a content-based page such as the one in Figure2.19 has a high
proportion of plain text outside links.
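
The classification described above can be sketched in Python. This is only an illustration under stated assumptions, not the project's implementation: formula 2-11 is not reproduced in this section, so the sketch assumes it measures the share of a page's visible text characters that appear inside anchor tags. The names LinkTextCounter, link_text_ratio and classify_page, and the default value p = 0.5, are hypothetical placeholders.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Counts visible text characters inside and outside <a> elements."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0  # greater than 0 while inside an <a> element
        self.skip_depth = 0    # greater than 0 while inside <script> or <style>
        self.link_chars = 0    # non-whitespace characters inside links
        self.total_chars = 0   # all non-whitespace visible characters

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return  # script and style content is not visible text
        n = len("".join(data.split()))  # count characters, ignoring whitespace
        self.total_chars += n
        if self.anchor_depth:
            self.link_chars += n


def link_text_ratio(html):
    """Assumed stand-in for formula 2-11: link text / total visible text."""
    counter = LinkTextCounter()
    counter.feed(html)
    counter.close()
    if counter.total_chars == 0:
        return 0.0  # a page with no text has no link text either
    return counter.link_chars / counter.total_chars


def classify_page(html, p=0.5):
    """Label a page by comparing its link-text ratio with threshold p.

    The default p = 0.5 is a placeholder, not a value from the source.
    """
    return "link-based" if link_text_ratio(html) > p else "content-based"
```

Under these assumptions, a portal-style page whose text sits almost entirely in anchors would yield a ratio close to 1 and be excluded from sentence rank, while an article page with a few navigation links would yield a low ratio and be kept. The threshold would need to be tuned on real pages.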