3.3 Data Set
Collecting a large number of web pages and their URLs as the data set of
objects to search for requires the following considerations.
3.3.1 Page Quality
A well-formatted and carefully maintained page is a good source for applying
the text retrieval algorithms introduced above. However, web pages crawled at
random from one or a few initial URLs can have page quality problems, because
ill-formatted, outdated or content-poor web pages are not at all like the
plain text documents or papers that are typically used in text retrieval.
After text is extracted from such pages and the algorithms are applied,
unrelated terms such as HTML tags and scripts take the place of the key words,
or the extracted terms are too general to identify the page itself.
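As an illustration of how HTML tags and scripts can be kept out of the
extracted terms, the following is a minimal sketch that uses only the Python
standard library. The names VisibleTextExtractor and visible_text are
hypothetical and are not taken from the project; the sketch only shows the
idea of discarding markup before any term selection is applied.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments found outside skipped elements

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the fragments and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def visible_text(html):
    """Return the text of an HTML string without tags, scripts or styles."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><script>var x = 1;</script></head>"
    "<body><h1>Binghamton</h1><p>University news</p></body></html>"
)
print(visible_text(sample))
```

Text that appears only inside images, as in some of the pages discussed in
this section, is not recovered by this kind of extraction.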
There is no general definition of what a good web page is, but some simple
criteria can help filter out unqualified pages. S. T. Park’s research and
Xiaojun Wang’s research both remove pages that are not in HTML format, that
contain fewer than 50 words or fewer than 5 unique words after all stop words
are removed, or that cannot be downloaded from the designated URL because of
server failures.
Figure3.10 is a typical example of an outdated page from the domain
binghamton.edu. The red rectangle at the bottom left shows that the page was
last updated in June 2007. Although the 2 red rectangles at the top appear to
contain key words such as Binghamton University, blackboard and BU, they are
all images, so no text retrieval method can access them. The green rectangles
contain some text, but it only shows that this page has expired and has been
replaced by http://bubrain.binghamton.edu.
Figure3.10 http://busi.binghamton.edu
Figure3.11 http://forums.myspace.com/s/4.aspx?fuseaction=forums.viewsubforum
Figure3.11 is another typical example: a page that does not show enough text
information to identify itself. Although the page is up to date, the terms
shown on it, such as “acoustic”, “alternative” and “electronic” at the top, are
too general to be good candidate terms for identifying it as a Myspace catalog
page. As a simple manual test, the query “myspace acoustic alternative
electronic/dance emo general hardcore hip-hop” was sent to Google. The 10
results on the first page are related to Myspace, but none of them is
http://forums.myspace.com/s/4.aspx?fuseaction=forums.viewsubforum.
Figure3.12
As shown in Figure3.12, the 10 results are:
1. forums.myspace.com/
2. forums.myspace.com/Default.aspx?fuseaction=forums.home&filterlang=en-US
3. www.mp3.com/tags/alternative/&page=5
4. www.soundclick.com/search/main.cfm?realm=7&city=Flagstaff&state=&country=USA
5. www.cnet.com.au/music/0,2000067577,8586q,00.htm
6. www.newcastlemusic.com/artists.php?letter=T
7. indiestore.7digital.com/portal/tags/experimental/?startIndex=51&sort=4
8. www.urbandictionary.com/define.php?term=Scene+Kid&page=6
9. www.sellaband.com/q4/
10. www.musicbanter.com/sitemap/f/f-11-p-12.html
Only the first 2 URLs belong to the Myspace domain. One of them is the home
page of forums.myspace.com, and the other is a different catalog that lists
“automotive”, “business & entrepreneurs”, “campus life”, “career center”,
“comedy”, “computer & technology”, “culture, arts & literature”, “fashion”,
“filmmakers”, “food & drink” and “games” rather than “acoustic”, “alternative”,
“electronic/dance”, “emo”, “general”, “hard core”, “hip-hop”, “metal”, “punk”
and “rock”. This page is shown in Figure3.13. Compared with Figure3.11, the 2
pages cover quite different topics.
Figure3.13 http://forums.myspace.com/Default.aspx?fuseaction=forums.home&filterlang=en-US
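The manual test above, checking whether a page's own URL appears among the
results returned for its terms, can be sketched as follows. The result list is
copied from the 10 results listed above; the helper names normalize and
rank_of are hypothetical, and the normalization (dropping the scheme and a
trailing slash) is a simplifying assumption rather than the project's method.

```python
import re

# First-page results as listed above (schemes omitted, as in the listing).
RESULTS = [
    "forums.myspace.com/",
    "forums.myspace.com/Default.aspx?fuseaction=forums.home&filterlang=en-US",
    "www.mp3.com/tags/alternative/&page=5",
    "www.soundclick.com/search/main.cfm?realm=7&city=Flagstaff&state=&country=USA",
    "www.cnet.com.au/music/0,2000067577,8586q,00.htm",
    "www.newcastlemusic.com/artists.php?letter=T",
    "indiestore.7digital.com/portal/tags/experimental/?startIndex=51&sort=4",
    "www.urbandictionary.com/define.php?term=Scene+Kid&page=6",
    "www.sellaband.com/q4/",
    "www.musicbanter.com/sitemap/f/f-11-p-12.html",
]


def normalize(url):
    """Drop the http(s) scheme and any trailing slash before comparing."""
    without_scheme = re.sub(r"^https?://", "", url.strip(), flags=re.IGNORECASE)
    return without_scheme.rstrip("/")


def rank_of(target_url, result_urls):
    """Return the 1-based rank of target_url in result_urls, or None."""
    target = normalize(target_url)
    for rank, url in enumerate(result_urls, start=1):
        if normalize(url) == target:
            return rank
    return None


target = "http://forums.myspace.com/s/4.aspx?fuseaction=forums.viewsubforum"
print(rank_of(target, RESULTS))
```

A return value of None corresponds to the outcome described above: the page is
not found on the first result page.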
Although Figure3.11 looks very similar to Figure3.13 and has the same page
structure, the 2 pages do not even focus on the same field: none of the text
or links in the 2 pages are about the same content. Therefore, pages like
Figure3.10, Figure3.11 and Figure3.13 are not the kind that can be used as a
data source in this project, because they are either not up to date or do not
contain enough text information. In this project, the criteria of S. T. Park
and Xiaojun Wang for filtering out unqualified pages are adopted as well:
1. Only consider pages whose URL ends with .html or .htm, or whose URL
consists of a domain only, such as http://www.binghamton.edu.
2. Only consider pages with more than 50 words.
3. Only consider pages with more than 15 unique words after removing all the
stop words. This differs from the threshold of 5 used by S. T. Park and
Xiaojun Wang, because the number of terms per query varies in this project,
from 3 up to 15, rather than being fixed at a number such as 5.
4. Only consider pages that can be accessed without any connection error.
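The four criteria above can be sketched as a single filter function. This is a
minimal illustration under stated assumptions, not the project's actual
implementation: the stop-word list is a tiny placeholder, the tokenizer is a
simple regular expression, a URL with a query string is treated as failing
criterion 1, and the names url_is_candidate, tokenize and page_is_qualified
are hypothetical. The thresholds 50 and 15 are the ones stated in the list.

```python
import re
from urllib.parse import urlparse

# Placeholder stop-word list (assumption); a real list would be much longer.
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}
)

MIN_WORDS = 50          # criterion 2: more than 50 words
MIN_UNIQUE_WORDS = 15   # criterion 3: more than 15 unique non-stop words


def url_is_candidate(url):
    """Criterion 1: URL ends with .html/.htm or consists of a domain only."""
    parsed = urlparse(url)
    if not parsed.netloc or parsed.query:
        # Assumption: a URL with a query string neither "ends with .html"
        # nor "consists of a domain only".
        return False
    path = parsed.path.lower()
    return path in ("", "/") or path.endswith((".html", ".htm"))


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def page_is_qualified(url, text, download_ok=True):
    """Apply criteria 1 to 4 to one page; text is the extracted page text."""
    if not download_ok:                 # criterion 4: no connection error
        return False
    if not url_is_candidate(url):       # criterion 1
        return False
    words = tokenize(text)
    if len(words) <= MIN_WORDS:         # criterion 2
        return False
    unique_words = set(words) - STOP_WORDS
    return len(unique_words) > MIN_UNIQUE_WORDS   # criterion 3


# 60 words drawn from 20 distinct terms: passes criteria 2 and 3.
rich_text = " ".join(f"term{i % 20}" for i in range(60))
print(page_is_qualified("http://www.binghamton.edu", rich_text))
```

Both thresholds are applied as strict inequalities, following the wording
“more than” in the list.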
Meanwhile, unlike the research of S. T. Park and Xiaojun Wang, which started
crawling the web from a designated domain, this project studies only two kinds
of pages: web pages from Binghamton University and from Yahoo News. As
Figure3.14 to Figure3.21 show, pages in the domains binghamton.edu and
news.yahoo.com are well maintained, rich in content and well formatted.
Figure3.14
Figure3.15
The Yahoo News pages follow a few fixed templates: navigation links at the
top, followed by 2 rectangular advertisements; the news title, update date and
main content in the center; and top-story links and another advertisement in
the bottom right corner. Because the template does not change frequently,
extracting the main content is relatively easy.
Figure3.16
Figure3.17
Figure3.16 and Figure3.17 show another Yahoo News page template. As in
Figure3.14 and Figure3.15, the news title and main content are in the middle,
and the related links and advertisements are on either the left or the right
side.
Figure3.18 and Figure3.19 are 2 news pages from binghamton.edu, which have the
main content in the center and navigation links on both the right and the
left. Commercial posters and advertisements are rare in the domain
binghamton.edu compared to yahoo.com.
Figure3.18
Figure3.19
Figure3.20 and Figure3.21 are index pages from binghamton.edu. Because both
are rich in content, they can also serve as test pages, like the Binghamton
University news pages in Figure3.18 and Figure3.19.
Figure3.20
Figure3.21