3.3.1 Page Quality


When collecting a large number of web pages and their URLs as the data set to be searched, the following considerations apply.


A well-formatted and carefully maintained page is a good source for applying the text retrieval algorithms introduced above. However, pages crawled at random from one or a few seed URLs can have quality problems, because ill-formatted, outdated or content-poor web pages are unlike the plain-text documents and papers normally used in text retrieval. When terms are extracted from such pages, unrelated tokens such as HTML tags and scripts can take the place of the key words, and sometimes the remaining terms are too general to identify the page itself.
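The problem of HTML tags and scripts displacing key words can be illustrated with a minimal sketch that keeps only visible text before any terms are extracted. It uses Python's standard `html.parser` module; the class and function names (`VisibleTextExtractor`, `visible_text`) are hypothetical and are not taken from the project, which does not describe its own extraction code.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # visible text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Tags never reach this method; only text (and script/style bodies) do.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse all runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def visible_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `visible_text` on a downloaded page yields a string from which candidate terms can be taken without markup or script code among them. Text that exists only inside images, as in some of the pages discussed in this section, is still not recoverable this way.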

There is no general agreement on what makes a good web page, but some simple criteria help filter out unqualified pages. In their research, S. T. Park and Xiaojun Wang both remove pages that are not in HTML format, pages with fewer than 50 words or fewer than 5 unique words after removing all stop words, and pages that could not be downloaded from the designated URL because of server failures.

Figure3.10 is a typical example of an outdated page, from the domain binghamton.edu. In the red rectangle at the bottom left, the last update time is June 2007. Although the 2 red rectangles at the top seem to contain key words, such as Binghamton University, blackboard and BU, they are all images, which are inaccessible to text retrieval. There is some text in the green rectangles, but it only shows that this page has expired and been replaced by http://bubrain.binghamton.edu.

Figure3.10 http://busi.binghamton.edu

Figure3.11 http://forums.myspace.com/s/4.aspx?fuseaction=forums.viewsubforum

Figure3.11 is another typical example: a page that does not show enough text to identify itself. Although it is an up-to-date page, the terms on it are too general, such as “acoustic”, “alternative” and “electronic” at the top, and they are not good candidate terms for identifying it as a Myspace catalog page. In a simple manual test, the query “myspace acoustic alternative electronic/dance emo general hardcore hip-hop” was sent to Google. The 10 results on the first page are related to Myspace, but none of them is http://forums.myspace.com/s/4.aspx?fuseaction=forums.viewsubforum.

Figure3.12

The 10 results in Figure3.12 are:

1. forums.myspace.com/

2. forums.myspace.com/Default.aspx?fuseaction=forums.home&filterlang=en-US

3. www.mp3.com/tags/alternative/&page=5

4. www.soundclick.com/search/main.cfm?realm=7&city=Flagstaff&state=&country=USA

5. www.cnet.com.au/music/0,2000067577,8586q,00.htm

6. www.newcastlemusic.com/artists.php?letter=T

7. indiestore.7digital.com/portal/tags/experimental/?startIndex=51&sort=4

8. www.urbandictionary.com/define.php?term=Scene+Kid&page=6

9. www.sellaband.com/q4/

10. www.musicbanter.com/sitemap/f/f-11-p-12.html

Only the first 2 URLs belong to the Myspace domain. One of them is the home page of forums.myspace.com and the other is a different catalog, listing “automotive”, “business & entrepreneurs”, “campus life”, “career center”, “comedy”, “computer & technology”, “culture, arts & literature”, “fashion”, “filmmakers”, “food & drink” and “games” rather than “acoustic”, “alternative”, “electronic/dance”, “emo”, “general”, “hardcore”, “hip-hop”, “metal”, “punk” and “rock”. This page is shown in Figure3.13. Compared with Figure3.11, the 2 pages cover quite different topics.

Figure3.13 http://forums.myspace.com/Default.aspx?fuseaction=forums.home&filterlang=en-US
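The manual test above, checking whether the page's own URL appears among the first 10 results, can be sketched as a small comparison. The result list is copied from the text; the helper names (`normalize`, `rank_of`) and the normalization rules (ignore the scheme, host case and a trailing slash) are assumptions made for this illustration, not part of the project.

```python
from urllib.parse import urlsplit

# First-page results as listed in the text for Figure3.12.
RESULTS = [
    "forums.myspace.com/",
    "forums.myspace.com/Default.aspx?fuseaction=forums.home&filterlang=en-US",
    "www.mp3.com/tags/alternative/&page=5",
    "www.soundclick.com/search/main.cfm?realm=7&city=Flagstaff&state=&country=USA",
    "www.cnet.com.au/music/0,2000067577,8586q,00.htm",
    "www.newcastlemusic.com/artists.php?letter=T",
    "indiestore.7digital.com/portal/tags/experimental/?startIndex=51&sort=4",
    "www.urbandictionary.com/define.php?term=Scene+Kid&page=6",
    "www.sellaband.com/q4/",
    "www.musicbanter.com/sitemap/f/f-11-p-12.html",
]


def normalize(url):
    """Reduce a URL to host + path + query so that two spellings compare equal.

    Assumed rules: the scheme is ignored, the host is lower-cased and a
    trailing slash on the path is dropped.
    """
    if "://" not in url:
        url = "http://" + url  # the listed results omit the scheme
    parts = urlsplit(url)
    key = parts.netloc.lower() + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def rank_of(target, results):
    """Return the 1-based position of target in results, or None if absent."""
    wanted = normalize(target)
    for rank, url in enumerate(results, start=1):
        if normalize(url) == wanted:
            return rank
    return None
```

With this list, `rank_of("http://forums.myspace.com/s/4.aspx?fuseaction=forums.viewsubforum", RESULTS)` returns `None`, which matches the observation that the query built from the page's own terms did not retrieve the page.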

Although the pages in Figure3.11 and Figure3.13 look very similar and share the same page structure, they do not even focus on the same field: none of the text or links in the 2 pages are about the same content. Therefore, pages like those in Figure3.10, Figure3.11 and Figure3.13 cannot be used as data sources in this project, because they are either outdated or do not contain enough text information. This project also adopts S. T. Park and Xiaojun Wang’s criteria for filtering out unqualified pages:

1. Only consider pages whose URL ends with .html or .htm, or consists of a domain only, such as http://www.binghamton.edu.

2. Only consider pages with more than 50 words.

3. Only consider pages with more than 15 unique words after removing all stop words. This differs from S. T. Park and Xiaojun Wang’s threshold of 5, because the number of terms per query varies in this project, from 3 up to 15, rather than being fixed at a number such as 5.

4. Only consider pages that can be accessed without any connection error.
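The four rules above can be sketched as a single predicate. This is only an illustration under stated assumptions: the stop-word list is a small placeholder (the project's actual list is not given), words are taken to be runs of letters and digits, the extension check looks at the URL path and ignores any query string, and the function names and the `download_ok` flag are hypothetical.

```python
import re
from urllib.parse import urlparse

# Placeholder stop-word list (assumption; the project's real list is not given).
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "was", "with",
})

MIN_WORDS = 50          # rule 2: more than 50 words
MIN_UNIQUE_WORDS = 15   # rule 3: more than 15 unique non-stop words


def has_accepted_url(url):
    """Rule 1: the URL ends with .html/.htm or consists of a domain only."""
    parsed = urlparse(url)
    domain_only = parsed.path in ("", "/") and not parsed.query
    return domain_only or parsed.path.lower().endswith((".html", ".htm"))


def has_enough_text(text):
    """Rules 2 and 3, applied to the page's visible text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) <= MIN_WORDS:
        return False
    unique_words = set(words) - STOP_WORDS
    return len(unique_words) > MIN_UNIQUE_WORDS


def is_qualified(url, text, download_ok=True):
    """Combine all four rules; download_ok stands for rule 4 (no connection error)."""
    return download_ok and has_accepted_url(url) and has_enough_text(text)
```

For example, `has_accepted_url("http://www.binghamton.edu")` is true, while the Myspace forum URL from Figure3.11 is rejected by rule 1 because its path ends with `.aspx`.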

Meanwhile, unlike S. T. Park and Xiaojun Wang’s research, which crawled the web starting from a designated domain, this project studies only two kinds of pages: web pages of Binghamton University and Yahoo News. Pages in the domains binghamton.edu and news.yahoo.com, as shown in Figure3.14 to Figure3.21, are well maintained, rich in content and well formatted.

Figure3.14

Figure3.15

The Yahoo News pages follow fixed templates: navigation links at the top followed by 2 rectangular advertisements, the news title, update date and main content in the center, and links to top stories and another advertisement in the bottom right corner. This makes extracting the main content relatively easy, because the template does not change frequently.

Figure3.16

Figure3.17

Figure3.16 and Figure3.17 show another Yahoo News page template. Similar to Figure3.14 and Figure3.15, the news title and main content stay in the middle, while related links and advertisements stay on the left or right side. Figure3.18 and Figure3.19 are 2 news pages from binghamton.edu, which have the main content in the center and navigation links on both the right and the left. Commercial posters and advertisements are rare in the domain binghamton.edu compared to yahoo.com.

Figure3.18

Figure3.19

Figure3.20 and Figure3.21 are index pages from binghamton.edu. Because both are rich in content, they can also serve as test pages, just like the Binghamton University news pages in Figure3.18 and Figure3.19.

Figure3.20

Figure3.21

Posted on 2009-06-18 07:41 by JosephQuinn. Category: My Master-degree Project
