String Streams
The stringstream classes are useful for extracting data from strings and for writing formatted data to strings.
A very common question, for example, is this: ``How do I convert a string to a number?''. Of course, there's no way to do so in general, since not all strings look like numbers. But it's certainly not unusual that we'd want to extract a number from a string. Note that very old versions of the g++ compiler shipped without the <sstream> header.
FAQ: How Do I Convert A String To A Number?
#include <iostream>
#include <sstream>
#include <string>
using namespace std;

int main()
{
    int a;
    string s = "456";
    istringstream sin(s);   // an input stream that reads from the string s
    sin >> a;
    cout << a << endl;
    return 0;
}
The istringstream class is an input stream attached to a string. The constructor copies the string s into its private buffer, and then we may extract data from it using regular stream semantics; in this case we use sin >> a.
FAQ: How Do I Convert A Number To A String?
or how do I do ``sprintf'' in C++?
You may have guessed -- the answer is to use an ostringstream. This data type behaves in a similar way to the istringstream. The main difference is that there is an extra method, str(). This method returns the string that lies in the ostringstream object. There's also a version of str() that takes a string argument -- this initialises the stream's underlying string buffer to that argument. This is commonly used to clear a stream for reuse -- one can call mystream.str("");
#include <iostream>
#include <sstream>

int main()
{
    std::ostringstream strout;
    int x = 42;
    strout << "The answer to the question is " << x << std::endl;
    std::cout << strout.str() << std::endl;
    strout.str("");                  // clear the stream for reuse
    strout << 53.2;
    std::cout << strout.str() << std::endl;
    return 0;
}
Using Stringstreams To Parse Input
A problem that often comes up is this: suppose you have written the following code:
#include <iostream>

using namespace std;

int main()
{
    int x;
    do
        cout << "Enter a positive integer (0 to quit)" << endl;
    while ( cin >> x && x != 0 );
    return 0;
}
What happens if the user enters
2 3
Or worse, if they enter:
5.0
The problem is that the extraction operator does not expect each item extracted to be on a separate line. So if you do expect this, you need to be explicit about it in your code. The way to do this is to use getline() to read a line of input into a string, and then use that string to create an istringstream from which we can extract data. After we've extracted the data we need, we can check for trailing garbage. Here's an example:
#include <iostream>
#include <sstream>
#include <string>

using std::cin;
using std::cout;
using std::endl;

int main()
{
    int x;
    char ch;
    std::string myString;
    while (getline( cin, myString ))
    {
        std::istringstream strin(myString);
        strin >> x;
        if (!strin)
            cout << "Bad input" << endl;
        else if ( strin >> ch )          // anything non-whitespace left over?
            cout << "Bad input" << endl;
        else
            cout << "You entered: " << x << endl;
    }
    return 0;
}
Some notes:
* The istringstream object is declared within the loop, so its scope is the block of the loop. A new istringstream object is therefore created, and the old one destroyed, on each iteration of the loop.
* The test (!strin) checks whether the stream is in an error state.
* The attempt to extract a character tests whether there's anything left in the stream after we extract the integer. Note that this ignores whitespace (which is the desired effect in this case).
* If there are no problems, then the extraction was successful.
strstream considered harmful
There exists a deprecated class similar to stringstream called strstream. Do not confuse these two classes; they are not the same. strstream has a very error-prone interface because of the way it handles memory. One problem is that it does not append a trailing null when str() is called, so you must append a trailing null yourself, or insert std::ends. Another important thing to remember is that if you use ostrstream with a dynamic buffer, like this:
std::ostrstream strout;
strout << "The answer is ..." << 42 << std::endl << std::ends;
strout.str();
Then calling str() has the peculiar side effect that the caller becomes responsible for managing the memory allocated for strout's buffer, which is pretty silly since the caller does not know how the memory was allocated (it may have been allocated using malloc() or new). So to make the stupid thing take its memory back, one makes the following call:
strout.freeze(0);
It's worth mentioning that there is another, safer way to use ostrstream: use it with a static buffer. If you do this, you don't need to deal with this freeze() nonsense. To do this, call the constructor that takes a character array as an argument.
char a[100];
std::ostrstream strout(a,100);
strout << "the answer is " << 42 << std::endl << std::ends;
std::cout << a << std::endl;
Here is a basic "Hello World" Spring example, based on the example from "Spring in Action", chapter 2.
Of course, the JDK must be installed, with %JAVA_HOME% set and both %Path% and %Classpath% filled with the correct paths.
Maven2 Installation
Because this is the first example showing how Maven works with Spring, all the mandatory steps are included, without other optional configurations.
Windows 2000/XP/Vista
1. Unzip the distribution archive, i.e. apache-maven-2.0.10-bin.zip to the directory F:\Development\j2eelib\apache_maven\apache-maven-2.1.0
2. Add the M2_HOME environment variable by opening up the system properties (WinKey + Pause), selecting the "Advanced" tab, and the "Environment Variables" button, then adding the M2_HOME variable in the user variables with the value F:\Development\j2eelib\apache_maven\apache-maven-2.1.0
Be sure to omit any quotation marks around the path, even if it contains spaces. Note: for Maven < 2.0.9, also be sure that M2_HOME doesn't have a '\' as its last character.
3. In the same dialog, add the M2 environment variable in the user variables with the value %M2_HOME%\bin.
4. Optional: In the same dialog, add the MAVEN_OPTS environment variable in the user variables to specify JVM properties, e.g. the value -Xms256m -Xmx512m. This environment variable can be used to supply extra options to Maven.
5. In the same dialog, update/create the Path environment variable in the user variables and prepend the value %M2% to make Maven available on the command line.
6. In the same dialog, make sure that JAVA_HOME exists.
7. Open a new command prompt (Winkey + R then type cmd) and run mvn --version to verify that it is correctly installed.
Create First "Helloworld" Spring Project by Maven2.
Step 1.
Open the Windows cmd and go into the directory prepared to be the project's location,
e.g. F:\Development\Java\maven\
mvn archetype:create -DgroupId=com.springinaction.chapter01 -DartifactId=springinaction
Step 2.
Go into the sub-directory springinaction, which was created by the mvn command above: cd springinaction
It's worth mentioning that checking all the available dependencies, and their version numbers, on Maven's website is necessary:
http://mvnrepository.com/artifact/org.springframework/spring
Configure pom.xml first:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.springinaction.chapter01</groupId>
  <artifactId>springinaction</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>springinaction</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>

    <!-- Choose to add each module as a dependency as it's needed. -->
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring</artifactId>
      <version>2.5.6</version>
    </dependency>

    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-aop</artifactId>
      <version>2.5.6</version>
    </dependency>

    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-beans</artifactId>
      <version>2.5.6</version>
    </dependency>

    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-core</artifactId>
      <version>2.5.6</version>
    </dependency>

  </dependencies>
</project>
Edit App.java, which has already been created, or create any new Java files as needed.
package com.springinaction.chapter01;

import org.springframework.beans.factory.BeanFactory;
import org.springframework.beans.factory.xml.XmlBeanFactory;
import org.springframework.core.io.FileSystemResource;

import com.springinaction.chapter01.service.GreetingService;

public class App{
    public static void main( String[] args ){
        BeanFactory factory = new XmlBeanFactory(new FileSystemResource("hello.xml"));

        GreetingService greetingService = (GreetingService) factory.getBean("greetingService");
        greetingService.sayGreeting();
    }
}
Create service directory under com.springinaction.chapter01 and create GreetingService.java:

package com.springinaction.chapter01.service;

public interface GreetingService{
    void sayGreeting();
}
Create serviceimpl directory under com.springinaction.chapter01.service and create GreetingServiceImpl.java:

package com.springinaction.chapter01.service.serviceimpl;

import com.springinaction.chapter01.service.GreetingService;

public class GreetingServiceImpl implements GreetingService{
    private String greeting;
    public GreetingServiceImpl(){}
    public GreetingServiceImpl(String greeting){
        this.greeting = greeting;
    }
    public void sayGreeting(){
        System.out.println(greeting);
    }
    public void setGreeting(String greeting){
        this.greeting = greeting;
    }
}
The directory structure should now look as follows: the directories service and serviceimpl are created manually, while the others are created by Maven automatically.
Create hello.xml under the root directory springinaction:

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="
        http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-2.0.xsd">

    <bean id="greetingService" class="com.springinaction.chapter01.service.serviceimpl.GreetingServiceImpl">
        <property name="greeting" value="Hello World in Spring Bean!"/>
    </bean>

</beans>
Step 3.
In cmd mode, in the root directory springinaction, run:
mvn compile
mvn test
or
mvn package
And then:
java -cp target/springinaction-1.0-SNAPSHOT.jar;F:\Development\j2eelib\spring-framework-2.5.6\dist\spring.jar;F:\Development\j2eelib\spring-framework-2.5.6\lib\jakarta-commons\commons-logging.jar;F:\Development\j2eelib\spring-framework-2.5.6\dist\modules\spring-aop.jar;F:\Development\j2eelib\spring-framework-2.5.6\dist\modules\spring-beans.jar;F:\Development\j2eelib\spring-framework-2.5.6\dist\modules\spring-core.jar com.springinaction.chapter01.App
Notice: change the -cp parameters according to the specific locations in your environment.
It is important to include all the jar files the project needs.
-cp means classpath.
After debugging, a clean is necessary; it deletes all compiled classes, jars and their directories:
mvn clean
Due to the excellent performance of sentence rank, a further experiment was conducted: applying sentence rank to real news web pages. In this section, due to the length of the report, only the undirected graph with 10 terms per query is implemented. The success retrieval rates below show a high percentage when the cosine similarity of 2 web pages is computed using 4.1 and 4.2. "10 terms a query" means taking only the first 10 words of the selected sentence, including stop words, which is consistent with section 4.6. Unlike locating the exact address of a web page itself, this comparison finds documents on a similar topic by comparing 2 web pages with different URLs; the details are all introduced in section 3.4.
[Equations 4.1 and 4.2: the success retrieval rate and the document cosine similarity, as in equations 3.1 and 3-2.]
Meanwhile, there are 3 search engines employed in this section: Yahoo News Search, Yahoo Web Search and Google Web Search. Unlike sections 4.1 to 4.6, which only count a URL string match as a successful retrieval, section 4.7 takes document similarity into consideration: if equation 4.2's value is bigger than 0.9, a criterion also permitted in S. T. Park's and Xiaojun Wang's research, the retrieval is considered a success. There are 183 pages in this section, all from Yahoo News on May 4, 2009, and all the related URL addresses are listed in Appendix D.
Search engine     | Success Counts | Success Rate
Yahoo News Search | 171            | 93.44%
Yahoo Web Search  | 178            | 97.27%
Google Web Search | 177            | 96.72%
Table4.29
Figure4.32 (a) and (b)
As Figure4.32 (b) shows, the success rate is above 90%, which satisfies the project's initial requirement with a single text retrieval method applied.
As stated in chapter 3, sentence rank can significantly improve linguistic summarization compared to the traditional TF or DF methods. Considering the complexity of sentence rank, randomly picking a sentence and taking its first 3 to 15 words, in their original order, as the search query avoids the iterations of the graph-based ranking algorithm. The results below show that even with randomly picked sentences, once the number of terms reaches 10 the performance increases enormously; some rates are higher than 75%, which the previous, carefully designed retrieval algorithms cannot easily achieve.
Random Sentence (terms per query) | Google (count) | Google (rate) | Yahoo (count) | Yahoo (rate)
3       |  88.00 | 39.11% |  90.00 | 40.00%
4       | 114.00 | 50.67% | 102.00 | 45.33%
5       | 134.00 | 59.56% | 124.00 | 55.11%
6       | 150.00 | 66.67% | 144.00 | 64.00%
7       | 155.00 | 68.89% | 137.00 | 60.89%
8       | 162.00 | 72.00% | 154.00 | 68.44%
9       | 161.00 | 71.56% | 144.00 | 64.00%
10      | 168.00 | 74.67% | 151.00 | 67.11%
11      | 168.00 | 74.67% | 151.00 | 67.11%
12      | 170.00 | 75.56% | 168.00 | 74.67%
13      | 172.00 | 76.44% | 168.00 | 74.67%
14      | 171.00 | 76.00% | 169.00 | 75.11%
15      | 175.00 | 77.78% | 174.00 | 77.33%
Average | 152.92 | 67.97% | 144.31 | 64.14%
Table4.24
Figure4.26 Random Sentence Pick from Google and Yahoo results, panels (a) and (b)
The text in HTML's title tag always plays a vital role in web page retrieval. At the beginning of this project, an extensive number of experiments were conducted using the title method. It was believed that the success rate could reach 90% using title text as a query, if the query were composed carefully and properly. Figure4.12 shows that the title method also has good stability with respect to the number of words in a query. It is important to mention that, from Figure4.1 to Figure4.10, although the classic methods have better results, this only means the HTML extraction performs well: it filters out the structural HTML tags and functional scripts, which would be big distractions in the subsequent processing of the target page, because the basic retrieval process is designed only for pure text without structural tags. For example, HTML tags like 'td' and 'tr' would have large term frequencies, and function or variable names in Javascript would cause very low document frequencies, if they were not filtered or removed in the preprocessing step. With the title method, however, it is much easier to extract only the text between <title> and </title>.
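As a small illustration (not code from the project), extracting the title text with a regular expression in Java might look like this; the class and method names are made up for the example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {

    // Pull the text between <title> and </title> out of a raw HTML page.
    static String extractTitle(String html) {
        Pattern p = Pattern.compile("<title[^>]*>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Winter storm hits coast</title></head></html>";
        System.out.println(extractTitle(html));   // prints: Winter storm hits coast
    }
}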
Title tag (terms per query) | Google (count) | Google (rate) | Yahoo (count) | Yahoo (rate)
3       |  82.00 | 36.44% |  72    | 32.00%
4       |  91.00 | 40.44% |  86.00 | 38.22%
5       | 111.00 | 49.33% |  94.00 | 41.78%
6       | 116.00 | 51.56% |  99.00 | 44.00%
7       | 116.00 | 51.56% | 102.00 | 45.33%
8       | 115.00 | 51.11% | 102.00 | 45.33%
9       | 115.00 | 51.11% | 101.00 | 44.89%
10      | 115.00 | 51.11% | 101.00 | 44.89%
11      | 115.00 | 51.11% | 102.00 | 45.33%
12      | 117.00 | 52.00% | 102.00 | 45.33%
13      | 118.00 | 52.44% | 103.00 | 45.78%
14      | 126.00 | 56.00% | 111.00 | 49.33%
15      | 127.00 | 56.44% | 112.00 | 49.78%
Average | 112.62 | 50.05% |  99.00 | 44.00%
Table4.11
Figure4.12 Use title terms as search query, panels (a) and (b)
A practical question arising from this project is whether its ability to test general search engines can also be applied to testing deep web search engines. General search engines such as Google and Yahoo are widely approved for their proper results and links. However, many sites may not allow their documents to be indexed, instead allowing the documents to be accessed only through their own search engines; these sites are part of the so-called Deep Web [1][17]. Deep web search engines focus only on their own databases and pages: data or documents that are kept private and cannot be searched by general search engines. Take www.taobao.com as an example. It is an online commercial trading site like www.ebay.com, and Taobao deliberately blocks general search engines such as www.baidu.com and www.google.com from accessing its commodity results, after its negotiations with the big search engine companies broke down. This forces people who want commodity and price information to go directly to Taobao's own search engine interface and browse the result items on Taobao's website. Obviously, search engines like Taobao's are probably developed in-house or by contracted consulting software teams, so their performance is an interesting topic, distinct from the engines generally accepted by the public such as Google and Yahoo. A detailed introduction to the deep web and the implementation of deep web search engines is not part of this project, but the practical value of this project is a feasible way of testing local, small search engines embedded in their own web sites.
If there is a URL match or a content match, a successful retrieval is established. If the URL does not match (URLs change all the time [2][3]), comparison between the original page and the retrieved pages is indispensable, and it is done in 2 ways: manually and automatically. Manually checking all the content of the original and retrieved pages is time-consuming, but it guarantees precise results; in this project, around 200 pages from the data source were picked for manual checking. Rather than relying on brute force, the automatic comparison between the result pages from the search engine and each test page also needs HTML page preprocessing, as in step 2.
3-1
cos(d1, d2) = Σ_w (TF_{w,1} × TF_{w,2}) / ( √(Σ_w TF_{w,1}²) × √(Σ_w TF_{w,2}²) )   (3-2)
In 3-2, TF_w is the word's term frequency in document 1 or document 2. In this project, some necessary removal is applied to the pages, so the comparison between 2 pages focuses only on the main content; all the advertisements, copyright information, sponsors' links and similar material are removed. The task can be described as finding a similar topic within 2 different pages. 3 pairs of example pages are listed from Figure3.22 to Figure3.24. Using the undirected weighted sentence rank algorithm, the highest-ranking sentence is picked, input as a query into the SE, and then compared to the result page.
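For illustration, here is a minimal Java sketch of the TF-vector cosine measure of 3-2, assuming both pages have already been reduced to plain text; the tokenization rule and class names are assumptions of this sketch, not the project's actual code:

import java.util.HashMap;
import java.util.Map;

public class CosineSimilarity {

    // Build a term-frequency map from plain page text (tags already stripped).
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<String, Integer>();
        for (String w : text.toLowerCase().split("\\W+")) {
            if (w.isEmpty()) continue;
            Integer c = tf.get(w);
            tf.put(w, c == null ? 1 : c + 1);
        }
        return tf;
    }

    // Equation 3-2: cosine of the two TF vectors.
    static double cosine(Map<String, Integer> tf1, Map<String, Integer> tf2) {
        double dot = 0, norm1 = 0, norm2 = 0;
        for (Map.Entry<String, Integer> e : tf1.entrySet()) {
            norm1 += (double) e.getValue() * e.getValue();
            Integer other = tf2.get(e.getKey());
            if (other != null) dot += (double) e.getValue() * other;
        }
        for (int v : tf2.values()) norm2 += (double) v * v;
        if (norm1 == 0 || norm2 == 0) return 0;
        return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
    }

    public static void main(String[] args) {
        double sim = cosine(termFrequencies("winter storm hits the coast"),
                            termFrequencies("the winter storm hit the gulf coast"));
        System.out.println(sim);   // counted as a success if above the 0.9 threshold
    }
}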
Figure3.22 (a) and (b) are an example proving the validity of the cosine comparison. The post time is shown in the red circle: Figure3.22 (a) does not show the date but "34 mins ago", while Figure3.22 (b) shows "Mon Mar 2, 11:57pm ET". In fact, Figure3.22 (a) was downloaded in the morning of March 2, 2009, and Figure3.22 (b) at noon on the same day. Apparently, Yahoo news editors keep updating and modifying the same news item, so the later version differs somewhat in content, but both pages are talking about the same issue.
Figure3.22 shows the downloaded HTML file images. (a)'s URL is
http://news.yahoo.com/s/ap/20090302/ap_on_re_us/winter_storm
The retrieved URL is
http://news.yahoo.com/s/ap/20090303/ap_on_re_us/winter_storm_43
Comparing the two URLs, it is obvious that, even for the same content, Yahoo News changes the URL by appending "_43" at the end.
Figure3.22 (a) and (b)
Figure3.23 (a) and (b)
Figure3.23 (a) and (b) are an example of finding a similar-content web page according to a downloaded local page, Figure3.23 (a). Obviously, both pages are talking about the missing NFL players in Florida's Gulf, one of the most popular news stories at the time of this experiment.
Figure3.23 (a)'s URL is
http://news.yahoo.com/s/ap/20090302/ap_on_re_us/missing_boaters_nfl
Figure3.23 (b)'s URL is
http://www.npr.org/templates/story/story.php?storyId=101375823&ft=1&f=1003
The documents' similarity is 98.38% by 3-2.
Figure3.24 (a) and (b) are another example of finding a similar-content web page according to a downloaded local page. They are both talking about children's blood lead levels.
Figure3.24 (a)'s URL is:
http://news.yahoo.com/ /s/ap/20090302/ap_on_bi_go_ec_fi/economy
Figure3.24 (b)'s URL is
http://www.ajc.com/services/content/health/stories/2009/03/02/children_lead_level.html?cxtype=rss&cxsvc=7&cxcat=9
The documents' similarity is 94% by 3-2.
Figure3.24 (a) and (b)
In chapter 2, section 2.1, S. T. Park adopted 5 terms per query. However, a wider range of term numbers is adopted in this project: LS lengths from 3 to 15 are compared against the success rate. Sentence rank takes the first N words of the top-ranked sentence, N from 3 to 15, even including stop words, as the search query, and discards the rest of the sentence. This procedure does not follow the traditional ways of text retrieval; however, the experiments in chapter 5 show even better results when the number of terms is more than 10, compared to the same number of terms used in the traditional ways.
It is necessary at this point to clarify why LS extraction cannot be applied directly to the raw web page, downloaded in full without any parser. First, unlike pure-text information retrieval, web pages have a unique feature, HTML tags, which construct the page template, font format, font size, image insertions and the other components of a fancy appearance. However, these good-looking gadgets are actually sources of distraction and interference when applications try to analyze the pages, because in common sense only the displayed text of a web page is useful. How to transform an HTML page into pure text, by removing all kinds of hidden tags, is a key issue for the following steps and decides the final results. The text in the page must first be fully extracted; meanwhile, the tag information behind the text cannot simply be discarded. For example, in Michal's research, she classified and saved the text into 6 different categories, each carrying a unique weight. Second, link information is also a powerful hint for deciding the unique features of a particular web page. For example, the commercial search engines depend heavily on algorithms like PageRank and authorities-and-hubs, and even for search and retrieval studies of academic papers, the citation rank algorithm is widely accepted. However, unlike academic papers, which contain their citations at the end of each paper as a references chapter, a web page's link information hides in its anchor tags, which requires more complicated data-source preprocessing before LS extraction. Constructing a query by combining the LS with extracted link information, such as the domain the page belongs to, could be another study, but it is not included in this report.
Pros: a simple REST API which can be accessed directly from Javascript or virtually any server-side language; offers up to 1,000 results in chunks of up to 50 at a time; offers results in at least three different formats: XML, JSON, and serialized PHP.
Cons: no access to Google search services, only Yahoo; only offers up to 1,000 results in chunks of up to 50 at a time; a per-IP-address rate limit of 5,000 queries per 24-hour period; absolutely no UI elements, which makes programming harder: anyone who wants to build a client-side approach has to build it almost from scratch. People are also expected to route any links through Yahoo's servers so Yahoo can track traffic.
Figure3.9 Yahoo Web Search API Java snippet
Figure3.9 is an implementation of the Yahoo web search API, with the typical address in the red box. To apply the Yahoo news search API instead, the address in the red square can be changed to:
String se = "http://search.yahooapis.com/NewsSearchService/V1/newsSearch?appid="+uid+"&query="+query+"&results="+total+"&language=en";
Both Google Ajax and the Google Base Data API restrict the number of results, and a client application does not have the free access of manually browsing the web page. It appears that Google checks the User-Agent header to make sure the request comes from a browser it knows about; otherwise it denies access (i.e. Google does not want people doing this from an application unless they use an API service, as in sections 3.2.1 and 3.2.2; it blocks the client's Java program). One workaround was found by simply changing some parts of the Java code: following the piece of code in the red box solves this problem.
Figure3.8
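Figure3.8 itself is not reproduced here; based on the surrounding description, the change it shows is presumably along these lines (a hypothetical sketch, with a made-up User-Agent string):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class BrowserLikeFetch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.google.com/search?q=job+search");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Pretend to be a normal browser so the server does not reject the client.
        conn.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US)");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line);
        }
        in.close();
    }
}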
The Google Base Data API is another option for programmers who want to write client applications that interact with Google Base, with a query like this:
http://base.google.com/base/feeds/snippets?bq=query&key=xxxxxxx
The value of "bq" is the input query, and the value of "key" is a unique ID provided by Google after registering with Google. The result is returned in clean XML format. However, there are still some limits: with the same query, the Google Base Data API returns different results compared to its original web interface.
Google's SOAP Search API would have had the ability to access Google's results, but it has been deprecated since late 2006, and no new license keys have been distributed. Instead of the SOAP API, Google released its new Ajax API for researchers; the search query should look like this:
http://ajax.googleapis.com/ajax/services/search/web?start=1&rsz=large&v=1.0&q="query"
Figure3.6 Google Ajax code snippet
The result is returned in JSON format, which can be parsed by the JSON library from http://www.json.org/java/. By changing the "start=" value, the offset is set easily. However, with the new Google Ajax API, the maximum number of result URLs returned from Google is limited: in 2008 the maximum was 32; currently, the maximum is 64.
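A sketch of reading such a response with the org.json classes mentioned above; the field names responseData, results and url follow the Ajax Search API's JSON layout, and the fetch code is assumed to be as in the earlier snippets:

import org.json.JSONArray;
import org.json.JSONObject;

public class GoogleAjaxParse {
    // json is the raw text returned by the ajax.googleapis.com query above.
    static void printResultUrls(String json) throws Exception {
        JSONObject response = new JSONObject(json);
        JSONArray results = response.getJSONObject("responseData")
                                    .getJSONArray("results");
        for (int i = 0; i < results.length(); i++) {
            // each entry carries the result link plus title and snippet fields
            System.out.println(results.getJSONObject(i).getString("url"));
        }
    }
}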
As 2 of the most powerful search engines, Google and Yahoo have the strongest abilities in searching the surface web, and they also provide all kinds of special search functions, such as web search, news search and image search, familiar to people all over the world. A general experiment exploring search abilities that did not test these 2 search engines would certainly not be conclusive.
Meanwhile, a detailed search-ability test on Google, Yahoo or another search engine needs programming tailored to their result pages; specifically speaking, the results are shown in different HTML templates. For example, Figure3.3 and Figure3.5 are the pages returned by Google and Yahoo for the same query, "job search" (the differences in quality are not compared in this section). Although the 2 result pages in Figure3.3 and Figure3.5 show a very similar layout, with the search engine input interface at the top, the main content on the left taking more than 2/3 of the space, and the right 1/3 left to commercial websites as advertisements, the HTML behind the pages uses quite different grammar. This makes extracting the result title, result summary and result URL not a matter of a single general template, but of one extraction algorithm per search engine.
Here is a segment from a Google result page; the texts shown in Figure3.3 are also in the red boxes in Figure3.2.
Figure3.2
Figure3.3
Here is a segment from a Yahoo web search result page.
Figure3.4
Figure3.4 is a segment from the Yahoo result page, and the text shown in Figure3.5 is also in the red square in Figure3.4.
Before parsing the HTML in Figure3.2 and Figure3.4, note that both Google and Yahoo provide easier ways for developers to parse and extract result information, so it is quite necessary to carefully examine their APIs first. Meanwhile, the different kinds of API provided by Google and Yahoo show quite different capabilities in returning result links.
Figure3.5
The project is designed to find the best query (the LS query) that can represent a web page and be searched easily by a search engine. The web pages are picked from the surface web, which is easily accessed without any authorization restrictions. The experiments are arranged in the following general steps:
(1) Find a large number of web pages to be used for LS extraction and re-finding/relocation, download them as test pages to local disk, and save their URLs at the same time. The source of web pages used for the experiments requires careful consideration: among all the kinds of web sites around today, some are ill-formatted or carry poor information, and they are obviously not ideal sources for experiments. The details of web page selection and crawling are non-trivial and will be introduced specifically in section 3.3.1.
(2) HTML parsing and text extraction are needed before extracting LSs. The preprocessing reasons and steps are in section 3.3.2.
(3) Apply the algorithms of chapter 2, which are designed to extract lexical signatures from the downloaded pages. In chapter 4, TF, DF, TFIDF, TF3DF2, TF4DF1, TFIDF3DF2 and TFIDF4DF1, as well as word rank and sentence rank, are applied to the pages after step 2. Differing from S. T. Park's experiments, this project accepts a varying number of terms per query, which generalizes methods such as TF3DF2 and TFIDF3DF2 to longer queries while keeping the ratio between the DF-ordered terms and the TF- or TFIDF-ordered terms unchanged. For convenience, the selections of TF, DF and their mixed forms are listed from (a) to (h) below, as precisely described in S. T. Park's paper; a sketch of coding one such rule follows the list. Although the topic of how many terms help to get better search results was studied and unveiled by Martin and Michael, who claim that 5 terms per query is the most efficient number for getting the desired result into the top 10 from an SE [3], more terms per query is obviously more feasible for sentence rank. Meanwhile, Google raised its web search limit to 32 words 4 years ago, back in 2005. In this project, up to 15 words are allowed per query, with succeeding words being ignored. The performance with different numbers of terms per query is an interesting topic, explored in chapter 4.
a) TF: Select terms in decreasing term frequency (TF) order. If there is a tie, then pick terms based on increasing document frequency (DF). If there is another tie, randomly select the terms [5].
b) DF: Select terms in increasing DF order. If there is a tie, then pick terms based on decreasing TF order. If there is another tie, randomly select the terms [5].
c) TFIDF: Select terms in decreasing term-frequency inverse-document-frequency (TFIDF) order. If there is a tie, then pick terms based on increasing DF order. If there is another tie, randomly select the terms [5].
d) PW: Select terms based on Phelps and Wilensky's [3] method: first, list terms in TF order and then pick the LS in decreasing TFIDF order (i.e., decreasing TFIDF order where the TF term is capped at five). If there is a tie, then pick terms based on increasing DF order. If there is another tie, randomly select the terms [5].
e) TF3DF2: Select two terms in increasing DF order. If there is a tie, then pick terms based on decreasing TF order. If there is another tie, randomly select the terms. Then filter out all terms which have DF value 1. Select three terms maximizing TF. If there is a tie, it is resolved the same way as with the TF method [5].
f) TF4DF1: Select one term based on increasing DF order first. Then filter out all terms which have DF value 1. Select four terms maximizing TF [5].
g) TFIDF3DF2: Select two terms based on increasing DF order first. Then filter out all terms which have DF value 1. Select three terms maximizing TFIDF [5].
h) TFIDF4DF1: Select one term based on increasing DF order first. Then filter out all terms which have DF value 1. Select four terms maximizing TFIDF [5].
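As promised above, here is a sketch of how one of these rules can be coded: rule (e), TF3DF2, generalized to an N-term query with the same 3:2 ratio. It assumes precomputed TF and DF values, and the random tie-breaking is simplified away:

import java.util.*;

public class Tf3Df2Selector {

    static class Term {
        final String word; final int tf; final int df;
        Term(String word, int tf, int df) { this.word = word; this.tf = tf; this.df = df; }
    }

    // Pick n terms: 2n/5 by lowest DF (ties: highest TF); then, after dropping
    // DF == 1 terms, 3n/5 by highest TF (ties: lowest DF), as in rule (e).
    static List<String> select(List<Term> terms, int n) {
        int dfPart = (2 * n) / 5, tfPart = n - dfPart;
        List<Term> pool = new ArrayList<Term>(terms);
        Collections.sort(pool, new Comparator<Term>() {
            public int compare(Term a, Term b) {
                return a.df != b.df ? a.df - b.df : b.tf - a.tf;
            }
        });
        List<String> picked = new ArrayList<String>();
        for (int i = 0; i < dfPart && i < pool.size(); i++) picked.add(pool.get(i).word);
        List<Term> rest = new ArrayList<Term>();
        for (Term t : pool.subList(Math.min(dfPart, pool.size()), pool.size()))
            if (t.df > 1) rest.add(t);           // filter out DF == 1 terms
        Collections.sort(rest, new Comparator<Term>() {
            public int compare(Term a, Term b) {
                return a.tf != b.tf ? b.tf - a.tf : a.df - b.df;
            }
        });
        for (int i = 0; i < tfPart && i < rest.size(); i++) picked.add(rest.get(i).word);
        return picked;
    }
}

With n = 5 this reduces to the original TF3DF2 (2 DF-ordered terms plus 3 TF-ordered terms).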
(4) Select several search engines which support a developer mode. Google web search and news search, and Yahoo web search and news search, are picked as the general search engines in the test. More details are in section 3.2.
(5) Download the result links on the first page returned by the SE, as well as their corresponding HTML pages. If there is a URL match between the test URL and a result URL, a successful retrieval is recorded. If there is no match between the test URL and the first 10 result URLs, then the test page is compared with the result pages from the search engine to see if they have the same topic or content. This is similar to step 2, because both need HTML page parsing, extraction and conversion to understandable pure text without noise from HTML tags. Also, the criterion for deciding whether 2 pages are on the same topic or have similar content is another important issue in this project. It was introduced at the beginning of section 2.5, and the details are discussed in section 3.4.
(6) Repeat steps 2 to 5 on all the test pages downloaded in step 1, count all the successful retrievals, and compute the success retrieval rate defined in 3.1. This is the most straightforward measurement, and it is also adopted in both S. T. Park's and Xiaojun Wang's research. Martin and Michael's LS score is also a good way to evaluate query generation; however, their score is not as straightforward as 3.1.
success retrieval rate = number of successful retrievals / total number of test pages   (3.1)
(7) A stable and reliable LS extraction algorithm should push the success retrieval rate above 90%, the ultimate requirement. The traditional methods, which do not maintain linguistic meaning (with them, for example, after step 3 all terms could be listed in alphabetic order and the query's terms picked from a to z), show quite different results from the graph-based ranking algorithms, which do maintain linguistic meaning. In chapter 4, the results from Google and Yahoo show that both search engines have a strong ability to return the desired pages for a meaningful query, even one including some stop words, rather than for discrete terms simply ordered by TF, IDF and so on. The results and analysis are in chapter 4.
Figure3.1
Figure3.1 is the flow chart showing the overall procedure of the experiments, from step 1 to step 7. The URL store in the top-left rectangle of Figure3.1 is a directory built from a large number of effective URLs that can be accessed online; the corresponding pages are downloaded and saved on local disk.
The algorithms for extracting lexical signatures from a page are in the 3 rectangles at the top right of Figure3.1; they are the core of this project. Another key programming issue is in the 2 rectangles at the bottom of Figure3.1: page content comparison and result recording. The middle rectangles on Figure3.1's left and right form a pipeline showing all the operations.
Similar to Xiaojun Wang's application of word rank to web pages, this project applies sentence rank to web pages. The passages in a web page can be extracted as shown in Figure2.19's red square.
Figure2.19
However, there are conditions under which sentence rank cannot work. Some web pages may not have sentences or passages, which makes sentence rank ineffective when there are only titles or phrases. Take Figure2.20 as an example: there is no passage on Yahoo's home page, and although there are several sentences in the center 3 red squares, each shown in a separate anchor tag, it is difficult to construct connections among those independent sentences, because they come from completely different topics. Meanwhile, there are a bunch of simple words and phrases in the left blue squares, such as "answer", "auto" and "finance". This brings challenges in combining the terms and sentences and in applying sentence rank. Therefore, a page like Figure2.20 is not a typical type to which sentence rank can be applied.
Figure2.20 A typical example of a link-based page
There is a simple way to exclude the pages which are not suitable for sentence rank. A threshold p is defined to separate the pages into 2 categories, link-based pages and content-based pages, using formula 2-11:
link text ratio = (length of text inside anchor tags) / (length of all text)   (2-11)
Pages like Figure2.20 can be classified as link-based pages, which have a high portion of their text inside links; link-based pages are easily found among websites' home and index pages. By contrast, a content-based page, such as Figure2.19, has a high portion of plain text without links.
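A minimal sketch of this classification, under the assumed form of 2-11 above (anchor-text length over total text length); the threshold p and the text extraction are left to the caller:

public class PageTypeClassifier {

    // Assumed form of 2-11: the portion of visible text that sits inside <a> tags.
    static double linkTextRatio(String anchorText, String allText) {
        if (allText.length() == 0) return 0;
        return (double) anchorText.length() / allText.length();
    }

    // Pages above the threshold p are treated as link-based (like Figure2.20)
    // and skipped by sentence rank; the rest are content-based (like Figure2.19).
    static boolean isLinkBased(String anchorText, String allText, double p) {
        return linkTextRatio(anchorText, allText) > p;
    }
}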
In the previous section, a graph-based ranking algorithm on words was introduced. In this section, the vertices of the graph are sentences. Compared to word rank, sentence rank preserves much more linguistic information, and its results are more understandable, because the LS is now composed of a whole sentence rather than of terms and phrases.
Similar to the word rank graph, leaving aside the unweighted graph, there are 2 kinds of graph construction for the sentences of a document: the undirected weighted graph and the directed weighted graph. Moreover, Rada and Paul proposed a further split of the directed weighted graph into the directed forward weighted graph and the directed backward weighted graph [9]. In the undirected weighted graph, the assumption is that every sentence in the document has a connection with all other sentences. In the directed forward weighted graph, every sentence points to all the following sentences in the text, and receives edges from all the previous sentences. In the directed backward weighted graph, every sentence points to all the previous sentences in the text, and receives edges from all the following sentences.
Figure2.17 shows a 5-sentence passage, and Figure2.18 shows the 3 graphs in detail. In Figure2.18 (a), the undirected graph, every sentence is connected with the other 4 sentences by a pair of incoming and outgoing pointers. In Figure2.18 (b), the directed forward graph, sentence 1 points to sentences 2, 3, 4, 5; sentence 2 points to sentences 3, 4, 5; sentence 3 points to sentences 4, 5; sentence 4 points to sentence 5; and sentence 5 has no outgoing pointer. The directed backward graph, Figure2.18 (c), shows (b) in reversed order: sentence 5 points to sentences 1, 2, 3, 4; sentence 4 points to sentences 1, 2, 3; sentence 3 points to sentences 1, 2; sentence 2 points to sentence 1; and sentence 1 has no outgoing pointer.
Figure2.18 (a) undirected, (b) directed forward, (c) directed backward
Therefore, after the direction is decided, the weight of a connection comes from computing the 2 sentences' similarity: as in equation 2-10, the number of terms shared by both sentences is divided by the lengths of the two sentences in log form, where length can simply be taken as the number of terms in a sentence.
Similarity(Si, Sj) = |{Wk : Wk ∈ Si and Wk ∈ Sj}| / ( log(|Si|) + log(|Sj|) )   [10] (2-10)
Generally, Wk is a word appearing in both Si and Sj. Wk can also cover words with the same meaning but different forms, such as "interested" and "interesting", which could finally prove the ability to find similar or related documents. If there is no common word Wk, then the edge between Si and Sj can be removed.
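A small Java sketch of equation 2-10, treating the sentences Si and Sj as word arrays; the lower-casing is an assumption of this sketch about what counts as a shared word:

import java.util.HashSet;
import java.util.Set;

public class SentenceSimilarity {

    // Equation 2-10: shared words divided by the log lengths of both sentences.
    static double similarity(String[] si, String[] sj) {
        Set<String> wordsI = new HashSet<String>();
        for (String w : si) wordsI.add(w.toLowerCase());
        Set<String> shared = new HashSet<String>();
        for (String w : sj) {
            String lw = w.toLowerCase();
            if (wordsI.contains(lw)) shared.add(lw);
        }
        double denom = Math.log(si.length) + Math.log(sj.length);
        if (denom <= 0) return 0;            // degenerate empty or one-word sentences
        return shared.size() / denom;
    }
}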
After iterations of 2-8 and 2-9, the highest-ranked sentences of the text are selected and taken as an abstract. In [9], the 4 highest-ranked sentences are extracted after the iteration, resulting in a summary of about 100 words. Rada and Paul also evaluated the abstracts from this graph ranking algorithm against other methods, such as HITS and the Positional Power Function (POS), using the ROUGE evaluation [15]. It turned out that the sentence rank derived from PageRank provided good performance; the details are not included in this report.
In conclusion, Rada and Paul state that, with this sentence rank algorithm, the abstracts built from the extracted sentences are more informative for the given text, and that the method requires neither deep linguistic knowledge nor domain- or language-specific annotated corpora [10].
Xiaojun Wang applied the word-rank algorithm to LS extraction on web pages for finding lost or related web pages [11]. He also compared the results from this graph-based ranking algorithm with the traditional ways of extracting terms from documents, such as TF, DF, TFIDF, PW, TF3DF2, TF4DF1, TFIDF3DF2 and TFIDF4DF1. He pointed out that with the word rank algorithm, which takes the semantic relation between words into account and chooses the most representative and salient words as lexical signatures, highly relevant web pages can be found even when the desired web pages cannot be retrieved [11].
In his experiments, Wang not only used the basic word-rank algorithm but also combined it with DF to select the terms. In [11], Wang constructed only an undirected weighted graph model G=(V, E): V is the vertex set containing all words except stop words, and E is a set of undirected, weighted edges. Each vertex's initial score is the normalized value, and the damping factor is d=0.85. Wang did not use a window size but WordNet [14] to recognize semantically related words, and he did not mention the value of the weights on the edges. These 2 implementation details certainly involve a large amount of work, but they are not described in the paper "WordRank-Based Lexical Signatures for Finding Lost or Related Web Pages" [11].
WS(Vi) = (1 - d) + d × Σ_{Vj ∈ In(Vi)} ( w_ji / Σ_{Vk ∈ Out(Vj)} w_jk ) × WS(Vj)   [11] (2-8)
| WS_{t+1}(Vi) - WS_t(Vi) | < threshold, for every vertex Vi   [11] (2-9)
Similar to 2-7, with the convergence threshold set to 0.0001, 2-8 is run on the graph until it converges according to 2-9.
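A compact Java sketch of running 2-8 until the 2-9 condition holds, over an edge-weight matrix; the matrix representation is chosen here for brevity and is not taken from Wang's implementation:

public class WordRankIteration {

    // One damped update per vertex (equation 2-8), repeated until the largest
    // score change drops below the threshold (equation 2-9).
    // w[j][i] is the edge weight from vertex j to vertex i; 0 means no edge.
    static double[] rank(double[][] w, double d, double threshold) {
        int n = w.length;
        double[] score = new double[n];
        java.util.Arrays.fill(score, 1.0);       // initial score of 1 per vertex
        double change;
        do {
            change = 0;
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double sum = 0;
                for (int j = 0; j < n; j++) {
                    if (w[j][i] == 0) continue;
                    double out = 0;               // total weight leaving vertex j
                    for (int k = 0; k < n; k++) out += w[j][k];
                    if (out > 0) sum += w[j][i] / out * score[j];
                }
                next[i] = (1 - d) + d * sum;
                change = Math.max(change, Math.abs(next[i] - score[i]));
            }
            score = next;
        } while (change > threshold);
        return score;
    }
}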
In his experiments, Wang selected Google and MSN Search, randomly crawled 2000 URLs from the DMOZ.org domain, and kept 1337 pages after filtering out the unqualified ones (too short in content, or non-HTML formats like .pdf, .ps and .doc). Wang constructed each query with 5 terms, using TF, DF, TFIDF, PW, TF3DF2, TF4DF1, TFIDF3DF2, TFIDF4DF1, WordRank, WordRank3DF2 and WordRank4DF1. Counting unique returns by the SE, top 1 and top 10, the average success rates among the 1337 pages are generally 40%-60% for Google, except for WordRank3DF2, which is a little higher than 60%. Meanwhile, the results from MSN show poor performance on TF, which is lower than 30%, and on TFIDF, which is lower than 40%; the others are between 40% and 60%.
Figure2.15 Retrieval performance of LS from Google search [11]
Figure2.16 Retrieval performance of LS from MSN live search [11]
In the summarization, Wang concluded that DF was the best method for uniquely identifying the desired documents; that TF was easy to compute and did not need updating unless documents were modified; and that TFIDF and the hybrid methods combining TFIDF and DF were good candidates for extracting the desired documents [11]. By computing the average cosine similarity of the top 10 returned pages with the desired page, WordRank-based methods such as WordRank3DF2 proved best for retrieving highly relevant documents [11].
Word-rank is an implementation of the weighted graph ranking algorithm, covering both the undirected weighted (on edges) graph and the directed weighted (on edges) graph, where a single word/term is a vertex and the whole content forms a graph. A window size parameter 'w' is introduced for making connections among vertices. In the undirected weighted graph, each word is connected only with the other words within the window size distance, including the previous w words and the following w words. In the directed weighted graph, each word is connected only with the following words within the window size distance. Take Figure2.14 as an example with the window size set to 2: 'packing' has connections with 'ferocious', 'storm', 'freezing' and 'rain' in the undirected weighted graph, while it only has connections with 'freezing' and 'rain' in the directed weighted graph. The score associated with each vertex is set to an initial value of 1, and the ranking algorithm (2-6 for the undirected weighted graph, 2-7 for the directed weighted graph) is run on the graph repeatedly until it converges, usually for 20-30 iterations, at a threshold of 0.0001 [9].
Figure2.14
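A sketch of the window-based graph construction, using the Figure2.14 words as the example; the adjacency-map representation is an implementation choice of this sketch:

import java.util.HashMap;
import java.util.Map;

public class WindowGraph {

    // Undirected co-occurrence edges: each word is linked to the words within
    // 'window' positions of it; repeated co-occurrences raise the edge weight.
    static Map<String, Map<String, Integer>> build(String[] words, int window) {
        Map<String, Map<String, Integer>> g = new HashMap<String, Map<String, Integer>>();
        for (int i = 0; i < words.length; i++) {
            for (int j = i + 1; j <= i + window && j < words.length; j++) {
                addEdge(g, words[i], words[j]);
                addEdge(g, words[j], words[i]);   // drop this line for a directed graph
            }
        }
        return g;
    }

    static void addEdge(Map<String, Map<String, Integer>> g, String a, String b) {
        Map<String, Integer> nbrs = g.get(a);
        if (nbrs == null) { nbrs = new HashMap<String, Integer>(); g.put(a, nbrs); }
        Integer c = nbrs.get(b);
        nbrs.put(b, c == null ? 1 : c + 1);
    }

    public static void main(String[] args) {
        String[] text = { "ferocious", "storm", "packing", "freezing", "rain" };
        // with window 2, "packing" links to 'ferocious', 'storm', 'freezing', 'rain'
        System.out.println(build(text, 2).get("packing").keySet());
    }
}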
The expected end result of this application is a set of words or phrases that are representative of a natural-language text. The terms to be ranked are therefore sequences of one or more lexical units extracted from the text, and these form the vertices added to the text graph. If two or more ranked terms happen to be neighbors in the text, they can be connected into a key phrase; thus, the language consistency of the content is preserved. Rada and Paul, in their paper "TextRank: Bringing Order into Texts", gave a clear view with this passage: "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types." From it they extracted the following terms as results: "linear constraints", "linear Diophantine equations", "natural numbers", "nonstrict inequations", "strict inequations" and "upper bounds" [9].