String Streams
The stringstream classes are useful for extracting data from strings and for writing formatted data to strings.
A very common question, for example, is this: ``How do I convert a string to a number?''. Of course, there's no way to do so in general, since not all strings look like numbers. But it's certainly not unusual that we'd want to extract a number from a string. Note that very old versions of the g++ compiler shipped without the <sstream> header.
FAQ: How Do I Convert A String To A Number?
#include <iostream>
#include <sstream>
#include <string>
using namespace std;

int main()
{
    int a;
    string s = "456";
    istringstream sin(s);   // an input stream that reads from the string s
    sin >> a;
    cout << a << endl;
    return 0;
}
The istringstream class is an input stream attached to a string. The constructor copies the string s into its private buffer, and then we may extract data from it using regular stream semantics; in this case we use sin >> a.
FAQ: How Do I Convert A Number To A String?
or how do I do ``sprintf'' in C++?
You may have guessed -- the answer is to use an ostringstream. This data type behaves in a similar way to the istringstream. The main difference is that there is an extra method, str(). This method returns the string that lies in the ostringstream object. There's also a version of str() that takes a string argument -- this initialises the stream's underlying string buffer to that argument. This is commonly used to clear a stream for reuse -- one can call mystream.str("");
#include <iostream>
#include <sstream>

int main()
{
    std::ostringstream strout;
    int x = 42;
    strout << "The answer to the question is " << x << std::endl;
    std::cout << strout.str() << std::endl;
    strout.str("");                  // clear the stream for reuse
    strout << 53.2;
    std::cout << strout.str() << std::endl;
    return 0;
}
Using Stringstreams To Parse Input
A problem that often comes up is this: suppose you have written the following code:
#include <iostream>

using namespace std;

int main()
{
    int x;
    do
        cout << "Enter a positive integer (0 to quit)" << endl;
    while ( cin >> x && x != 0 );
    return 0;
}
What happens if the user enters
2 3
Or worse, if they enter:
5.0
The problem is that the extraction operator does not expect each item extracted to be on a separate line. So if you do expect this, you need to be explicit about it in your code. The way to do this is to use getline() to read a line of input into a string, and then use that string to create an istringstream from which we can extract data. After we've extracted the data we need, we can check for trailing garbage. Here's an example:
#include <iostream>
#include <sstream>
#include <string>

using std::cin;
using std::cout;
using std::endl;

int main()
{
    int x;
    char ch;
    std::string myString;
    while (getline( cin, myString ))
    {
        std::istringstream strin(myString);
        strin >> x;
        if (!strin)
            cout << "Bad input" << endl;
        else if ( strin >> ch )          // anything non-whitespace left over?
            cout << "Bad input" << endl;
        else
            cout << "You entered: " << x << endl;
    }
    return 0;
}
Some notes:
* The istringstream object is declared within the loop, so its scope is the block of the loop. A new istringstream object is therefore created, and the old one destroyed, on each iteration of the loop.
* The test (!strin) checks whether the stream is in an error state.
* The attempt to extract a character tests whether there's anything left in the stream after we extract the integer. Note that this ignores whitespace (which is the desired effect in this case).
* If there are no problems, then the extraction was successful.
strstream considered harmful
There exists a deprecated class similar to stringstream called strstream. Do not confuse these two classes; they are not the same. strstream has a very error-prone interface because of the way it handles memory. One problem is that it does not append a trailing null when str() is called, so you must append a trailing null yourself, or insert std::ends. Another important thing to remember is that if you use ostrstream with a dynamic buffer, like this:
std::ostrstream strout;
strout << "The answer is ..." << 42 << std::endl << std::ends;
strout.str();
Then calling str() has the peculiar side effect that the caller becomes responsible for managing the memory allocated for strout's buffer, which is pretty silly since the caller does not know how the memory was allocated (it may have been allocated using malloc() or new). So to make the stupid thing take its memory back, one makes the following call:
strout.freeze(0);
It's worth mentioning that there is another, safer way to use ostrstream: use it with a static buffer. If you do this, you don't need to deal with this freeze() nonsense. To do this, call the constructor that takes a character array as an argument.
char a[100];
std::ostrstream strout(a,100);
strout << "the answer is " << 42 << std::endl << std::ends;
std::cout << a << std::endl;
Here is a basic "Hello World" Spring example, based on the example from "Spring in Action", chapter 2.
Of course, the JDK must be installed, with %JAVA_HOME% set and both %Path% and %Classpath% filled with the correct paths.
Maven2 Installation
Because this is the first example showing how Maven works with Spring, all the mandatory steps are included, without other optional configurations.
Windows 2000/XP/Vista
1. Unzip the distribution archive, i.e. apache-maven-2.0.10-bin.zip to the directory F:\Development\j2eelib\apache_maven\apache-maven-2.1.0
2. Add the M2_HOME environment variable by opening up the system properties (WinKey + Pause), selecting the "Advanced" tab, and the "Environment Variables" button, then adding the M2_HOME variable in the user variables with the value F:\Development\j2eelib\apache_maven\apache-maven-2.1.0
Be sure to omit any quotation marks around the path, even if it contains spaces. Note: for Maven < 2.0.9, also be sure that M2_HOME doesn't have a '\' as its last character.
3. In the same dialog, add the M2 environment variable in the user variables with the value %M2_HOME%\bin.
4. Optional: In the same dialog, add the MAVEN_OPTS environment variable in the user variables to specify JVM properties, e.g. the value -Xms256m -Xmx512m. This environment variable can be used to supply extra options to Maven.
5. In the same dialog, update/create the Path environment variable in the user variables and prepend the value %M2% to make Maven available on the command line.
6. In the same dialog, make sure that JAVA_HOME exists.
7. Open a new command prompt (Winkey + R then type cmd) and run mvn --version to verify that it is correctly installed.
Create First "Helloworld" Spring Project by Maven2.
Step 1.
Open the Windows cmd and go into the directory prepared to be the project's location,
e.g. F:\Development\Java\maven\
mvn archetype:create -DgroupId=com.springinaction.chapter01 -DartifactId=springinaction
Step 2.
Go into the sub-directory springinaction, which was created by the mvn command above: cd springinaction
It's worth mentioning that checking all the available dependencies, and their version numbers, on Maven's website is necessary:
http://mvnrepository.com/artifact/org.springframework/spring
Configure pom.xml first:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.springinaction.chapter01</groupId>
  <artifactId>springinaction</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>springinaction</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>

    <!-- Choose to add each module as a dependency as it's needed. -->
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring</artifactId>
      <version>2.5.6</version>
    </dependency>

    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-aop</artifactId>
      <version>2.5.6</version>
    </dependency>

    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-beans</artifactId>
      <version>2.5.6</version>
    </dependency>

    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-core</artifactId>
      <version>2.5.6</version>
    </dependency>

  </dependencies>
</project>
Edit App.java, which has already been created, or create any new Java files as needed.
package com.springinaction.chapter01;

import org.springframework.beans.factory.BeanFactory;
import org.springframework.beans.factory.xml.XmlBeanFactory;
import org.springframework.core.io.FileSystemResource;

import com.springinaction.chapter01.service.GreetingService;

public class App{
    public static void main( String[] args ){
        BeanFactory factory = new XmlBeanFactory(new FileSystemResource("hello.xml"));

        GreetingService greetingService = (GreetingService) factory.getBean("greetingService");
        greetingService.sayGreeting();
    }
}
Create service directory under com.springinaction.chapter01 and create GreetingService.java:

package com.springinaction.chapter01.service;

public interface GreetingService{
    void sayGreeting();
}
Create serviceimpl directory under com.springinaction.chapter01.service and create GreetingServiceImpl.java:

package com.springinaction.chapter01.service.serviceimpl;

import com.springinaction.chapter01.service.GreetingService;

public class GreetingServiceImpl implements GreetingService{
    private String greeting;
    public GreetingServiceImpl(){}
    public GreetingServiceImpl(String greeting){
        this.greeting = greeting;
    }
    public void sayGreeting(){
        System.out.println(greeting);
    }
    public void setGreeting(String greeting){
        this.greeting = greeting;
    }
}
The directory structure should now look as follows: the directories service and serviceimpl are created manually, while the others are created by Maven automatically.
Create hello.xml under the root directory springinaction:

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="
        http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-2.0.xsd">

    <bean id="greetingService" class="com.springinaction.chapter01.service.serviceimpl.GreetingServiceImpl">
        <property name="greeting" value="Hello World in Spring Bean!"/>
    </bean>

</beans>
Step 3.
In cmd mode, in the root directory springinaction, run:
mvn compile
mvn test
or
mvn package
And then:
java -cp target/springinaction-1.0-SNAPSHOT.jar;F:\Development\j2eelib\spring-framework-2.5.6\dist\spring.jar;F:\Development\j2eelib\spring-framework-2.5.6\lib\jakarta-commons\commons-logging.jar;F:\Development\j2eelib\spring-framework-2.5.6\dist\modules\spring-aop.jar;F:\Development\j2eelib\spring-framework-2.5.6\dist\modules\spring-beans.jar;F:\Development\j2eelib\spring-framework-2.5.6\dist\modules\spring-core.jar com.springinaction.chapter01.App
Notice: change the -cp parameters according to the specific locations in your environment.
It is important to include all the jar files the project needs.
-cp means classpath.
After debugging, a clean is necessary; it deletes all compiled classes, jars and their directories:
mvn clean
Due to the excellent performance of sentence rank, a further experiment was conducted: applying sentence rank to real news web pages. In this section, due to the length of the report, only the undirected graph with 10 terms per query is implemented. The success retrieval rates below show a high percentage when the cosine similarity of 2 web pages is computed using 4.1 and 4.2. "10 terms a query" means taking only the first 10 words of the selected sentence, including stop words, which is consistent with section 4.6. Unlike locating the exact address of a web page itself, this comparison finds documents on a similar topic by comparing 2 web pages with different URLs; the details are all introduced in section 3.4.
[Equations 4.1 and 4.2: the success retrieval rate and the document cosine similarity, as in equations 3.1 and 3-2.]
Meanwhile, there are 3 search engines employed in this section: Yahoo News Search, Yahoo Web Search and Google Web Search. Unlike sections 4.1 to 4.6, which only count a URL string match as a successful retrieval, section 4.7 takes document similarity into consideration: if equation 4.2's value is bigger than 0.9, a criterion also permitted in S. T. Park's and Xiaojun Wang's research, the retrieval is considered a success. There are 183 pages in this section, all from Yahoo News on May 4, 2009, and all the related URL addresses are listed in Appendix D.
Search engine     | Success Counts | Success Rate
Yahoo News Search | 171            | 93.44%
Yahoo Web Search  | 178            | 97.27%
Google Web Search | 177            | 96.72%
Table4.29
Figure4.32 (a) and (b)
As Figure4.32 (b) shows, the success rate is above 90%, which satisfies the project's initial requirement with a single text retrieval method applied.
As stated in chapter 3, sentence rank can significantly improve linguistic summarization compared to the traditional TF or DF methods. Considering the complexity of sentence rank, randomly picking a sentence and taking its first 3 to 15 words, in their original order, as the search query avoids the iterations of the graph-based ranking algorithm. The results below show that even with randomly picked sentences, once the number of terms reaches 10 the performance increases enormously; some rates are higher than 75%, which the previous, carefully designed retrieval algorithms cannot easily achieve.
Random Sentence (terms per query) | Google (count) | Google (rate) | Yahoo (count) | Yahoo (rate)
3       |  88.00 | 39.11% |  90.00 | 40.00%
4       | 114.00 | 50.67% | 102.00 | 45.33%
5       | 134.00 | 59.56% | 124.00 | 55.11%
6       | 150.00 | 66.67% | 144.00 | 64.00%
7       | 155.00 | 68.89% | 137.00 | 60.89%
8       | 162.00 | 72.00% | 154.00 | 68.44%
9       | 161.00 | 71.56% | 144.00 | 64.00%
10      | 168.00 | 74.67% | 151.00 | 67.11%
11      | 168.00 | 74.67% | 151.00 | 67.11%
12      | 170.00 | 75.56% | 168.00 | 74.67%
13      | 172.00 | 76.44% | 168.00 | 74.67%
14      | 171.00 | 76.00% | 169.00 | 75.11%
15      | 175.00 | 77.78% | 174.00 | 77.33%
Average | 152.92 | 67.97% | 144.31 | 64.14%
Table4.24
Figure4.26 Random Sentence Pick from Google and Yahoo results, panels (a) and (b)
The text in HTML's title tag always plays a vital role in web page retrieval. At the beginning of this project, an extensive number of experiments were conducted using the title method. It was believed that the success rate could reach 90% using title text as a query, if the query were composed carefully and properly. Figure4.12 shows that the title method also has good stability with respect to the number of words in a query. It is important to mention that, from Figure4.1 to Figure4.10, although the classic methods have better results, this only means the HTML extraction performs well: it filters out the structural HTML tags and functional scripts, which would be big distractions in the subsequent processing of the target page, because the basic retrieval process is designed only for pure text without structural tags. For example, HTML tags like 'td' and 'tr' would have large term frequencies, and function or variable names in Javascript would cause very low document frequencies, if they were not filtered or removed in the preprocessing step. With the title method, however, it is much easier to extract only the text between <title> and </title>.
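As a small illustration (not code from the project), extracting the title text with a regular expression in Java might look like this; the class and method names are made up for the example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {

    // Pull the text between <title> and </title> out of a raw HTML page.
    static String extractTitle(String html) {
        Pattern p = Pattern.compile("<title[^>]*>(.*?)</title>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Winter storm hits coast</title></head></html>";
        System.out.println(extractTitle(html));   // prints: Winter storm hits coast
    }
}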
Title tag (terms per query) | Google (count) | Google (rate) | Yahoo (count) | Yahoo (rate)
3       |  82.00 | 36.44% |  72    | 32.00%
4       |  91.00 | 40.44% |  86.00 | 38.22%
5       | 111.00 | 49.33% |  94.00 | 41.78%
6       | 116.00 | 51.56% |  99.00 | 44.00%
7       | 116.00 | 51.56% | 102.00 | 45.33%
8       | 115.00 | 51.11% | 102.00 | 45.33%
9       | 115.00 | 51.11% | 101.00 | 44.89%
10      | 115.00 | 51.11% | 101.00 | 44.89%
11      | 115.00 | 51.11% | 102.00 | 45.33%
12      | 117.00 | 52.00% | 102.00 | 45.33%
13      | 118.00 | 52.44% | 103.00 | 45.78%
14      | 126.00 | 56.00% | 111.00 | 49.33%
15      | 127.00 | 56.44% | 112.00 | 49.78%
Average | 112.62 | 50.05% |  99.00 | 44.00%
Table4.11
Figure4.12 Use title terms as search query, panels (a) and (b)
A practical question arising from this project is whether its ability to test general search engines can also be applied to testing deep web search engines. General search engines such as Google and Yahoo are widely approved for their proper results and links. However, many sites may not allow their documents to be indexed, instead allowing the documents to be accessed only through their own search engines; these sites are part of the so-called Deep Web [1][17]. Deep web search engines focus only on their own databases and pages: data or documents that are kept private and cannot be searched by general search engines. Take www.taobao.com as an example. It is an online commercial trading site like www.ebay.com, and Taobao deliberately blocks general search engines such as www.baidu.com and www.google.com from accessing its commodity results, after its negotiations with the big search engine companies broke down. This forces people who want commodity and price information to go directly to Taobao's own search engine interface and browse the result items on Taobao's website. Obviously, search engines like Taobao's are probably developed in-house or by contracted consulting software teams, so their performance is an interesting topic, distinct from the engines generally accepted by the public such as Google and Yahoo. A detailed introduction to the deep web and the implementation of deep web search engines is not part of this project, but the practical value of this project is a feasible way of testing local, small search engines embedded in their own web sites.
If there is a URL match or a content match, a successful retrieval is established. If the URL does not match (URLs change all the time [2][3]), comparison between the original page and the retrieved pages is indispensable, and it is done in 2 ways: manually and automatically. Manually checking all the content of the original and retrieved pages is time-consuming, but it guarantees precise results; in this project, around 200 pages from the data source were picked for manual checking. Rather than relying on brute force, the automatic comparison between the result pages from the search engine and each test page also needs HTML page preprocessing, as in step 2.
3-1
cos(d1, d2) = Σ_w (TF_{w,1} × TF_{w,2}) / ( √(Σ_w TF_{w,1}²) × √(Σ_w TF_{w,2}²) )   (3-2)
In 3-2, TF_w is the word's term frequency in document 1 or document 2. In this project, some necessary removal is applied to the pages, so the comparison between 2 pages focuses only on the main content; all the advertisements, copyright information, sponsors' links and similar material are removed. The task can be described as finding a similar topic within 2 different pages. 3 pairs of example pages are listed from Figure3.22 to Figure3.24. Using the undirected weighted sentence rank algorithm, the highest-ranking sentence is picked, input as a query into the SE, and then compared to the result page.
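For illustration, here is a minimal Java sketch of the TF-vector cosine measure of 3-2, assuming both pages have already been reduced to plain text; the tokenization rule and class names are assumptions of this sketch, not the project's actual code:

import java.util.HashMap;
import java.util.Map;

public class CosineSimilarity {

    // Build a term-frequency map from plain page text (tags already stripped).
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<String, Integer>();
        for (String w : text.toLowerCase().split("\\W+")) {
            if (w.isEmpty()) continue;
            Integer c = tf.get(w);
            tf.put(w, c == null ? 1 : c + 1);
        }
        return tf;
    }

    // Equation 3-2: cosine of the two TF vectors.
    static double cosine(Map<String, Integer> tf1, Map<String, Integer> tf2) {
        double dot = 0, norm1 = 0, norm2 = 0;
        for (Map.Entry<String, Integer> e : tf1.entrySet()) {
            norm1 += (double) e.getValue() * e.getValue();
            Integer other = tf2.get(e.getKey());
            if (other != null) dot += (double) e.getValue() * other;
        }
        for (int v : tf2.values()) norm2 += (double) v * v;
        if (norm1 == 0 || norm2 == 0) return 0;
        return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
    }

    public static void main(String[] args) {
        double sim = cosine(termFrequencies("winter storm hits the coast"),
                            termFrequencies("the winter storm hit the gulf coast"));
        System.out.println(sim);   // counted as a success if above the 0.9 threshold
    }
}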
Figure3.22 (a) and (b) are an example proving the validity of the cosine comparison. The post time is shown in the red circle: Figure3.22 (a) does not show the date but "34 mins ago", while Figure3.22 (b) shows "Mon Mar 2, 11:57pm ET". In fact, Figure3.22 (a) was downloaded in the morning of March 2, 2009, and Figure3.22 (b) at noon on the same day. Apparently, Yahoo news editors keep updating and modifying the same news item, so the later version differs somewhat in content, but both pages are talking about the same issue.
Figure3.22 shows the downloaded HTML file images. (a)'s URL is
http://news.yahoo.com/s/ap/20090302/ap_on_re_us/winter_storm
The retrieved URL is
http://news.yahoo.com/s/ap/20090303/ap_on_re_us/winter_storm_43
Comparing the two URLs, it is obvious that, even for the same content, Yahoo News changes the URL by appending "_43" at the end.
Figure3.22 (a) and (b)
Figure3.23 (a) and (b)
Figure3.23 (a) and (b) are an example of finding a similar-content web page according to a downloaded local page, Figure3.23 (a). Obviously, both pages are talking about the missing NFL players in Florida's Gulf, one of the most popular news stories at the time of this experiment.
Figure3.23 (a)'s URL is
http://news.yahoo.com/s/ap/20090302/ap_on_re_us/missing_boaters_nfl
Figure3.23 (b)'s URL is
http://www.npr.org/templates/story/story.php?storyId=101375823&ft=1&f=1003
The documents' similarity is 98.38% by 3-2.
Figure3.24 (a) and (b) are another example of finding a similar-content web page according to a downloaded local page. They are both talking about children's blood lead levels.
Figure3.24 (a)'s URL is:
http://news.yahoo.com/ /s/ap/20090302/ap_on_bi_go_ec_fi/economy
Figure3.24 (b)'s URL is
http://www.ajc.com/services/content/health/stories/2009/03/02/children_lead_level.html?cxtype=rss&cxsvc=7&cxcat=9
The documents' similarity is 94% by 3-2.
Figure3.24 (a) and (b)
In chapter 2, section 2.1, S. T. Park adopted 5 terms per query. However, a wider range of term numbers is adopted in this project: LS lengths from 3 to 15 are compared against the success rate. Sentence rank takes the first N words of the top-ranked sentence, N from 3 to 15, even including stop words, as the search query, and discards the rest of the sentence. This procedure does not follow the traditional ways of text retrieval; however, the experiments in chapter 5 show even better results when the number of terms is more than 10, compared to the same number of terms used in the traditional ways.
It is necessary at this point to clarify why LS extraction cannot be applied directly to the raw web page, downloaded in full without any parser. First, unlike pure-text information retrieval, web pages have a unique feature, HTML tags, which construct the page template, font format, font size, image insertions and the other components of a fancy appearance. However, these good-looking gadgets are actually sources of distraction and interference when applications try to analyze the pages, because in common sense only the displayed text of a web page is useful. How to transform an HTML page into pure text, by removing all kinds of hidden tags, is a key issue for the following steps and decides the final results. The text in the page must first be fully extracted; meanwhile, the tag information behind the text cannot simply be discarded. For example, in Michal's research, she classified and saved the text into 6 different categories, each carrying a unique weight. Second, link information is also a powerful hint for deciding the unique features of a particular web page. For example, the commercial search engines depend heavily on algorithms like PageRank and authorities-and-hubs, and even for search and retrieval studies of academic papers, the citation rank algorithm is widely accepted. However, unlike academic papers, which contain their citations at the end of each paper as a references chapter, a web page's link information hides in its anchor tags, which requires more complicated data-source preprocessing before LS extraction. Constructing a query by combining the LS with extracted link information, such as the domain the page belongs to, could be another study, but it is not included in this report.
Pros: a simple REST API which can be accessed directly from Javascript or virtually any server-side language; offers up to 1,000 results in chunks of up to 50 at a time; offers results in at least three different formats: XML, JSON, and serialized PHP.
Cons: no access to Google search services, only Yahoo; only offers up to 1,000 results in chunks of up to 50 at a time; a per-IP-address rate limit of 5,000 queries per 24-hour period; absolutely no UI elements, which makes programming harder: anyone who wants to build a client-side approach has to build it almost from scratch. People are also expected to route any links through Yahoo's servers so Yahoo can track traffic.
Figure3.9 Yahoo Web Search API Java snippet
Figure3.9 is an implementation of the Yahoo web search API, with the typical address in the red box. To apply the Yahoo news search API instead, the address in the red square can be changed to:
String se = "http://search.yahooapis.com/NewsSearchService/V1/newsSearch?appid="+uid+"&query="+query+"&results="+total+"&language=en";
Both Google Ajax and the Google Base Data API restrict the number of results, and a client application does not have the free access of manually browsing the web page. It appears that Google checks the User-Agent header to make sure the request comes from a browser it knows about; otherwise it denies access (i.e. Google does not want people doing this from an application unless they use an API service, as in sections 3.2.1 and 3.2.2; it blocks the client's Java program). One workaround was found by simply changing some parts of the Java code: following the piece of code in the red box solves this problem.
Figure3.8
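Figure3.8 itself is not reproduced here; based on the surrounding description, the change it shows is presumably along these lines (a hypothetical sketch, with a made-up User-Agent string):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class BrowserLikeFetch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.google.com/search?q=job+search");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Pretend to be a normal browser so the server does not reject the client.
        conn.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US)");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line);
        }
        in.close();
    }
}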
The Google Base Data API is another option for programmers who want to write client applications that interact with Google Base, with a query like this:
http://base.google.com/base/feeds/snippets?bq=query&key=xxxxxxx
The value of "bq" is the input query, and the value of "key" is a unique ID provided by Google after registering with Google. The result is returned in clean XML format. However, there are still some limits: with the same query, the Google Base Data API returns different results compared to its original web interface.
Google's SOAP Search API would have had the ability to access Google's results, but it has been deprecated since late 2006, and no new license keys have been distributed. Instead of the SOAP API, Google released its new Ajax API for researchers; the search query should look like this:
http://ajax.googleapis.com/ajax/services/search/web?start=1&rsz=large&v=1.0&q="query"
Figure3.6 Google Ajax code snippet
The result is returned in JSON format, which can be parsed by the JSON library from http://www.json.org/java/. By changing the "start=" value, the offset is set easily. However, with the new Google Ajax API, the maximum number of result URLs returned from Google is limited: in 2008 the maximum was 32; currently, the maximum is 64.
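A sketch of reading such a response with the org.json classes mentioned above; the field names responseData, results and url follow the Ajax Search API's JSON layout, and the fetch code is assumed to be as in the earlier snippets:

import org.json.JSONArray;
import org.json.JSONObject;

public class GoogleAjaxParse {
    // json is the raw text returned by the ajax.googleapis.com query above.
    static void printResultUrls(String json) throws Exception {
        JSONObject response = new JSONObject(json);
        JSONArray results = response.getJSONObject("responseData")
                                    .getJSONArray("results");
        for (int i = 0; i < results.length(); i++) {
            // each entry carries the result link plus title and snippet fields
            System.out.println(results.getJSONObject(i).getString("url"));
        }
    }
}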
As 2 of the most powerful search engines, Google and Yahoo have the strongest abilities in searching the surface web, and they also provide all kinds of special search functions, such as web search, news search and image search, familiar to people all over the world. A general experiment exploring search abilities that did not test these 2 search engines would certainly not be conclusive.
Meanwhile, a detailed search-ability test on Google, Yahoo or another search engine needs programming tailored to their result pages; specifically speaking, the results are shown in different HTML templates. For example, Figure3.3 and Figure3.5 are the pages returned by Google and Yahoo for the same query, "job search" (the differences in quality are not compared in this section). Although the 2 result pages in Figure3.3 and Figure3.5 show a very similar layout, with the search engine input interface at the top, the main content on the left taking more than 2/3 of the space, and the right 1/3 left to commercial websites as advertisements, the HTML behind the pages uses quite different grammar. This makes extracting the result title, result summary and result URL not a matter of a single general template, but of one extraction algorithm per search engine.
Here is a segment from a Google result page; the texts shown in Figure3.3 are also in the red boxes in Figure3.2.
Figure3.2
Figure3.3
Here is a segment from a Yahoo web search result page.
Figure3.4
Figure3.4 is a segment from the Yahoo result page, and the text shown in Figure3.5 is also in the red square in Figure3.4.
Before parsing the HTML in Figure3.2 and Figure3.4, note that both Google and Yahoo provide easier ways for developers to parse and extract result information, so it is quite necessary to carefully examine their APIs first. Meanwhile, the different kinds of API provided by Google and Yahoo show quite different capabilities in returning result links.
Figure3.5
The project is designed to find the best query (the LS query) that can represent a web page and be searched easily by a search engine. The web pages are picked from the surface web, which is easily accessed without any authorization restrictions. The experiments are arranged in the following general steps:
(1) Find a large number of web pages to be used for LS extraction and re-finding/relocation, download them as test pages to local disk, and save their URLs at the same time. The source of web pages used for the experiments requires careful consideration: among all the kinds of web sites around today, some are ill-formatted or carry poor information, and they are obviously not ideal sources for experiments. The details of web page selection and crawling are non-trivial and will be introduced specifically in section 3.3.1.
(2) HTML parsing and text extraction are needed before extracting LSs. The preprocessing reasons and steps are in section 3.3.2.
(3) Apply the algorithms of chapter 2, which are designed to extract lexical signatures from the downloaded pages. In chapter 4, TF, DF, TFIDF, TF3DF2, TF4DF1, TFIDF3DF2 and TFIDF4DF1, as well as word rank and sentence rank, are applied to the pages after step 2. Differing from S. T. Park's experiments, this project accepts a varying number of terms per query, which generalizes methods such as TF3DF2 and TFIDF3DF2 to longer queries while keeping the ratio between the DF-ordered terms and the TF- or TFIDF-ordered terms unchanged. For convenience, the selections of TF, DF and their mixed forms are listed from (a) to (h) below, as precisely described in S. T. Park's paper; a sketch of coding one such rule follows the list. Although the topic of how many terms help to get better search results was studied and unveiled by Martin and Michael, who claim that 5 terms per query is the most efficient number for getting the desired result into the top 10 from an SE [3], more terms per query is obviously more feasible for sentence rank. Meanwhile, Google raised its web search limit to 32 words 4 years ago, back in 2005. In this project, up to 15 words are allowed per query, with succeeding words being ignored. The performance with different numbers of terms per query is an interesting topic, explored in chapter 4.
a) TF: Select terms in decreasing term frequency (TF) order. If there is a tie, then pick terms based on increasing document frequency (DF). If there is another tie, randomly select the terms [5].
b) DF: Select terms in increasing DF order. If there is a tie, then pick terms based on decreasing TF order. If there is another tie, randomly select the terms [5].
c) TFIDF: Select terms in decreasing term-frequency inverse-document-frequency (TFIDF) order. If there is a tie, then pick terms based on increasing DF order. If there is another tie, randomly select the terms [5].
d) PW: Select terms based on Phelps and Wilensky's [3] method: first, list terms in TF order and then pick the LS in decreasing TFIDF order (i.e., decreasing TFIDF order where the TF term is capped at five). If there is a tie, then pick terms based on increasing DF order. If there is another tie, randomly select the terms [5].
e) TF3DF2: Select two terms in increasing DF order. If there is a tie, then pick terms based on decreasing TF order. If there is another tie, randomly select the terms. Then filter out all terms which have DF value 1. Select three terms maximizing TF. If there is a tie, it is resolved the same way as with the TF method [5].
f) TF4DF1: Select one term based on increasing DF order first. Then filter out all terms which have DF value 1. Select four terms maximizing TF [5].
g) TFIDF3DF2: Select two terms based on increasing DF order first. Then filter out all terms which have DF value 1. Select three terms maximizing TFIDF [5].
h) TFIDF4DF1: Select one term based on increasing DF order first. Then filter out all terms which have DF value 1. Select four terms maximizing TFIDF [5].
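As promised above, here is a sketch of how one of these rules can be coded: rule (e), TF3DF2, generalized to an N-term query with the same 3:2 ratio. It assumes precomputed TF and DF values, and the random tie-breaking is simplified away:

import java.util.*;

public class Tf3Df2Selector {

    static class Term {
        final String word; final int tf; final int df;
        Term(String word, int tf, int df) { this.word = word; this.tf = tf; this.df = df; }
    }

    // Pick n terms: 2n/5 by lowest DF (ties: highest TF); then, after dropping
    // DF == 1 terms, 3n/5 by highest TF (ties: lowest DF), as in rule (e).
    static List<String> select(List<Term> terms, int n) {
        int dfPart = (2 * n) / 5, tfPart = n - dfPart;
        List<Term> pool = new ArrayList<Term>(terms);
        Collections.sort(pool, new Comparator<Term>() {
            public int compare(Term a, Term b) {
                return a.df != b.df ? a.df - b.df : b.tf - a.tf;
            }
        });
        List<String> picked = new ArrayList<String>();
        for (int i = 0; i < dfPart && i < pool.size(); i++) picked.add(pool.get(i).word);
        List<Term> rest = new ArrayList<Term>();
        for (Term t : pool.subList(Math.min(dfPart, pool.size()), pool.size()))
            if (t.df > 1) rest.add(t);           // filter out DF == 1 terms
        Collections.sort(rest, new Comparator<Term>() {
            public int compare(Term a, Term b) {
                return a.tf != b.tf ? b.tf - a.tf : a.df - b.df;
            }
        });
        for (int i = 0; i < tfPart && i < rest.size(); i++) picked.add(rest.get(i).word);
        return picked;
    }
}

With n = 5 this reduces to the original TF3DF2 (2 DF-ordered terms plus 3 TF-ordered terms).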
(4) Select several search engines which support a developer mode. Google web search and news search, and Yahoo web search and news search, are picked as the general search engines in the test. More details are in section 3.2.
(5) Download the result links on the first page returned by the SE, as well as their corresponding HTML pages. If there is a URL match between the test URL and a result URL, a successful retrieval is recorded. If there is no match between the test URL and the first 10 result URLs, then the test page is compared with the result pages from the search engine to see if they have the same topic or content. This is similar to step 2, because both need HTML page parsing, extraction and conversion to understandable pure text without noise from HTML tags. Also, the criterion for deciding whether 2 pages are on the same topic or have similar content is another important issue in this project. It was introduced at the beginning of section 2.5, and the details are discussed in section 3.4.
(6) Repeat steps 2 to 5 on all the test pages downloaded in step 1, count all the successful retrievals, and compute the success retrieval rate defined in 3.1. This is the most straightforward measurement, and it is also adopted in both S. T. Park's and Xiaojun Wang's research. Martin and Michael's LS score is also a good way to evaluate query generation; however, their score is not as straightforward as 3.1.
success retrieval rate = number of successful retrievals / total number of test pages   (3.1)
(7) A stable and reliable LS extraction algorithm should push the success retrieval rate above 90%, the ultimate requirement. The traditional methods, which do not maintain linguistic meaning (with them, for example, after step 3 all terms could be listed in alphabetic order and the query's terms picked from a to z), show quite different results from the graph-based ranking algorithms, which do maintain linguistic meaning. In chapter 4, the results from Google and Yahoo show that both search engines have a strong ability to return the desired pages for a meaningful query, even one including some stop words, rather than for discrete terms simply ordered by TF, IDF and so on. The results and analysis are in chapter 4.
Figure3.1
Figure3.1 is the flow chart showing the overall procedure of the experiments, from step 1 to step 7. The URL store in the top-left rectangle of Figure3.1 is a directory built from a large number of effective URLs that can be accessed online; the corresponding pages are downloaded and saved on local disk.
The algorithms for extracting lexical signatures from a page are in the 3 rectangles at the top right of Figure3.1; they are the core of this project. Another key programming issue is in the 2 rectangles at the bottom of Figure3.1: page content comparison and result recording. The middle rectangles on Figure3.1's left and right form a pipeline showing all the operations.
Similar to Xiaojun Wang's application of word rank to web pages, this project applies sentence rank to web pages. The passages in a web page can be extracted as shown in Figure2.19's red square.
Figure2.19
However, there are conditions under which sentence rank cannot work. Some web pages may not have sentences or passages, which makes sentence rank ineffective when there are only titles or phrases. Take Figure2.20 as an example: there is no passage on Yahoo's home page, and although there are several sentences in the center 3 red squares, each shown in a separate anchor tag, it is difficult to construct connections among those independent sentences, because they come from completely different topics. Meanwhile, there are a bunch of simple words and phrases in the left blue squares, such as "answer", "auto" and "finance". This brings challenges in combining the terms and sentences and in applying sentence rank. Therefore, a page like Figure2.20 is not a typical type to which sentence rank can be applied.
Figure2.20 A typical example of a link-based page
There is a simple way to exclude the pages which are not suitable for sentence rank. A threshold p is defined to separate the pages into 2 categories, link-based pages and content-based pages, using formula 2-11:
link text ratio = (length of text inside anchor tags) / (length of all text)   (2-11)
Pages like Figure2.20 can be classified as link-based pages, which have a high portion of their text inside links; link-based pages are easily found among websites' home and index pages. By contrast, a content-based page, such as Figure2.19, has a high portion of plain text without links.
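A minimal sketch of this classification, under the assumed form of 2-11 above (anchor-text length over total text length); the threshold p and the text extraction are left to the caller:

public class PageTypeClassifier {

    // Assumed form of 2-11: the portion of visible text that sits inside <a> tags.
    static double linkTextRatio(String anchorText, String allText) {
        if (allText.length() == 0) return 0;
        return (double) anchorText.length() / allText.length();
    }

    // Pages above the threshold p are treated as link-based (like Figure2.20)
    // and skipped by sentence rank; the rest are content-based (like Figure2.19).
    static boolean isLinkBased(String anchorText, String allText, double p) {
        return linkTextRatio(anchorText, allText) > p;
    }
}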
In the previous section, a graph-based ranking algorithm on words was introduced. In this section, the vertices of the graph are sentences. Compared to word rank, sentence rank preserves much more linguistic information, and its results are more understandable, because the LS is now composed of a whole sentence rather than of terms and phrases.
Similar to the word rank graph, leaving aside the unweighted graph, there are 2 kinds of graph construction for the sentences of a document: the undirected weighted graph and the directed weighted graph. Moreover, Rada and Paul proposed a further split of the directed weighted graph into the directed forward weighted graph and the directed backward weighted graph [9]. In the undirected weighted graph, the assumption is that every sentence in the document has a connection with all other sentences. In the directed forward weighted graph, every sentence points to all the following sentences in the text, and receives edges from all the previous sentences. In the directed backward weighted graph, every sentence points to all the previous sentences in the text, and receives edges from all the following sentences.
Figure2.17 shows a 5-sentence passage, and Figure2.18 shows the 3 graphs in detail. In Figure2.18 (a), the undirected graph, every sentence is connected with the other 4 sentences by a pair of incoming and outgoing pointers. In Figure2.18 (b), the directed forward graph, sentence 1 points to sentences 2, 3, 4, 5; sentence 2 points to sentences 3, 4, 5; sentence 3 points to sentences 4, 5; sentence 4 points to sentence 5; and sentence 5 has no outgoing pointer. The directed backward graph, Figure2.18 (c), shows (b) in reversed order: sentence 5 points to sentences 1, 2, 3, 4; sentence 4 points to sentences 1, 2, 3; sentence 3 points to sentences 1, 2; sentence 2 points to sentence 1; and sentence 1 has no outgoing pointer.
Figure2.18 (a) undirected, (b) directed forward, (c) directed backward
Therefore, after the direction is decided, the weight of a connection comes from computing the 2 sentences' similarity: as in equation 2-10, the number of terms shared by both sentences is divided by the lengths of the two sentences in log form, where length can simply be taken as the number of terms in a sentence.
Similarity(Si, Sj) = |{Wk : Wk ∈ Si and Wk ∈ Sj}| / ( log(|Si|) + log(|Sj|) )   [10] (2-10)
Generally, Wk is a word appearing in both Si and Sj. Wk can also cover words with the same meaning but different forms, such as "interested" and "interesting", which could finally prove the ability to find similar or related documents. If there is no common word Wk, then the edge between Si and Sj can be removed.
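A small Java sketch of equation 2-10, treating the sentences Si and Sj as word arrays; the lower-casing is an assumption of this sketch about what counts as a shared word:

import java.util.HashSet;
import java.util.Set;

public class SentenceSimilarity {

    // Equation 2-10: shared words divided by the log lengths of both sentences.
    static double similarity(String[] si, String[] sj) {
        Set<String> wordsI = new HashSet<String>();
        for (String w : si) wordsI.add(w.toLowerCase());
        Set<String> shared = new HashSet<String>();
        for (String w : sj) {
            String lw = w.toLowerCase();
            if (wordsI.contains(lw)) shared.add(lw);
        }
        double denom = Math.log(si.length) + Math.log(sj.length);
        if (denom <= 0) return 0;            // degenerate empty or one-word sentences
        return shared.size() / denom;
    }
}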
After iterations of 2-8 and 2-9, the highest-ranked sentences of the text are selected and taken as an abstract. In [9], the 4 highest-ranked sentences are extracted after the iteration, resulting in a summary of about 100 words. Rada and Paul also evaluated the abstracts from this graph ranking algorithm against other methods, such as HITS and the Positional Power Function (POS), using the ROUGE evaluation [15]. It turned out that the sentence rank derived from PageRank provided good performance; the details are not included in this report.
In conclusion, Rada and Paul state that, with this sentence rank algorithm, the abstracts built from the extracted sentences are more informative for the given text, and that the method requires neither deep linguistic knowledge nor domain- or language-specific annotated corpora [10].
Xiaojun Wang applied the word-rank algorithm to LS extraction on web pages for finding lost or related web pages [11]. He also compared the results from this graph-based ranking algorithm with the traditional ways of extracting terms from documents, such as TF, DF, TFIDF, PW, TF3DF2, TF4DF1, TFIDF3DF2 and TFIDF4DF1. He pointed out that with the word rank algorithm, which takes the semantic relation between words into account and chooses the most representative and salient words as lexical signatures, highly relevant web pages can be found even when the desired web pages cannot be retrieved [11].
In his experiments, Wang not only used the basic word-rank algorithm but also combined it with DF to select the terms. In [11], Wang constructed only an undirected weighted graph model G=(V, E): V is the vertex set containing all words except stop words, and E is a set of undirected, weighted edges. Each vertex's initial score is the normalized value, and the damping factor is d=0.85. Wang did not use a window size but WordNet [14] to recognize semantically related words, and he did not mention the value of the weights on the edges. These 2 implementation details certainly involve a large amount of work, but they are not described in the paper "WordRank-Based Lexical Signatures for Finding Lost or Related Web Pages" [11].
WS(Vi) = (1 - d) + d × Σ_{Vj ∈ In(Vi)} ( w_ji / Σ_{Vk ∈ Out(Vj)} w_jk ) × WS(Vj)   [11] (2-8)
| WS_{t+1}(Vi) - WS_t(Vi) | < threshold, for every vertex Vi   [11] (2-9)
Similar to 2-7, with the convergence threshold set to 0.0001, 2-8 is run on the graph until it converges according to 2-9.
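A compact Java sketch of running 2-8 until the 2-9 condition holds, over an edge-weight matrix; the matrix representation is chosen here for brevity and is not taken from Wang's implementation:

public class WordRankIteration {

    // One damped update per vertex (equation 2-8), repeated until the largest
    // score change drops below the threshold (equation 2-9).
    // w[j][i] is the edge weight from vertex j to vertex i; 0 means no edge.
    static double[] rank(double[][] w, double d, double threshold) {
        int n = w.length;
        double[] score = new double[n];
        java.util.Arrays.fill(score, 1.0);       // initial score of 1 per vertex
        double change;
        do {
            change = 0;
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double sum = 0;
                for (int j = 0; j < n; j++) {
                    if (w[j][i] == 0) continue;
                    double out = 0;               // total weight leaving vertex j
                    for (int k = 0; k < n; k++) out += w[j][k];
                    if (out > 0) sum += w[j][i] / out * score[j];
                }
                next[i] = (1 - d) + d * sum;
                change = Math.max(change, Math.abs(next[i] - score[i]));
            }
            score = next;
        } while (change > threshold);
        return score;
    }
}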
In his experiments, Wang selected Google and MSN Search, randomly crawled 2000 URLs from the DMOZ.org domain, and kept 1337 pages after filtering out the unqualified ones (too short in content, or non-HTML formats like .pdf, .ps and .doc). Wang constructed each query with 5 terms, using TF, DF, TFIDF, PW, TF3DF2, TF4DF1, TFIDF3DF2, TFIDF4DF1, WordRank, WordRank3DF2 and WordRank4DF1. Counting unique returns by the SE, top 1 and top 10, the average success rates among the 1337 pages are generally 40%-60% for Google, except for WordRank3DF2, which is a little higher than 60%. Meanwhile, the results from MSN show poor performance on TF, which is lower than 30%, and on TFIDF, which is lower than 40%; the others are between 40% and 60%.
Figure2.15 Retrieval performance of LS from Google search [11]
Figure2.16 Retrieval performance of LS from MSN live search [11]
In the summarization, Wang concluded that DF was the best method for uniquely identifying the desired documents; that TF was easy to compute and did not need updating unless documents were modified; and that TFIDF and the hybrid methods combining TFIDF and DF were good candidates for extracting the desired documents [11]. By computing the average cosine similarity of the top 10 returned pages with the desired page, WordRank-based methods such as WordRank3DF2 proved best for retrieving highly relevant documents [11].
Word-rank is an implementation of the weighted graph ranking algorithm, covering both the undirected weighted (on edges) graph and the directed weighted (on edges) graph, where a single word/term is a vertex and the whole content forms a graph. A window size parameter 'w' is introduced for making connections among vertices. In the undirected weighted graph, each word is connected only with the other words within the window size distance, including the previous w words and the following w words. In the directed weighted graph, each word is connected only with the following words within the window size distance. Take Figure2.14 as an example with the window size set to 2: 'packing' has connections with 'ferocious', 'storm', 'freezing' and 'rain' in the undirected weighted graph, while it only has connections with 'freezing' and 'rain' in the directed weighted graph. The score associated with each vertex is set to an initial value of 1, and the ranking algorithm (2-6 for the undirected weighted graph, 2-7 for the directed weighted graph) is run on the graph repeatedly until it converges, usually for 20-30 iterations, at a threshold of 0.0001 [9].
Figure2.14
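A sketch of the window-based graph construction, using the Figure2.14 words as the example; the adjacency-map representation is an implementation choice of this sketch:

import java.util.HashMap;
import java.util.Map;

public class WindowGraph {

    // Undirected co-occurrence edges: each word is linked to the words within
    // 'window' positions of it; repeated co-occurrences raise the edge weight.
    static Map<String, Map<String, Integer>> build(String[] words, int window) {
        Map<String, Map<String, Integer>> g = new HashMap<String, Map<String, Integer>>();
        for (int i = 0; i < words.length; i++) {
            for (int j = i + 1; j <= i + window && j < words.length; j++) {
                addEdge(g, words[i], words[j]);
                addEdge(g, words[j], words[i]);   // drop this line for a directed graph
            }
        }
        return g;
    }

    static void addEdge(Map<String, Map<String, Integer>> g, String a, String b) {
        Map<String, Integer> nbrs = g.get(a);
        if (nbrs == null) { nbrs = new HashMap<String, Integer>(); g.put(a, nbrs); }
        Integer c = nbrs.get(b);
        nbrs.put(b, c == null ? 1 : c + 1);
    }

    public static void main(String[] args) {
        String[] text = { "ferocious", "storm", "packing", "freezing", "rain" };
        // with window 2, "packing" links to 'ferocious', 'storm', 'freezing', 'rain'
        System.out.println(build(text, 2).get("packing").keySet());
    }
}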
The expected end result of this application is a set of words or phrases that are representative of a natural-language text. The terms to be ranked are therefore sequences of one or more lexical units extracted from the text, and these form the vertices added to the text graph. If two or more ranked terms happen to be neighbors in the text, they can be connected into a key phrase; thus, the language consistency of the content is preserved. Rada and Paul, in their paper "TextRank: Bringing Order into Texts", gave a clear view with this passage: "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types." From it they extracted the following terms as results: "linear constraints", "linear Diophantine equations", "natural numbers", "nonstrict inequations", "strict inequations" and "upper bounds" [9].