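Microsoft has never given up competing in search and has been quietly vying with Google all along. Although Live Search is something of a joke among internal employees, the boss keeps pouring money into it without hesitation.

To be honest, I try to use Microsoft products wherever I can. I gave up Linux as my operating system and Perl and Java as my development tools, though admittedly that was because of work. For maps I used to use MapQuest and have switched to Live Maps; for browsing I dropped Firefox for IE8. Wherever a Microsoft product is usable, I switch to it. The search engine, however, felt really bad: the results were never what I wanted, and even ten-odd pages in I found nothing useful. So I quietly set Google as my default engine. One colleague went even further than I did and replaced Outlook's search with Google Desktop.

Then, in early March, a new search engine called Kumo (酷摸?) was released internally. Supposedly the name "live" was the problem: if you don't believe it, read it backwards and see what you get. I didn't think a mere name change meant anything, but I couldn't resist trying it, and it was indeed somewhat better than the old one. When I have nothing to do I give Kumo a spin now and then.

Today Ballmer announced yet another new search engine, called Bing. How does that sound? To me it reads like the Chinese word 病 (bìng, "sick"). And it is not called a Search Engine but a Decision Engine, quite a trendy concept. I'm not sure why this name was chosen (according to Ballmer, because it is short and easy to remember), but going from a Japanese name to a Chinese one feels to me like a win for Qi Lu (陆奇) after taking over as head of Search. I remember that a couple of days ago the Search home page started using a landscape photo of Yangshuo, China, taken by an employee. Whether or not that guess is right, the new engine deserved a try. Some troublemaker immediately searched for “六四”, and the results were all about the college CET-4/CET-6 English exams, which left people in a cold sweat. Not even publicly released, and the PR is already this well done.

Even more awkward: to celebrate the new release, everyone in the Search group got a T-shirt. Supposedly the front says "I Bing" and the back says "U Bing", which in Chinese sounds like "I'm sick, and you're sick too". The Search folks don't see it that way, though, because they gave Bing the Chinese name 必应 (bì yìng, "sure to respond"). Is that any better than 谷歌 (Google's Chinese name)?

The troublemakers in other groups were less friendly: after testing it for a while, they affectionately dubbed the "bing" search engine Mr. Bean.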

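As I said earlier, "turning a lot of software into web-based applications is a trend of Web 3.0". Technically, these web-based applications differ from software installed on the local disk. More precisely, a website or application that provides a service can be seen as an object hosted by the browser; the browser is merely a container that supports many kinds of objects, while the service behind each object is software deployed on web servers.



In fact, saying that software goes web-based, that is, becomes object models the browser can host, covers only part of the picture: it only describes how the software is presented. That makes it easy to overlook how the data is stored, and to assume that with such web-based services we mostly consume data on the network or in search engines. We no longer need to download software that takes up disk space; with Internet TV we need not download films, or even music. Our own data, such as email, blogs, magazine subscriptions and bookmarked information, also sits on the servers of various websites without ever being downloaded.

We seem to have grown used to being online and to have forgotten the offline era. Google, always keen to be different, seems to have found a need to go back: the recently launched Google Gears. It is a browser plug-in through which we can download data to the local disk. It provides a small database engine (SQLite) that stores, indexes and searches the data locally, and it offers an interface for synchronizing data in the background without tying up the browser.

At present the Google Gears API is used in Google Reader: users can download the feeds they subscribe to onto the local disk, which makes them easier to organize and keep.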


I have a question. When I use the sort command on Linux to sort some domain names in dictionary order, no matter which option I use, it sorts some domains like this:


I am curious which comparison function sort uses. I assumed it would be a plain string comparison like strcmp, but it is not: strcmp compares the ASCII codes of the characters one by one, so the result above should look like this:

One guess is that special characters such as "." and "-" are skipped when sorting names. But that still does not explain the order of the following names:

How can Linux sort keep this order? If it skips some special characters, the names above should compare as equal and could come out in an arbitrary order.

Confused. Has anybody thought about this?


I haven't updated this blog for quite a long time, because I am back to programming in C under Linux and I believe this is a place for Java programmers.



Linux sort compares strings by their Unicode values … more about Unicode is here
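One way to see the difference between the two kinds of ordering is to compare a plain code-unit sort with a locale-aware one. As far as I know, GNU sort uses the collation rules of the current locale (LC_COLLATE) unless it is run with LC_ALL=C, in which case it falls back to byte order like strcmp. The sketch below is in Java rather than C, the class name and the domain names are made up for illustration, and what the locale-aware order looks like depends on the collation rules that ship with the runtime.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class SortOrderDemo {
    // Code-unit order: for ASCII input this matches strcmp() and `LC_ALL=C sort`.
    static List<String> byteOrder(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        Collections.sort(copy); // String.compareTo compares UTF-16 code units
        return copy;
    }

    // Locale-aware order: the result depends on the collation rules of the locale.
    static List<String> localeOrder(List<String> names, Locale locale) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        // Hypothetical names, not the ones from the original question.
        List<String> names = Arrays.asList("ab.example", "a.b.example", "a-b.example");
        System.out.println("byte order:   " + byteOrder(names));
        System.out.println("locale order: " + localeOrder(names, Locale.US));
    }
}
```

In byte order "-" (45) sorts before "." (46), which sorts before any letter, so punctuation always decides the comparison; a collator may weigh punctuation differently.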

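As the amount of information on the web keeps growing, people's study and work depend more and more on web search engines (some everyday examples are mentioned in the post "Google 今天8岁" (Google turns 8 today)).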




Recently Google launched a product, the custom search service, that meets this need.







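Of course, you need a Google account (if you don't have one, no problem: just register with your email, it's easy).
That way you become a member of this search engine. Whenever you come across a good website with plenty of information, you can add it to Blog Digger's site list. You can also add search entries for searches you are interested in.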

Not sure if it is a bug in (Http)URLConnection, but for some URLs it sometimes hangs when calling any method that reads information from the connection (including getResponseCode, getInputStream, getContent, getContentLength, getHeaderField and so on) after the connection has been established, even though I have set both the read timeout and the connect timeout.

The methods openConnection() and connect() are fine, so I am curious about this problem.

Has anybody had the same or a similar problem with URLConnection?
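For reference, here is a minimal sketch of how the two timeouts are normally set and how a read timeout shows up. It does not explain the hang described above; it only shows the expected behaviour against a server that never answers. The class and method names are hypothetical, and the "silent server" is just a local socket that is bound but never replies.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;

class TimeoutProbe {
    // Returns the HTTP status code, or -1 if the server did not answer in time.
    static int statusOrMinusOne(String address, int timeoutMillis) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setConnectTimeout(timeoutMillis); // limit for establishing the TCP connection
        conn.setReadTimeout(timeoutMillis);    // limit for each blocking read of the response
        try {
            return conn.getResponseCode();
        } catch (SocketTimeoutException e) {
            return -1;
        } finally {
            conn.disconnect();
        }
    }

    // Demo: a local socket that is bound (so connecting succeeds) but never sends a reply.
    static int probeSilentServer(int timeoutMillis) {
        try (ServerSocket silent = new ServerSocket(0)) {
            String address = "http://127.0.0.1:" + silent.getLocalPort() + "/";
            return statusOrMinusOne(address, timeoutMillis);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(probeSilentServer(500));
    }
}
```

Both timeouts have to be set before the first call that touches the network (connect(), getResponseCode(), getInputStream() and so on); setting them afterwards has no effect on a request already in flight.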
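Ajax (Asynchronous JavaScript and XML) is a web technique that has become popular in recent years. I have seen people on BlogJava start to introduce Ajax, but the posts tend to stay at the level of concepts and theory, which does not make much sense to beginners who actually want to use it. In my view, when learning any new technology, examples and steps are the two things that make the most sense.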


0. Guide

2. Basic steps for using Ajax (simple example --> Demo)
3. Another example (Google Suggest) (Demo)
4. Homework :)


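In my view Ajax is more like a simple web framework: it describes how to make the data presented on the front end interact efficiently with the data on the back end. Basically, the browser provides an XMLHttpRequest object (an ActiveXObject in IE, of course) that sends HTTP requests to a back-end script or servlet class, takes text data from the response (XML, or the JSON format people have been discussing lately) and embeds it into the front-end page or script.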




Step 1: form code. It accepts input on the front end and, through an action method (whose function creates the XMLHttpRequest object), sends the request to the back end.

<input id="username" name="username" type="text"
  onblur="checkName(this.value,'')" />
<span class="hidden" id="nameCheckFailed">
  This name is in use, please try another.
</span>

<script language="javascript">
function checkName(input, response) {
  if (response != '') {
    // Response mode: show or hide the warning
    var message = document.getElementById('nameCheckFailed');
    if (response == '1') {
      message.className = 'error';
    } else {
      message.className = 'hidden';
    }
  } else {
    // Input mode: ask the server whether the name is taken
    var url = 'http://localhost/xml/checkUserName.php?q=' + encodeURIComponent(input);
    loadXMLDoc(url);
  }
}

var req;

function loadXMLDoc(url) {
    // branch for native XMLHttpRequest object
    if (window.XMLHttpRequest) {
        req = new XMLHttpRequest();
        req.onreadystatechange = processReqChange;
        req.open("GET", url, true);
        req.send(null);
    // branch for IE/Windows ActiveX version
    } else if (window.ActiveXObject) {
        req = new ActiveXObject("Microsoft.XMLHTTP");
        if (req) {
            req.onreadystatechange = processReqChange;
            req.open("GET", url, true);
            req.send();
        }
    }
}
</script>

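1. The form here is just an input box. The action is onblur, i.e. it responds to the lose-focus event and calls the function checkName, which sends a request through XMLHttpRequest to the PHP server script (as you can see, the script file is checkUserName.php and its only parameter is q; note that the code above issues a GET request rather than a POST).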
        var req;
        function foo() {
            req = false;

            // branch for native XMLHttpRequest object
            if (window.XMLHttpRequest) {
                try { req = new XMLHttpRequest(); } catch (e) { req = false; }
            } else if (window.ActiveXObject) { // branch for IE/Windows ActiveX version
                try {
                    req = new ActiveXObject("Msxml2.XMLHTTP");
                } catch (e) {
                    try { req = new ActiveXObject("Microsoft.XMLHTTP"); } catch (e2) { req = false; }
                }
            }
            if (req) {
                // do something here
            }
        }


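Step 2: response handling code. The XMLHttpRequest object has a property that works like a message handler: setting req.onreadystatechange tells XMLHttpRequest which function will process the text returned by the server.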
req.onreadystatechange = processReqChange;

function processReqChange() {
    // only if req shows "complete"
    if (req.readyState == 4) {
        // only if "OK"
        if (req.status == 200) {
            // ...processing statements go here...
            processResponse();
        } else {
            alert("There was a problem retrieving the XML data:\n" + req.statusText);
        }
    }
}

function processResponse() {
    var response = req.responseXML.documentElement;
    // read the text inside <method> and <result>, not the element objects themselves
    var method = response.getElementsByTagName('method')[0].firstChild.data;
    var result = response.getElementsByTagName('result')[0].firstChild.data;
    eval(method + '(\'\', result)');
}

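1. The processReqChange function is essentially standard boilerplate.

Step 3: back-end code (a PHP server script in this example). It accepts the request from the front end, processes its parameters, and returns the corresponding result.

File name: checkUserName.php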

<?php
header('Content-Type: text/xml');

function nameInUse($q)
{
  if (isset($q)) {
    switch ($q) {
      case 'drew':
          return '1';
      case 'fred':
          return '1';
      default:
          return '0';
    }
  } else {
    return '0';
  }
}
?>
<?php echo '<?xml version="1.0" encoding="UTF-8"  standalone="yes"?>'; ?>
<response>
  <method>checkName</method>
  <result><?php echo nameInUse($_GET['q']) ?></result>
</response>
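For readers who would rather try the back end in Java, the same logic can be sketched without any servlet API. The class name is hypothetical, the list of taken names simply mirrors the two names in the PHP script, and the response element names (method, result) are the ones processResponse() reads on the client. Wiring this into a servlet's doGet and setting the text/xml content type is left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class CheckUserName {
    // Names treated as taken; mirrors the 'drew' and 'fred' cases of the PHP script.
    private static final Set<String> TAKEN = new HashSet<>(Arrays.asList("drew", "fred"));

    // Returns "1" if the name is in use and "0" otherwise, like the PHP version.
    static String nameInUse(String q) {
        return (q != null && TAKEN.contains(q)) ? "1" : "0";
    }

    // Builds the XML document the client-side processResponse() expects.
    static String buildResponse(String q) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
             + "<response><method>checkName</method><result>"
             + nameInUse(q)
             + "</result></response>";
    }

    public static void main(String[] args) {
        System.out.println(buildResponse("drew"));
        System.out.println(buildResponse("someone-else"));
    }
}
```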



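Here is another practical example. It was a class assignment I once had and is quite representative: an application of Google Suggest (the new Google Toolbar seems to use this feature). Here is the finished DEMO. More and more websites offer Web Service-style APIs; calling their API URLs returns data we can use and put on our own pages, and this is where Ajax comes in. Some of the returned text is XML and can be handled as in the simple example above, but many services, like Google Suggest, return text formatted like a piece of code. For those we use JavaScript's eval function to run the text as code embedded in our own page. If the embedded code calls a function, we have to write a function of the same name ourselves as its implementation (this is the optional func 3 in the flow chart).

I won't paste the complete code here, only the key parts (the back end was originally a Java servlet, but the space hosting the demo has no Tomcat and does not support servlets, so I switched to PHP; you can rewrite it in Java yourselves as homework :) ):

1) Form code: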

<form name="QForm" method="POST" action="google_suggest.php">
    <table bgcolor="#8080C0" width="90%">
        <tr><td nowrap>Search Term:</td>
        <td><input type="text" name="qtext" onkeyup="return GetSuggestion()" size="60"></td></tr>
        <tr><th colspan="2" align="left" bgcolor="#A8A8FF"><DIV id="google_suggest_target">results go here . . . </DIV></th></tr>
    </table>
</form>

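a. As you can see, the query string is posted to google_suggest.php.
b. The action function is GetSuggestion(); the string it returns is displayed in the space reserved on the page.

2) Back-end code (PHP): it receives the front end's request, turns it into a request to the Google Suggest API URL, and returns the text it receives to the front end. The code is simple: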


<?php
function getGoogleSuggest($q)
{
    $url = "" . urlencode($q);
    return file_get_contents($url);
}
?>
<?php echo getGoogleSuggest($_POST['q']) ?>

a. The Google Suggest API returns text formatted as code, like this:
sendRPCDone(frameElement, "", new Array(), new Array(), new Array(""));

3) Front-end response handling code:

    <script type="text/javascript">
        var req;
        function GetSuggestion() {
            req = false;
            var f = document.QForm;

            // branch for native XMLHttpRequest object
            if (window.XMLHttpRequest) {
                try { req = new XMLHttpRequest(); } catch (e) { req = false; }
            } else if (window.ActiveXObject) { // branch for IE/Windows ActiveX version
                try {
                    req = new ActiveXObject("Msxml2.XMLHTTP");
                } catch (e) {
                    try { req = new ActiveXObject("Microsoft.XMLHTTP"); } catch (e2) { req = false; }
                }
            }
            if (req) {
                var url = "google_suggest.php";
                req.onreadystatechange = processReqChange;
                req.open("POST", url, true);

                req.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
                req.send("q=" + encodeURIComponent(f.qtext.value));
            }
            return true;
        }

        function processReqChange() {
            if (req.readyState == 4) { // only if req shows "loaded"
                if (req.status == 200) { // only if "OK"
                    var x = req.responseText;
                    eval(x); // runs the returned sendRPCDone(...) call
                } else {
                    alert("There was a problem retrieving the XML data:\n" + req.statusText);
                }
            }
        }

        // Called by the code that Google Suggest returns; must have exactly this name.
        function sendRPCDone(frameElement, qString, arr1, arr2, arr3) {
            var suggest_results = eval(arr1);
            var counts = eval(arr2);
            var htmlstr = "<TABLE cellspacing=4 border=0>";
            for (var i = 0; i < suggest_results.length; i++) {
                htmlstr += "<tr><td><a href=\"javascript:self.location=\'" + suggest_results[i] + "&btnG=Google+Search\'\">" + suggest_results[i] + "</a></td>";
                htmlstr += "<TD width=200><font color= 228b22>" + counts[i] + "</font></TD></TR>";
            }
            htmlstr += "</TABLE>";
            document.getElementById("google_suggest_target").innerHTML = htmlstr;
        }
    </script>

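4. Homework :)

2. The API URL is "" + "your account name"; try it yourself and see what format of text it returns. Also, to limit the number of records returned, you can add a parameter such as "?count=10".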



help doc:

1- Make sure you have installed Apache 2, PHP 5 and Java J2EE 1.5
2- download and, which will include
extra dll(s)
   - unpack the PECL package to your extensions folder; in PHP 5 it is ext.
   - unpack java-Bridge to the PHP root folder; in my case it is simply C:\PHP
1. java-Bridge includes new versions of certain files such as php_java.dll,
   so it would be wise to rename the old files that came with the PECL package, for example to
   file_old, so that you can roll back at any time.
2. Don't run the batch file under php-java-bridge after unpacking to the PHP root folder; just add the following lines to the php.ini configuration file (they depend on the installation folder of J2EE):

extension_dir = "C:\php\ext"
java.java_home=C:\Program Files\Java\jre1.5.0_06\Program Files\Java\jre1.5.0_06\bin\javaw.exe

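In eight years Google has grown from a single search product into a whole range of products that change or influence people's lives, and it keeps driving changes in web concepts and technology. Products we use often include Google Talk, Google AdSense, Gmail, Google Calendar, Google Maps, Google Video, Google Store, Google Earth, Google Toolbar and Google Desktop. There are also many products Google is still thinking about.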




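Zhuaxia (抓虾) is one of those things that once left me feeling a bit let down, so much so that for a long time I did not register an account. But having pulled myself together, I gladly accept it as an excellent home-grown Web 2.0 website.

The idea behind Zhuaxia is simple. It is a new Chinese subscription site that nicely combines the Web 2.0 concept with the currently popular RSS syndication format. Online RSS subscription sites such as Bloglines have existed abroad for a long time, but they do not integrate the Web 2.0 concept as organically as Zhuaxia does: Bloglines is just a simple subscription system with simple sharing.

As for Web 2.0, a concept that rose from the ruins of the last Internet bubble, most netizens have by now had close contact with it. Blogs and wikis, popular in China since 2005, are typical Web 2.0 products.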


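Yet the concept of Web 2.0 is simply to give netizens a platform for enjoying all kinds of web services.

Netizens are no longer just the audience; they can be actors, directors, distributors, even resellers. Technically, Web 2.0 lets users begin to control data. From the user's point of view, Web 2.0 turns the Internet into a virtual community where everyone can communicate and share. (In that sense, early BBSes and P2P download software were already Web 2.0.)

As for RSS syndication, I have always regarded it as just an XML-based data structure. Long ago, when I started developing with .NET, I accepted one idea behind XML Schema: separating data from its presentation. That is also how I overcame the urge to laugh at XML for being such a simple web standard. Even back then I had the idea of using RSS as a standard structure for the messy information on the Internet, which would make search engines simpler (I once wrote a small collector-like program for this). That was especially so after I took a course on Distributed Multimedia Information Management, which talked at length about web ontologies and RDF, really a technique for describing web entities and their relationships with XML data structures. Compared with the simple RSS, though, RDF seems somewhat ahead of its time in practice.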
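To make the "RSS is just an XML data structure" point concrete, here is a minimal sketch that pulls the title and link out of every item of an RSS 2.0 document with the JDK's built-in DOM parser. The class name and the sample feed are made up for illustration; a real collector would also have to deal with encodings, namespaces and malformed feeds.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RssTitles {
    // Returns "title -> link" for every <item> element in the document.
    static List<String> items(String rssXml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(rssXml)));
            NodeList nodes = doc.getElementsByTagName("item");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element item = (Element) nodes.item(i);
                result.add(text(item, "title") + " -> " + text(item, "link"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("not a well-formed RSS document", e);
        }
    }

    // Text of the first child element with the given tag, or "" if there is none.
    private static String text(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? "" : list.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Made-up feed used only for illustration.
        String feed = "<rss version=\"2.0\"><channel><title>Demo</title>"
                + "<item><title>First post</title><link>http://example.com/1</link></item>"
                + "<item><title>Second post</title><link>http://example.com/2</link></item>"
                + "</channel></rss>";
        for (String line : items(feed)) {
            System.out.println(line);
        }
    }
}
```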

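With the Web 2.0 concept, a standard data structure, and some concrete implementation technology (such as the currently popular Ruby), you can cobble together a Web 2.0 site yourself. Zhuaxia has clearly done rather well here. For one thing, there are still few successful sites of this kind in China (the ones I visit often are just Zhuaxia and Douban); for another, RSS (e.g. blogs) is in its boom season in China.

Of course there is now plenty of talk that Web 2.0 is going nowhere. That is nothing new. The web is like that: everyone has ideas and can get the technology to build them, but only a few survive and grow big. Web 2.0 is still in the cash-burning stage: because the services are free (everyone is used to the free lunch on the net), sites can only burn money to grab users, then sell traffic, then go for monopoly. Without money, you end up like 奇客发现 (a site whose idea resembles the famous digg.com but which is clearly still in incubation). In this respect it is no different from Web 1.0. That is also why most IT people remain frustrated, pursuing their own dreams under the shelter of whichever companies, large or small, are still alive.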

When I try to get some information from an HTTP connection to certain websites (say, with HttpURLConnection.getResponseCode()), the JVM seems to hang for quite a while. Some say the cause may be the HTTP server, which is supposedly a Microsoft web server. Here and here are bug reports for Java 1.3 and earlier. Although the problem is said to have been solved since Java 1.4, I still get an undesirably long wait before a SocketException (Connection reset) is thrown. Btw, conn.setConnectTimeout or conn.setReadTimeout is involved in this problem. I am not sure whether there is any way to save time by skipping those bad links.
Here is a good article introducing Ruby: why choose Ruby instead of Perl and Python?
-- Scenario:
    The purpose of a reader is to interpret a low-level byte stream (ByteArrayInputStream, StringInputStream, FileInputStream and so on) as a character stream and provide character input to whatever class needs it. Converting an InputStream to a Reader is very simple:
Reader reader = new InputStreamReader( in ); //in is an instance of class InputStream or derived classes
But sometimes we need to convert a Reader to an InputStream. Think about the following scenarios:
1.  The original input stream has been filtered by some reader, and now we need to save the filtered content back into a database through an InputStream: we cannot use the original stream, only the filtered one, which can only be obtained from the reader.
2.  Given a class that holds a reader over streaming content produced by complex parsing or downloading, we want to use that content without repeating the complex analysis, so we need some wrapper methods to get an InputStream from the reader.

-- Solution:
1. Write your own InputStream implementation, such as the following:

import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;

class MyInputStream extends InputStream {
    private Reader rd;

    public MyInputStream(Reader rd) {
        this.rd = rd;
    }

    // implement the read() method to make this all work
    public int read() throws IOException {
        int t = rd.read();
        // you can do your processing on the inputReader here
        // fiddle with the values and return
        return t;
    }
}
Note: Applications that need to define a subclass of InputStream must always provide a method that returns the next byte of input.
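One caveat with passing the value of Reader.read() straight through: a Reader returns characters (0 to 65535), while InputStream.read() is expected to return a byte (0 to 255), so anything beyond Latin-1 needs to be encoded first. Below is a minimal sketch of a charset-aware variant. The class name and the helper toBytes are hypothetical, and the chunk-by-chunk encoding assumes a stateless encoding such as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReaderInputStream extends InputStream {
    private final Reader reader;
    private final Charset charset;
    private final char[] chars = new char[1025]; // one spare slot for a trailing low surrogate
    private byte[] pending = new byte[0];         // encoded bytes not yet handed out
    private int pos = 0;

    ReaderInputStream(Reader reader, Charset charset) {
        this.reader = reader;
        this.charset = charset;
    }

    @Override
    public int read() throws IOException {
        while (pos >= pending.length) {
            int n = reader.read(chars, 0, chars.length - 1);
            if (n < 0) {
                return -1; // end of the underlying reader
            }
            if (n > 0 && Character.isHighSurrogate(chars[n - 1])) {
                int low = reader.read(); // keep a surrogate pair in the same chunk
                if (low >= 0) {
                    chars[n++] = (char) low;
                }
            }
            pending = new String(chars, 0, n).getBytes(charset);
            pos = 0;
        }
        return pending[pos++] & 0xFF; // InputStream.read() must return 0..255
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }

    // Demo helper: drains a Reader through the stream into a byte array.
    static byte[] toBytes(Reader reader, Charset charset) {
        try (InputStream in = new ReaderInputStream(reader, charset)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (int b = in.read(); b != -1; b = in.read()) {
                out.write(b);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes(new StringReader("h\u00e9llo"), StandardCharsets.UTF_8);
        System.out.println(bytes.length + " bytes");
    }
}
```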
(refer to

-- Anything else? BTW, for parsing an XML-based input stream with SAX, I am glad to see that the InputSource constructor can take either an InputStream or a Reader (refer to

for general purpose hash function:

for cryptography & hash function

for a faster and better hash function (comparison of several hash function):

----> for further reading...
1. Getting the IP Address of a Hostname

try {
    InetAddress addr = InetAddress.getByName("");
    byte[] ipAddr = addr.getAddress();

    // Convert to dot representation
    String ipAddrStr = "";
    for (int i=0; i<ipAddr.length; i++) {
        if (i > 0) {
            ipAddrStr += ".";
        }
        ipAddrStr += ipAddr[i]&0xFF;
    }
} catch (UnknownHostException e) {
}

2. Getting the Hostname of an IP Address

This example attempts to retrieve the hostname for an IP address. Note that getHostName() may not succeed, in which case it simply returns the IP address.

try {
    // Get hostname by textual representation of IP address
    InetAddress addr = InetAddress.getByName("");

    // Get hostname by a byte array containing the IP address
    byte[] ipAddr = new byte[]{127, 0, 0, 1};
    addr = InetAddress.getByAddress(ipAddr);

    // Get the host name
    String hostname = addr.getHostName();

    // Get canonical host name
    String hostnameCanonical = addr.getCanonicalHostName();
} catch (UnknownHostException e) {
}

3. Getting the IP Address and Hostname of the Local Machine

try {
    InetAddress addr = InetAddress.getLocalHost();

    // Get IP Address
    byte[] ipAddr = addr.getAddress();

    // Get hostname
    String hostname = addr.getHostName();
} catch (UnknownHostException e) {
}
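As a small aside to the manual dot conversion in the first example, InetAddress can format itself with getHostAddress(). The sketch below builds the address from raw bytes, which needs no DNS lookup; the class and method names are made up for illustration.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class DottedAddress {
    // Formats a raw 4-byte (IPv4) or 16-byte (IPv6) address as text, without any DNS lookup.
    static String dotted(byte[] rawAddress) {
        try {
            return InetAddress.getByAddress(rawAddress).getHostAddress();
        } catch (UnknownHostException e) {
            // getByAddress throws this only when the array has an illegal length
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(dotted(new byte[]{127, 0, 0, 1}));
        System.out.println(dotted(new byte[]{(byte) 192, (byte) 168, 1, 10}));
    }
}
```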

In the last digest about the greatest software ever written, I noted a worm named Morris, which the author ranks 12th. Actually, after finishing the development of my clustering search engine, which is based on Lucene, I am studying P2P architecture for my distributed search engine (more precisely, the web crawler part). While reading some P2P lookup protocol papers such as Chord, I noticed a guy named Morris among the developers. Hmm, this is the same Morris: from the wiki I learned that he is now an associate professor at MIT and was indicted for the damage done by his Morris worm. Anyway, it is very interesting to know some stories about those geeks.
12. The Morris worm
11. Google search rank
10. Apollo guidance system
9. Excel spreadsheet
8. Macintosh OS
7. Sabre system
6. Mosaic browser
5. Java language
4. IBM System 360 OS
3. Gene-sequencing software at the Institute for Genomic Research
2. IBM's System R
1. Unix System III

What do you think?
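As everyone knows, Google runs on a large number of Linux (GNU) servers, and that is only one aspect of its support for free software. Others include Summer of Code, which has become an incubator producing a lot of good code and projects, and the recently opened Code Repository, which looks set to replace (author's note: a major base of open source). On the one hand, Google contributed its Picasa (for the Linux (GNU) platform) (author's note: an image management program), and it is used by Wine (author's note: Windows on Linux/Unix, built on top of X Window); on the other hand, Google also sponsors some open source projects, such as Sri Lanka, to the tune of about $25,000.
Of course, Google also funds open source quietly. For example, to everyone's surprise, the Mozilla Foundation (author's note: behind Firefox, the other browser everyone knows) earned 72 million last year, simply by making Google the default search engine in Firefox.

In January 2005 Google hired Ben Goodger, the lead engineer of Firefox and one of a handful of key open source coders. At the end of the year Guido van Rossum, the creator of Python, also joined Google. Recently Andrew Morton, maintainer of the Linux 2.6 kernel, announced that he would leave OSDL and join Google.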


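I remember that in the earliest years people wrote their own code in their spare time, out of interest, working and learning at once. Suddenly the first dot-com era arrived, and quite a few early open source companies began hiring top programmers: core coders such as Alan Cox, David Miller and Stephen Tweedie went to Red Hat, and some others went to Linuxcare.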





The debate over whether companies that use open source code should also open their own code does not concern Google alone. Other major beneficiaries include Yahoo, which has recently been busy acquiring Web 2.0 companies such as , all of which clearly bear the mark of open source. Of course its relationship with open source does not go back as far as Google's, but Yahoo too has begun to attract open source talent.
People are still talking about Web 2.0, and I am not sure it is a purely technical term. In my understanding, most of the meaning of Web 2.0 is perhaps its marketing meaning: the web is becoming a commons and people generate its content. Again, I am not sure what place web services have in Web 2.0. As I understand it, the web is no longer merely a client-server marketing model (I am not talking about web structure here) but an interactive community. But the question is: who is going to be the operator or administrator of this community, and are there any rules of the game to follow? Will it be another utopia?

Well, on the technical layer, I'd like to shed some light on the so-called web standards trends:

1. front end --
         CSS ----> layout
         XML ----> data 
         XHTML ----> markup
         Javascript & DOM ----> behavior + XMLHttpRequest --> AJAX ?

2. back end -- 
         some open source projects such as Ruby on Rails...

Let me know what you think...

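As the founder of Lucene and Nutch, two major Apache open source projects (plus related sub-projects such as Lucy, Lucene4C and Hadoop), Doug Cutting has long been followed by search engine developers. After four years working for Yahoo as a contractor, he finally joined Yahoo as an employee this year.

Below is my translation, done in my spare time, of an interview with him from two years ago. The original (Doug Cutting Interview) is easy to find by googling. I hope it serves as a starting point for beginners in search engine development.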



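I mainly work from home on two search-related open source projects, Lucene and Nutch. The money comes mostly from contracts related to these projects. Yahoo! Labs currently funds part of Nutch, and both projects also have some other short-term contracts.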






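 -- Fetching: downloading the pages that are pointed to.
 -- Database: storing information about fetched pages, such as which pages have been fetched, when they were fetched, and which pages they link to.
 -- Link analysis: analysing the information in that database and giving each page some weights (PageRank, WebRank and the like) to estimate its importance. In my view, though, indexing the anchor text of those pages matters more. (That is also why things like Google bombing are so effective.)
 -- Indexing: indexing the fetched page content, inbound links, link analysis weights and so on for fast lookup.
 -- Searching: querying an index and displaying the results by page rank.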
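The indexing and searching steps in the list above can be illustrated with a toy inverted index. This is only a sketch of the general idea, not how Lucene or Nutch implement it: the class name is made up, there is no ranking or link analysis, and a query simply returns the documents that contain all of its terms.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TinyIndex {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Indexing: split the text into lower-cased terms and record the document id for each.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Searching: intersect the posting sets of all query terms (AND semantics).
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : query.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<Integer> docs = postings.getOrDefault(term, Collections.<Integer>emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add(1, "Open source search engine");
        index.add(2, "Open source web crawler");
        System.out.println(index.search("open source"));
        System.out.println(index.search("search engine"));
    }
}
```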



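Unfortunately, most of them are probably out of luck, because Nutch still needs a Java servlet container (author's note: e.g. Tomcat). Some ISPs support that, but most do not. (Author's note: only if you control the Apache server can you install something like Tomcat on it.)

5. Can I combine Lucene with the Google Web API? Or with other applications I have written before?

A group of people have already written some Google-like APIs for Nutch, but none has been merged into the current system yet. It should happen in the near future.

6. What do you think is the biggest obstacle to building a search engine today: hardware, storage, or the ranking algorithm? Also, can you tell me roughly how much space a search engine needs to work properly, say if I just want to write one that searches thousands to millions of RSS feeds?

Nutch needs about 10 KB in total per page. RSS feed pages are generally small (author's note: RSS feeds are XML-based text pages, so they are not large), so they should be easier to handle. Of course Nutch has no RSS support yet. (Author's note: in fact the API does contain data structures and parsing for RSS.)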
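Taking the "about 10 KB per page" figure at face value, a back-of-the-envelope storage estimate is a single multiplication. The sketch below is only that arithmetic; the class name is made up and 1 KB is assumed to mean 1024 bytes.

```java
class StorageEstimate {
    // "About 10 KB per page", the rough figure quoted in the interview.
    static final long BYTES_PER_PAGE = 10L * 1024;

    // Total bytes needed for the given number of pages.
    static long bytesFor(long pages) {
        return pages * BYTES_PER_PAGE;
    }

    public static void main(String[] args) {
        long pages = 1_000_000L; // one million feeds or pages
        double gib = bytesFor(pages) / (1024.0 * 1024.0 * 1024.0);
        System.out.printf("%d pages need about %.1f GiB%n", pages, gib);
    }
}
```

So a million pages come to roughly ten gigabytes under this assumption, before any replication or growth.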

7. Is it easy to get funding from Yahoo! Labs? Who can apply? And what do you have to give in return?



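I have talked with some of the folks over there, including Larry Page (author's note: one of Google's two founders). They are all willing to help, but they cannot find a suitable way that would not also help their competitors.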













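The details are described in this article ( micro/mi2003/ m2022.pdf).


We haven't found time for that yet. But it is clearly an important area. Before we get into link farms, we need to do some simple things: look at word stuffing (author's note: embedding certain words in a page many times, even hundreds of times, some of them invisible to the eye, e.g. white text on a white background; this is one kind of spamdexing), white-on-white text, and so on.

I think that in general (spam detection being a sub-problem), the key to search quality is having a supporting process of reliable manual evaluation of query results. With it we can train a ranking algorithm to produce better results (spam results are one kind of bad result). Commercial search engines usually hire people to do reliable evaluations. Nutch will do this too, but obviously we cannot simply accept evaluations from anyone who volunteers, because spammers could easily tamper with them. So we need a way to build trust among volunteer evaluators. I think a peer-review system, something like Slashdot's karma system, should help a lot here.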




posted @ 2006-08-02 06:07 Dedian

--  Getting Ready to Use CVS

First set the variable CVSROOT to /class/`username`/cvsroot
[Or any other directory you wish]
[For csh/tcsh: setenv CVSROOT ~/cvsroot]
[For bash/ksh: CVSROOT=~/cvsroot;export CVSROOT]

Next run cvsinit. It will create this directory along with the subdirectory CVSROOT and put several files into CVSROOT.

-- How to put a project under CVS

A simple program consisting of multiple files is in /workspaces/project.

To put this program under cvs first

cd to /workspaces/project


cvs import -m "Sample Program" project sample start

CVS should respond with
N project/Makefile
N project/main.c
N project/bar.c
N project/foo.c

No conflicts created by this import

If you were importing your own program, you could now delete the original source.
(Of course, keeping a backup is always a good idea)

-- Basic CVS Usage

Now that you have added 'project' to your CVS repository, you will want to be able to modify the code.

To do this you want to check out the source. You will want to cd to your home directory before you do this.


cvs checkout project

CVS should respond with
cvs checkout: Updating project
U project/Makefile
U project/bar.c
U project/foo.c
U project/main.c

This creates the project directory in your home directory and puts the files: Makefile, bar.c, foo.c, and main.c into the directory along with a CVS directory which stores some information about the files.

You can now make changes to any of the files in the source tree.
Let's say you add a printf("DONE\n"); after the function call to bar().
[Or just cp /class/bfennema/project_other/main2.c to main.c]

Now you have to check in the new copy

cvs commit -m "Added a DONE message." main.c

CVS should respond with
Checking in main.c;
/class/'username'/cvsroot/project/main.c,v <-- main.c
new revision: 1.2; previous revision: 1.1

Note, the -m option lets you define the check-in message on the command line. If you omit it you will be placed into an editor where you can type in the check-in message.

-- Using CVS with Multiple Developers

To simulate multiple developers, first create a directory for your second developer.
Call it devel2 (Create it in your home directory).
Next check out another copy of project.
  • HINT: cvs checkout project
Next, in the devel2/project directory, add a printf("YOU\n"); after the printf("BAR\n");
[Or copy /class/bfennema/project_other/bar2.c to bar.c]

Next, check in bar.c as developer two.
  • HINT: cvs commit -m "Added a YOU" bar.c
Now, go back to the original developer directory.
[Probably /class/'username'/project]

Now look at bar.c. As you can see, the change made by developer two has not been integrated into your version. For that to happen you must

cvs update bar.c

CVS should respond with
U bar.c

Now look at bar.c. It should now be the same as developer two's.
Next, edit foo.c as the original developer and add printf("YOU\n"); after the printf("FOO\n");
[Or copy /class/bfennema/project_other/foo2.c to foo.c]

Then check in foo.c

  • HINT: cvs commit -m "Added YOU" foo.c
Next, cd back to developer two's directory.
Add printf("TOO\n"); after the printf("FOO\n");
[Or copy /class/bfennema/project_other/foo3.c to foo.c]

Now type

cvs status foo.c

CVS should respond with
File: foo.c             Status: Needs Merge

   Working revision: 'Some Date'
   Repository revision: 1.2     /class/'username'/cvsroot/project/foo.c,v
   Sticky Tag:          (none)
   Sticky Date:         (none)
   Sticky Options:      (none)
The various statuses of a file are:
Up-to-date
    The file is identical with the latest revision in the repository.
Locally Modified
    You have edited the file, and not yet committed your changes.
Needing Patch
    Someone else has committed a newer revision to the repository.
Needs Merge
    Someone else has committed a newer revision to the repository, and you have also made modifications to the file.

Therefore, this is telling us we need to merge our changes with the changes made by developer one. To do this

cvs update foo.c

CVS should respond with
RCS file: /class/'username'/cvsroot/project/foo.c,v
retrieving revision
retrieving revision 1.2
Merging differences between and 1.2 into foo.c
rcsmerge: warning: conflicts during merge
cvs update: conflicts found in foo.c
C foo.c

Since the changes we made to each version were so close together, we must manually adjust foo.c to look the way we want it to look. Looking at foo.c we see:
void foo()
{
  printf("FOO\n");
<<<<<<< foo.c
  printf("TOO\n");
=======
  printf("YOU\n");
>>>>>>> 1.2
}

We see that the text we added as developer one is between the ======= and the >>>>>>> 1.2.
The text we just added is between the ======= and the <<<<<<< foo.c.

To fix this, move the printf("TOO\n"); to after the printf("YOU\n"); line and delete the additional lines that CVS inserted. [Or copy /class/bfennema/project_other/foo4.c to foo.c]
Next, commit foo.c

cvs commit -m "Added TOO" foo.c

Since you issued a cvs update command and integrated the changes made by developer one, the integrated changes are committed to the source tree.

-- Additional CVS Commands

To add a new file to a module:
  • Get a working copy of the module.
  • Create the new file inside your working copy.
  • Use cvs add filename to tell CVS to version control the file.
  • Use cvs commit filename to check in the file to the repository.

Removing files from a module:
  • Make sure you haven't made any uncommitted modifications to the file.
  • Remove the file from the working copy of the module: rm filename.
  • Use cvs remove filename to tell CVS you want to delete the file.
  • Use cvs commit filename to actually perform the removal from the repository.

For more information see the cvs man pages or the file in cvs-1.7/doc.

copy from
When reading the GData source code, you will find a lot of generic-style code in it. Generics are one of several language extensions in JDK 1.5. If you are using a Java 1.5 compiler, it is certainly worth getting an idea of generics. Note that Java generics look like C++ templates but are quite different.

1. What is the idea of generics?
Simply put, generics are a way of parameterizing types, including class types and other data types.

2. Examples?
-- We are familiar with some container types, such as Collection. Here is a typical usage from the old days (Java 1.4 or before):
Vector myList = new Vector();
myList.add(new Integer(100));
Integer value = (Integer)myList.get(0);

Now it is better to write it like this for type safety (the Eclipse IDE displays type safety warnings for the code above under the Java 1.5 compiler option):
  Vector<Integer> myList = new Vector<Integer>();
  myList.add(new Integer(100));
  Integer value = myList.get(0);

-- The reason we can write code like this is that class Vector has been defined as a generic:
public class Vector<E> {
      public boolean add(E x) { ... }
}

-- When we see angle brackets in a declaration, that is a generic. An invocation of a generic type is a parameterized type: to use the generic, we need to specify an actual type argument (such as Integer above).

3. Tricks in generics

-- We know that generics make data types such as containers more flexible about the entries they accept. But this can also be very tricky. Taking containers as an example, one question is whether we can assign one container to another container variable. If you want to do it in the following style, the answer is no.
List<String> ls = new ArrayList<String>();
List<Object> lo = ls; //compile time error!

-- We know String is a subtype of Object, and we can assign a String value to an Object variable. But we cannot assign a List of String to a List of Object as a whole (like a reference to a variable). The reason is that we can access the inner part of the List (its elements): through the List<Object> reference we could put a non-String object into what is really a list of Strings, which would make the List type unsafe. So the Java 1.5 compiler will not let you do that.
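Arrays show what would go wrong if the assignment were allowed. Java arrays are covariant, so the array version of the same code compiles and only fails at run time, whereas generics reject it at compile time. A minimal sketch, with made-up class and method names:

```java
import java.util.ArrayList;
import java.util.List;

class InvarianceDemo {
    // Arrays are covariant: this compiles, but storing the Integer fails at run time.
    static boolean arrayStoreFails() {
        Object[] objects = new String[1]; // allowed, String[] is a subtype of Object[]
        try {
            objects[0] = Integer.valueOf(42);
            return false;
        } catch (ArrayStoreException e) {
            return true;
        }
    }

    // With generics the unsafe assignment does not compile; a wildcard view is allowed instead.
    static Object firstElement() {
        List<String> ls = new ArrayList<String>();
        ls.add("hello");
        // List<Object> lo = ls;          // compile-time error, as described above
        List<? extends Object> view = ls;  // fine: elements can be read but not added
        return view.get(0);
    }

    public static void main(String[] args) {
        System.out.println(arrayStoreFails());
        System.out.println(firstElement());
    }
}
```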

-- Looking at the two styles of code in the examples of section 2, we might say that the older style looks more flexible, because myList can accept other data types besides Integer, while the new 1.5 style can only take Integer values. Well, if we need more flexibility, we can apply wildcards to generics.

4. Wildcards and bounded wildcards

-- If we see something like Collection<?> c, with a question mark in the angle brackets, that is a wildcard. It means the element type is unknown and can stand for any type.
-- If we see something like Collection<? extends Number> c, that is a bounded wildcard. It means the element type has an upper bound: it must be Number or a subtype of Number.
-- But with either a wildcard or a bounded wildcard, we cannot put a value of a specific type into the collection, because the wildcard means the element type is unknown, and you cannot add a value to a collection of unknown element type.
-- So what on earth can a wildcard be used for? Go back to the flexibility we mentioned before. We use wildcards to express flexibility in a definition or declaration, not to store things.
For example, we can define a method like this:
void printCollection(Collection<?> c) {
      for (Object e : c) { System.out.println(e); }
}
See? That is flexible. You can call this function for any Collection. You can use the elements of a Collection<?>, just don't try to put something into it.
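As a small illustration of both kinds of wildcard, the sketch below reads from a Collection<? extends Number> and from a Collection<?>. The class and method names are made up for this example.

```java
import java.util.Arrays;
import java.util.Collection;

class WildcardDemo {
    // Bounded wildcard: accepts a collection of Integer, Double, or any other Number subtype.
    static double sum(Collection<? extends Number> numbers) {
        double total = 0;
        for (Number n : numbers) {
            total += n.doubleValue(); // reading as Number is always safe
        }
        // numbers.add(Integer.valueOf(1)); // would not compile: the element type is unknown
        return total;
    }

    // Unbounded wildcard: elements can only be treated as Object.
    static String join(Collection<?> items) {
        StringBuilder sb = new StringBuilder();
        for (Object item : items) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(item);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(sum(Arrays.asList(1, 2, 3)));
        System.out.println(sum(Arrays.asList(1.5, 2.5)));
        System.out.println(join(Arrays.asList("a", "b")));
    }
}
```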
-- So the question is: what if we want that flexibility for our method and also need to put something into the collection inside it? In that case we need a generic method.

5. Generic method
-- That means a method declaration can also be parameterized.
-- Example:
    public <T> void addCollection(List<T> objs, T obj) { objs.add(obj); }
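A complete version of such a generic method might look like the sketch below. It is written as a static method in a made-up class so that it can be called without an instance; the compiler infers T from the arguments at each call site.

```java
import java.util.ArrayList;
import java.util.List;

class GenericMethodDemo {
    // T ties the element type of the list to the type of the object being added.
    static <T> void addCollection(List<T> objs, T obj) {
        objs.add(obj);
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<String>();
        addCollection(names, "lucene");      // T is inferred as String
        // addCollection(names, 42);         // would not compile: 42 is not a String

        List<Number> numbers = new ArrayList<Number>();
        addCollection(numbers, 42);          // T is Number; the Integer is accepted as a Number
        System.out.println(names + " " + numbers);
    }
}
```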

6. When to use a generic method and when to use a wildcard?
-- If the type parameter is used only once, or has no relationship to the other arguments of the method or to the return type, a wildcard is the better choice because it is clearer and more concise.
-- Otherwise, a generic method should be used.
class Collection {
      public static <T, S extends T> void copy(List<T> dest, List<S> src){...}
}
can be better rewritten as:
class Collection {
      public static <T> void copy(List<T> dest, List<? extends T> src){...}
}
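A runnable version of the second signature, with a body filled in for illustration, shows that the wildcard form accepts any source list whose elements are a subtype of the destination's element type. The class name is hypothetical and the body simply appends, which is an assumption about what copy should do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CopyDemo {
    // S was used only once in the two-parameter version, so a bounded wildcard replaces it.
    static <T> void copy(List<T> dest, List<? extends T> src) {
        for (T item : src) {
            dest.add(item);
        }
    }

    public static void main(String[] args) {
        List<Number> dest = new ArrayList<Number>();
        List<Integer> ints = Arrays.asList(1, 2);
        List<Double> doubles = Arrays.asList(2.5);
        copy(dest, ints);    // T = Number, source elements are Integers
        copy(dest, doubles); // T = Number, source elements are Doubles
        System.out.println(dest);
    }
}
```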


posted @ 2006-06-23 09:39 Dedian
When I was debugging my web crawler by crawling the Yahoo website, I found that trying to connect to a website whose URL is such as, the following exception occurs:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 12
 at java.lang.String.substring(Unknown Source)
 at Source)
 at Source)
 at Source)
 at Source)
 at Source)

The following is simple test code:
private static final String urlstring = "";

   URL url = new URL(urlstring);
   URLConnection con = url.openConnection();

Since no exceptions other than MalformedURLException and IOException are declared for this code, I am not sure whether it is a bug in Java's URL parsing...

Does anybody have an idea about that?

P.S. OK, somebody has pointed out that runtime exceptions such as java.lang.StringIndexOutOfBoundsException do not have to be declared but can still be thrown. So I need to catch StringIndexOutOfBoundsException in my code. But in my understanding, a function should catch all the exceptions from the functions it calls and rethrow those it cannot handle, so that callers can catch exceptions from deep inside. I am not sure runtime exceptions should be an exception to that...
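For a crawler, one pragmatic approach is to treat both kinds of exception as "bad link" at a single place, so that one malformed URL cannot stop the whole crawl. The sketch below only covers turning a string into a URL; the class and method names are made up, and it goes through java.net.URI, whose create method reports bad syntax with an unchecked IllegalArgumentException.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class SafeUrl {
    // Returns the parsed URL, or null if the string cannot be turned into one.
    static URL tryParse(String s) {
        try {
            return URI.create(s).toURL();
        } catch (MalformedURLException e) {
            return null; // checked exception, e.g. an unknown protocol
        } catch (RuntimeException e) {
            return null; // unchecked exception, e.g. illegal characters or a relative URI
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParse("http://example.com/a"));
        System.out.println(tryParse("not a url"));
    }
}
```

Catching RuntimeException this broadly is a deliberate trade-off for a crawler loop; in other code it would usually be better to catch the specific exception types.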
I am still working on the webcrawler part and thinking through the URL collection strategies. A URL frontier, which stores the list of active URLs to be parsed or downloaded, will handle synchronized I/O operations with the URL collection/inventory. I am stuck on some issues:

1. Duplicate URL elimination:
    a. Host name aliases --> DNS resolver
    b. Omitted port numbers
    c. Alternative paths on the same host
    d. Replication across different hosts
    e. Nonsense links or session IDs embedded in URLs?
2. Reachability of URLs
3. Distributed storage of the URL inventory and the related synchronization problem
4. Fetch strategies for the URL frontier or Fetcher to get active links for parsing
5. Scheduler for fetching and updating the URL collection: multi-threaded or single-threaded on each PC, and when to decide to re-parse a page
6. URL-seen test: has the page already been parsed, and should it be re-parsed? This should be done before a URL enters the URL frontier...
7. Extensibility issues for the modules: Fetcher, Extractor/Filters, Collector...
8. Checkpointing for interrupted crawls: how to resume a crawler job, and how to split crawler jobs and distribute them to different machines

It seems that I need a couple of days to refine my system architecture design...
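
For the duplicate URL elimination issue, a rough sketch of URL normalization with java.net.URI might look like the following. The class name UrlNormalizer and the choice to strip only a ";jsessionid=" path parameter are my own assumptions. It covers omitted default ports, "." and ".." path segments, and one common kind of embedded session ID. Host name aliases and replicated content need DNS or content checks and are not handled here.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

class UrlNormalizer {
    // Returns a canonical form of the URL, or null if it cannot be parsed.
    static String normalize(String raw) {
        try {
            URI u = new URI(raw).normalize(); // resolves "." and ".." segments
            if (u.getScheme() == null || u.getHost() == null) {
                return null; // not an absolute, server-based URL
            }
            String scheme = u.getScheme().toLowerCase(Locale.ROOT);
            String host = u.getHost().toLowerCase(Locale.ROOT);

            // Drop the port when it is the default one for the scheme.
            int port = u.getPort();
            if ((scheme.equals("http") && port == 80)
                    || (scheme.equals("https") && port == 443)) {
                port = -1;
            }

            // An empty path and "/" name the same resource.
            String path = u.getPath();
            if (path == null || path.length() == 0) {
                path = "/";
            }

            // Strip a ";jsessionid=..." path parameter (assumed session ID style).
            int semi = path.toLowerCase(Locale.ROOT).indexOf(";jsessionid=");
            if (semi >= 0) {
                path = path.substring(0, semi);
            }

            // The fragment is dropped: it never reaches the server.
            return new URI(scheme, null, host, port, path, u.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("HTTP://Example.COM:80/a/./b/../c;jsessionid=ABC?x=1"));
        System.out.println(normalize("http://example.com"));
    }
}
```

Two URLs that normalize to the same string can then be treated as one entry in the URL inventory.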
Here is an article on effective I/O programming; I am marking it so I can re-check the I/O design of my distributed search engine system later. My current system uses non-blocking synchronous mode. I need to check later whether anything can be done to improve performance and scalability.

A PhD student at the University of Auckland has proposed an idea for checking the OO design of Java code. The key point is to use a directed graph to analyze the dependencies between all Java classes: the more classes are involved in cycles, the worse the design.

Several open source Java projects were examined in his research report...
Though it is not the only metric for checking your OO design, I'd say it is an interesting idea.
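
I have not seen the tool from that report, so the following is only my own sketch of the general idea: treat "class X depends on class Y" as a directed edge and list the classes that can reach themselves again. The class name DependencyCycles and the map-based graph are assumptions for illustration. Extracting the real dependencies from source or bytecode is a separate problem.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class DependencyCycles {
    // deps maps a class name to the names of the classes it depends on.
    // Returns the (sorted) set of classes that take part in some cycle.
    static Set<String> classesInCycles(Map<String, List<String>> deps) {
        Set<String> result = new TreeSet<String>();
        for (String start : deps.keySet()) {
            if (reaches(deps, start, start)) {
                result.add(start);
            }
        }
        return result;
    }

    // True if target can be reached from start by following at least one edge.
    static boolean reaches(Map<String, List<String>> deps, String start, String target) {
        Set<String> seen = new HashSet<String>();
        Deque<String> stack = new ArrayDeque<String>();
        stack.push(start);
        while (!stack.isEmpty()) {
            List<String> next = deps.get(stack.pop());
            if (next == null) {
                continue; // class with no recorded dependencies
            }
            for (String n : next) {
                if (n.equals(target)) {
                    return true;
                }
                if (seen.add(n)) {
                    stack.push(n); // visit each class at most once
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new HashMap<String, List<String>>();
        deps.put("A", Arrays.asList("B"));
        deps.put("B", Arrays.asList("C"));
        deps.put("C", Arrays.asList("A", "D"));
        deps.put("D", new ArrayList<String>());
        System.out.println(classesInCycles(deps)); // prints [A, B, C]
    }
}
```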
Unlike collection types such as Vector or List, a Map (Hashtable or HashMap) accesses a value by a key. If we want to retrieve all the values that have been put in a Map, one simple way is to use a Collection, optionally with an Iterator. Here is the sample code (it retrieves only the values and skips the keys), assuming there is a variable: HashMap<String, ComplexDataType> links

Collection<ComplexDataType> c = links.values();
Vector<ComplexDataType> v = new Vector<ComplexDataType>(c);
for (int i = 0; i < v.size(); i++) {
    ComplexDataType tempData = v.get(i);
}

P.S. Map provides three views of its contents: keySet, entrySet and the values collection. We can use any of them.
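
A small sketch of two of those views, using Integer values in place of ComplexDataType to keep it short (the class and method names are made up). The for-each loop walks the view directly, so the copy into a Vector is optional.

```java
import java.util.HashMap;
import java.util.Map;

class MapViewsDemo {
    // values(): iterate over the values only, keys are skipped.
    static int sumValues(Map<String, Integer> m) {
        int sum = 0;
        for (Integer v : m.values()) {
            sum += v;
        }
        return sum;
    }

    // entrySet(): each entry gives access to both key and value.
    static int sumWhereKeyStartsWith(Map<String, Integer> m, String prefix) {
        int sum = 0;
        for (Map.Entry<String, Integer> e : m.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                sum += e.getValue();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> m = new HashMap<String, Integer>();
        m.put("a1", 1);
        m.put("a2", 2);
        m.put("b1", 10);
        System.out.println(sumValues(m));                  // prints 13
        System.out.println(sumWhereKeyStartsWith(m, "a")); // prints 3
    }
}
```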
These questions are very useful for Java newbies and for guys who wanna prepare for interviews for Java programming positions, which is really cool.

1. Reading text from Standard Input
try {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String str = "";
    while (str != null) {
        System.out.print("> some prompt ");
        str = in.readLine();
        // process str here
    }
} catch (IOException e) {
    // handle or report the error
}

2. Reading text from a file
try {
    BufferedReader in = new BufferedReader(new FileReader("filename"));
    String str;
    while ((str = in.readLine()) != null) {
        // process str here
    }
    in.close();
} catch (IOException e) {
}

3. Reading a file into a byte array

// Returns the contents of the file in a byte array.
public static byte[] getBytesFromFile(File file) throws IOException {
    InputStream is = new FileInputStream(file);

    // Get the size of the file
    long length = file.length();

    // You cannot create an array using a long type.
    // It needs to be an int type.
    // Before converting to an int type, check
    // to ensure that file is not larger than Integer.MAX_VALUE.
    if (length > Integer.MAX_VALUE) {
        // File is too large
        is.close();
        throw new IOException("File is too large: " + file.getName());
    }

    // Create the byte array to hold the data
    byte[] bytes = new byte[(int)length];

    // Read in the bytes
    int offset = 0;
    int numRead = 0;
    while (offset < bytes.length
           && (numRead = is.read(bytes, offset, bytes.length-offset)) >= 0) {
        offset += numRead;
    }

    // Ensure all the bytes have been read in
    if (offset < bytes.length) {
        is.close();
        throw new IOException("Could not completely read file "+file.getName());
    }

    // Close the input stream and return bytes
    is.close();
    return bytes;
}


4. Writing to a file

try {
    BufferedWriter out = new BufferedWriter(new FileWriter("filename"));
    out.write("some string");
    out.close();
} catch (IOException e) {
}
Note: If the file does not already exist, it is automatically created.

5. Appending to a file

try {
    // The second argument (true) opens the file in append mode
    BufferedWriter out = new BufferedWriter(new FileWriter("filename", true));
    out.write("appending String");
    out.close();
} catch (IOException e) {
}

6. Using a Random Access File

try {
    File f = new File("filename");
    RandomAccessFile raf = new RandomAccessFile(f, "rw");

    // Read a character
    char ch = raf.readChar();

    // Seek to end of file
    raf.seek(f.length());

    // Append to the end
    raf.writeChars("appended text");
    raf.close();
} catch (IOException e) {
}

posted @ 2006-05-31 08:12 Dedian | Views (561) | Comments (1)

The volatile keyword is used on variables that may be modified simultaneously by other threads. This warns the compiler to fetch them fresh each time, rather than caching them in registers. This also inhibits certain optimisations that assume no other thread will change the values unexpectedly. Since other threads cannot see local variables, there is never any need to mark local variables volatile.
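
A typical use is a stop flag shared between two threads. The sketch below is my own illustration (the class and method names are hypothetical). Under the Java memory model (Java 5 and later), the volatile write in requestStop() is guaranteed to become visible to the worker's read in the loop condition. Without volatile there is no such guarantee and the loop might never end. Note that volatile does not make compound operations such as counter++ atomic.

```java
class VolatileFlagDemo {
    // Written by the controlling thread, read by the worker thread.
    private volatile boolean stopRequested = false;

    void requestStop() {
        stopRequested = true;
    }

    // Starts a worker that spins until the flag is set, requests the stop,
    // and returns true if the worker terminated within the timeout.
    boolean runAndStop(long timeoutMillis) {
        Thread worker = new Thread(new Runnable() {
            public void run() {
                while (!stopRequested) {
                    // busy work would go here
                }
            }
        });
        worker.setDaemon(true); // do not keep the JVM alive if something goes wrong
        worker.start();
        requestStop();
        try {
            worker.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println(new VolatileFlagDemo().runAndStop(5000));
    }
}
```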

Though still under voting, it was originally proposed by Doug Cutting and has received only positive votes, so we will very likely get a 2.0 release this Friday. Some bugs have been fixed and deprecated code has been removed in the upcoming version.
P.S. The function Likely(outcome n) (1<=n<=3) is strictly monotonically decreasing, with an upper bound of 0.0001.
Oops! My laptop, a Compaq Presario R3230, is not working now (it worked just yesterday evening): blue screen, then it hangs at disk checking... When I reboot in safe mode, it still hangs at multi(0)disk(0)rdisk(0)partition(1)\windows\system32\drivers\atisgkaf.sys. I guess something is wrong with my video driver, but how can I fix the problem without wiping out the documents on my hard drive?

I googled it, and it seems some other people have hit the same problem. These steps are suggested:

1.  Insert the QuickRestore CD into the CD drive and restart the
2.  When the red Compaq logo appears, press and hold the Caps
    Lock key.  Next screen will be a blinking QuickRestore screen.
3.  When the QuickRestore text stops blinking, press and hold the
    Num Lock key.

But where can I get a QuickRestore CD? The included CD does not seem to be in my room any more... Does anybody have an idea about that?
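1. Actually, making your blog's hit count soar is simple: write the simplest possible webcrawler program that keeps visiting your own home page (sending HTTP requests). Many counters work on exactly this basis and do not verify IP addresses. If you don't believe it, press F5 to refresh your own page and see. At this rate, overtaking Lao Xu (老徐) in hits is certainly no problem. However, Sina plays tricks with hit counts anyway, since they can modify the counters themselves, so there is no point in playing this game with them.
2. A high hit count does not mean your page ranks high (PageRank). PageRank is a fairly technical term. Back in the day, Google's two youngsters, Larry Page (quite a coincidence that his surname is Page; he could hardly avoid becoming the boss of pages) and Sergey Brin, made their fortune from their PageRank research at Stanford, and now, still young, they can challenge MS. Of course, Google's PageRank algorithm is a trade secret. But there is no shortage of experts online: based on some of Google's search behavior and on math such as probabilistic modeling, some people worked out an explanation of PageRank that became very popular on the web. You can find that paper by Googling "PageRank Uncovered" (by Chris Ridings and Mike Shishigin). Reportedly, someone also used the mechanisms described in it to fool Google's search engine badly. That can no longer be verified, though, because Google keeps improving itself.
5. So, to make your page or site influential, do everything you can to get others to link to you and cite you. There is also another way: keep citing other people's articles. Citing here does not mean embedding other people's links in your own page, but getting your own page's link embedded in other people's pages. How? It is simply the Trackback feature of many blogs. If you look carefully, whenever you trackback someone's blog, your blog address is left on their blog page (the same goes for comments). However, most blogs now offer settings that disallow trackbacks or comments. Sina also seems to have started tampering: celebrity blogs apparently can no longer be trackbacked. Then again, Sina blogs are not friendly to many search engines, so don't bother targeting them. MSN Space seems workable, though: you could write some code that automatically connects to pages, fetches each blog's permalink, and then executes a piece of JavaScript that MSN itself provides to do the trackback. This is only something I thought of recently, and I have not written the code yet. If it works, it would also work on many other blogs. The idea came from seeing some messy websites show up in my trackbacks lately.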
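
To show why links matter more than hits, here is a tiny sketch of the publicly described PageRank formula, PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over the pages T linking to A, where C(T) is the number of outbound links of T. It is a toy iteration on a hand-made graph, not Google's actual algorithm. The class name, the array-based graph and the damping factor 0.85 are assumptions for illustration.

```java
import java.util.Arrays;

class PageRankSketch {
    // links[i] holds the indexes of the pages that page i links to.
    static double[] pageRank(int[][] links, double d, int iterations) {
        int n = links.length;
        double[] pr = new double[n];
        Arrays.fill(pr, 1.0); // start every page at rank 1
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, 1.0 - d); // the (1 - d) base term
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    continue; // a page without outbound links passes nothing on
                }
                // Page i splits its damped rank evenly among its outbound links.
                double share = d * pr[i] / links[i].length;
                for (int target : links[i]) {
                    next[target] += share;
                }
            }
            pr = next;
        }
        return pr;
    }

    public static void main(String[] args) {
        // Page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
        int[][] links = { {1, 2}, {2}, {0} };
        double[] pr = pageRank(links, 0.85, 50);
        System.out.println(Arrays.toString(pr));
    }
}
```

In this graph page 2 has two inbound links and page 1 only one, so page 2 ends up with the higher rank no matter how often either page is visited.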
posted @ 2006-05-19 16:15 Dedian | Views (1527) | Comments (3)
+ Webcrawler
    -- study open source code
          purpose: analyze code structure and basic components
          focus on: Nutch
                    & HTMLParser
                    & GData

    -- understand PageRank idea
       related articles:
       paper: "PageRank Uncovered" by Chris Ridings and Mike Shishigin (about Chris Ridings & SEO) (basic idea about crawler)
    -- get familiar with RSS & Atom protocols

    -- sample coding:
       Interface: Scheduler for fetching web links
       Interface: Web page Parser/Analyzer --> to deal with XML-based websites (weblogs or news sites, RSS & Atom) --> Parser classes based on SAX parser
       Interface: Extractor/Fetcher --> to get links from page
       Interface: Collector --> check whether a URL is duplicated and save it in the URL database with a certain data structure
       Interface: InformationProcessor --> PageRank should be one important factor --> (under thinking)
       Interface: Policies(Filter) --> will serve Collector and InformationProcessor --> (under thinking)

+ Indexer/Searcher (almost done, based on Lucene)
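
A first rough cut of how some of the crawler interfaces in the plan above could fit together is sketched below. All names and signatures are placeholders I made up for this sketch, nothing is final. A plain FIFO queue stands in for the Scheduler, and the Collector keeps seen URLs in memory instead of a URL database.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Downloads a page and returns its content, or null if it is unreachable.
interface Fetcher {
    String fetch(String url);
}

// Pulls the outgoing links out of a downloaded page.
interface Extractor {
    List<String> extractLinks(String pageContent);
}

// Remembers URLs; returns false when the URL was already collected.
interface Collector {
    boolean collect(String url);
}

class InMemoryCollector implements Collector {
    private final Set<String> seen = new HashSet<String>();

    public boolean collect(String url) {
        return seen.add(url); // add() is false for duplicates
    }
}

class CrawlLoop {
    // Breadth-first crawl from one seed; returns the URLs in visiting order.
    static List<String> crawl(String seed, Fetcher fetcher, Extractor extractor, int maxPages) {
        Collector collector = new InMemoryCollector();
        Deque<String> frontier = new ArrayDeque<String>(); // stands in for the Scheduler
        List<String> visited = new ArrayList<String>();
        if (collector.collect(seed)) {
            frontier.add(seed);
        }
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            visited.add(url);
            String page = fetcher.fetch(url);
            if (page == null) {
                continue; // nothing to extract from an unreachable page
            }
            for (String link : extractor.extractLinks(page)) {
                if (collector.collect(link)) {
                    frontier.add(link); // only unseen URLs enter the frontier
                }
            }
        }
        return visited;
    }
}
```

Keeping the duplicate check inside the Collector means the loop itself never has to know how URLs are stored, which should make it easier to swap in a real URL database later.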
posted @ 2006-05-19 09:40 Dedian | Views (296) | Comments (1)

Often, if you wanna check/analyze source code or contribute to open source communities, you will want to download the source code of a project and load (or import) it into your own IDE (if you don't wanna use CVS or SVN).

The following is my favorite way to do that under Eclipse:

1. create a new blank Java project:

File -> New -> Project... -> Java Project -> Next -> input the project name (project layout: Create separate source and output folders) -> click Finish

2. right-click the source folder "src" -> Import... -> select File system -> choose the folder where you put the downloaded source code by clicking the top "Browse..." button (this should be the root folder of the sources, so that the folder structure is kept as the package structure) -> Finish

3. if you imported the wrong source code folder, you can delete the whole project and redo it (merely deleting the failed packages is no use).


If an Ant build file (such as build.xml) is included in the source code package, that is even better: just use File -> New -> Project... -> Java Project from Existing Ant Buildfile.
The behavior of a web crawler is the outcome of a combination of policies:

  • A selection policy that states which pages to download.
  • A re-visit policy that states when to check for changes to the pages.
  • A politeness policy that states how to avoid overloading websites.
  • A parallelization policy that states how to coordinate distributed web crawlers.

Problem Description:

I wanna build the GData source code under Eclipse, which contains code that creates type-specific maps. The Eclipse IDE complains with something like: Syntax error, parameterized types are only available if source level is 5.0


The new feature to create a type-specific map can only be supported at source level 5.0


Do some IDE compiler configuration:
Window > Preferences > Java > Compiler > Compiler compliance level => 5.0

1. type-specific map:  create a map that will hold only objects of a certain type
Map<Integer, String> map = new HashMap<Integer, String>();

map.put(1, "first");
map.put(2, "second");
2. if source level 5.0 is applied, type safety should be kept in mind for collection data types such as Vector, List, Stack or Map.
that means that where you could write code like this under level 1.4:

private Vector MyList = new Vector();

you'd better change it to something like this under level 5.0:

private Vector<String> MyList = new Vector<String>();
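
A short sketch of what the type parameter buys you (the class and method names are made up for illustration): with the raw Vector a wrong element is only detected by a ClassCastException at run time, while the parameterized Vector rejects it at compile time and needs no cast.

```java
import java.util.Vector;

class TypeSafetyDemo {
    // Level 1.4 style: raw Vector, element types are checked only at run time.
    @SuppressWarnings({"unchecked", "rawtypes"})
    static boolean rawVectorFailsAtRuntime() {
        Vector list = new Vector();
        list.add("first");
        list.add(Integer.valueOf(2)); // compiles, but is probably a mistake
        try {
            for (int i = 0; i < list.size(); i++) {
                String s = (String) list.get(i); // fails on the Integer
            }
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    // Level 5.0 style: the compiler enforces the element type.
    static String typedVector() {
        Vector<String> list = new Vector<String>();
        list.add("first");
        // list.add(Integer.valueOf(2)); // would not compile
        return list.get(0); // no cast needed
    }

    public static void main(String[] args) {
        System.out.println(rawVectorFailsAtRuntime()); // prints true
        System.out.println(typedVector());             // prints first
    }
}
```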

1. Develop a search engine just for weblogs (the main work will be on the WebCrawler; the Indexer and Searcher parts have already been done for XML-based information retrieval)

    a. Weblogs have become more and more popular recently
    b. Though there are some weblog search engines such as Technorati and Blogdigger, there still seems to be lots of work to do.
    c. the weblog feed formats (RSS 2.0 & Atom) are XML-based and more standardized, which is very close to my current work on XML-based information retrieval
    d. easily extensible for crawling XML-based information websites besides weblogs
         a. Utilize GData for feeding XML-based information
or      b. use some open source crawlers + Lucene (similar idea in this article)
or      c. develop my own simple crawler package and merge it into my Shemy project, which is a cluster-structured search engine design based on Lucene

         likely: c > a > b (because most open source crawlers are meant to deal with much more complex web pages/links, while weblog feeds are simpler, so a crawler for them should be lighter)

Requirement/Functionality Analysis : (in progress)

Schedule: (in progress)

2. Exploration of performance tuning on searching issues to improve the Shemy kernel
