1. Getting the IP Address of a Hostname

try {
    InetAddress addr = InetAddress.getByName("");
    byte[] ipAddr = addr.getAddress();

    // Convert to dot representation
    String ipAddrStr = "";
    for (int i=0; i<ipAddr.length; i++) {
        if (i > 0) {
            ipAddrStr += ".";
        }
        // Mask with 0xFF so the signed byte is read as 0..255
        ipAddrStr += ipAddr[i]&0xFF;
    }
} catch (UnknownHostException e) {
}
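As a possible shortcut for the manual conversion above, InetAddress.getHostAddress() already returns the textual form of the address. The following is a minimal sketch: the class name DottedQuad and the sample address 192.168.1.10 are made up for illustration, and getByAddress is used so that no DNS lookup takes place.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class DottedQuad {
    // Manual conversion, the same idea as the loop above.
    static String toDotted(byte[] ipAddr) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < ipAddr.length; i++) {
            if (i > 0) {
                sb.append('.');
            }
            sb.append(ipAddr[i] & 0xFF); // mask so the signed byte reads as 0..255
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnknownHostException {
        byte[] raw = {(byte) 192, (byte) 168, 1, 10}; // sample address (assumption)
        InetAddress addr = InetAddress.getByAddress(raw);
        System.out.println(toDotted(addr.getAddress()));
        System.out.println(addr.getHostAddress()); // built-in textual form
    }
}
```

The manual loop only makes sense for 4-byte IPv4 addresses; for IPv6, getHostAddress() produces the colon-separated form instead.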

2. Getting the Hostname of an IP Address

This example attempts to retrieve the hostname for an IP address. Note that getHostName() may not succeed, in which case it simply returns the IP address.

try {
// Get hostname by textual representation of IP address
InetAddress addr = InetAddress.getByName("");

// Get hostname by a byte array containing the IP address
byte[] ipAddr = new byte[]{127, 0, 0, 1};
addr = InetAddress.getByAddress(ipAddr);

// Get the host name
String hostname = addr.getHostName();

// Get canonical host name
String hostnameCanonical = addr.getCanonicalHostName();
} catch (UnknownHostException e) {
}

3. Getting the IP Address and Hostname of the Local Machine

    try {
InetAddress addr = InetAddress.getLocalHost();

// Get IP Address
byte[] ipAddr = addr.getAddress();

// Get hostname
String hostname = addr.getHostName();
} catch (UnknownHostException e) {
}

In the last digest, about the greatest software ever written, I noted a worm named Morris, which the author ranks 12th among the greatest software. After finishing the development of my clustering search engine, which is based on Lucene, I have been studying P2P architecture for my distributed search engine (more precisely, the web crawler part). While reading some papers on P2P lookup protocols such as Chord, I noticed that one of the developers is also named Morris. Hmm, it is the same Morris: from the wiki, I learned that he is now an associate professor at MIT and was indicted over the damage caused by his Morris worm. Anyway, I find it very interesting to learn some of the stories behind these geeks.
12. The Morris worm
11. Google search rank
10. Apollo guidance system
9. Excel spreadsheet
8. Macintosh OS
7. Sabre system
6. Mosaic browser
5. Java language
4. IBM System 360 OS
3. Gene-sequencing software at the Institute for Genomic Research
2. IBM's System R
1. Unix System III

What do you think?
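Everyone knows that Google runs on a great many Linux (GNU) servers, but that is only one aspect of its support for free software. There are others: Summer of Code has become an incubator that produces many excellent projects and much good code, and the recently opened Code Repository looks set to replace (writer's note: a major home of open source). On one hand, Google contributed its Picasa (for the Linux (GNU) platform) (writer's note: a photo management program), which runs on top of Wine (writer's note: Windows on Linux/Unix, built on X Window); on the other hand, Google also sponsors some open source projects, such as Sri Lanka, with about $25,000.
Of course, Google also funds open source in quieter ways. For example, to everyone's surprise, the Mozilla Foundation (writer's note: the organization behind the familiar browser Firefox) earned 72 million last year, simply by making Google's search engine the default search engine in Firefox.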

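In January 2005, Google hired Ben Goodger, the lead engineer of Firefox and one of the main open source coders. By the end of the year, Guido van Rossum, the creator of Python, had also joined Google. Recently Andrew Morton, the maintainer of the Linux 2.6 kernel, announced that he will leave OSDL and move to Google.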


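I remember that in the earliest years, people wrote their own code in their spare time, out of personal interest, learning as they worked. Then the first dot-com era arrived, and many early open source companies began hiring top programmers: kernel coders such as Alan Cox, David Miller, and Stephen Tweedie went to Red Hat, and others went to Linuxcare.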





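The debate over whether companies that use open source code should also open their own code is not only about Google. Other major beneficiaries include Yahoo, which has recently been busy acquiring Web 2.0 companies that clearly bear the mark of open source. Its relationship with open source does not go back as far as Google's, of course, but Yahoo has also started working to attract open source talent.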
People are still talking about Web 2.0, but I am not sure it is a purely technical term. In my understanding, most of the meaning of Web 2.0 may be marketing: the web is becoming a commons, and people generate its content. Again, I am not sure where web services fit into Web 2.0. In my understanding, the web is not merely a client-server model (I am not talking about web structure here) but an interactive community. But the question is: who is going to be the operator or administrator of this community, and are there any rules of the game to follow? Will it be another utopia?

Well, on the technical layer, I'd like to shed some light on the so-called web standards trends:

1. front end --
         CSS ----> layout
         XML ----> data 
         XHTML ----> markup
         JavaScript & DOM ----> behavior + XMLHttpRequest --> AJAX ?

2. back end -- 
         some open source projects such as Ruby on Rails...

Let me know what you think...

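As the founder of Lucene and Nutch, two major Apache open source projects (there are also related subprojects such as Lucy, Lucene4C, and Hadoop), Doug Cutting has long been followed closely by search engine developers. After working for Yahoo as a contractor for four years, he finally joined Yahoo as an employee this year.

Below is my translation, done in my spare time after work, of an interview with him from two years ago. The original (Doug Cutting Interview) is easy to find with a Google search. I hope it can serve as a starting point for beginners in search engine development.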



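I mainly work from home on two search-related open source projects: Lucene and Nutch. The money mostly comes from contracts related to these projects. At the moment Yahoo! Labs is partly sponsoring Nutch. The two projects also have some other short-term contracts.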






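 -- Fetching: downloading the pages that are pointed to.
 -- Database: storing information about the fetched pages, such as which pages have been fetched, when they were fetched, and which pages they link to.
 -- Link analysis: analyzing the information in that database and assigning each page some weights (such as PageRank, WebRank, and so on) in order to estimate each page's importance. In my view, though, indexing the content of the anchors is more important. (This is also why things like Google Bombing are so effective.)
 -- Indexing: indexing the fetched page content, together with inbound links, link analysis weights, and other information, so that it can be queried quickly.
 -- Searching: running a query against an index and displaying the results ordered by page rank.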
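
The indexing and searching steps in the list above can be illustrated with a toy inverted index. This is only a minimal sketch under simple assumptions (whitespace and punctuation tokenizing, no ranking), not how Nutch or Lucene actually implement it; the class TinyIndex and its methods are hypothetical names chosen for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TinyIndex {
    // term -> sorted ids of the pages that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();

    // Indexing: split the fetched page text into lowercase terms and record the page id.
    void index(int pageId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            Set<Integer> ids = postings.get(term);
            if (ids == null) {
                ids = new TreeSet<Integer>();
                postings.put(term, ids);
            }
            ids.add(pageId);
        }
    }

    // Searching: look up a single term; ranking is left out of this sketch.
    Set<Integer> search(String term) {
        Set<Integer> ids = postings.get(term.toLowerCase());
        return ids == null ? Collections.<Integer>emptySet() : ids;
    }
}
```

A real engine would additionally store the link analysis weights next to each posting so that results can be ordered at query time.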



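Unfortunately, most of them are probably out of luck, because Nutch still requires a Java servlet container (writer's note: such as Tomcat). Some ISPs support this, but most do not. (Writer's note: only if you have control over the Apache server can you install something like Tomcat on it.)

5. Can I combine Lucene with the Google Web API? Or with some other applications I wrote earlier?

A group of people has already written some Google-like APIs for Nutch, but none of them has been merged into the current system yet. That will probably happen in the near future.

6. Where do you think the biggest obstacle to implementing a search engine lies today? Is it hardware, storage, or ranking algorithms? Also, can you tell me roughly how much space a search engine needs to work properly, say if I only want to write a search engine for thousands or millions of RSS feeds?

Nutch needs roughly 10 KB of space per page in total. RSS feed pages are generally small (writer's note: RSS feeds are XML-based text pages, so they are not large), so they should be easier to handle. Of course, Nutch does not support RSS yet. (Writer's note: in fact, the API does contain data structures and parsing for RSS.)

7. Is it easy to get funding from Yahoo! Labs? Who can apply? And what do you have to do in return?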



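I have talked with some of the people there, including Larry Page (writer's note: one of the two founders of Google). They were all willing to offer some help, but they could not find a suitable way to do it that would not also help their competitors.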













The details are described in this article (micro/mi2003/ m2022.pdf).


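We have not found the time to work on this yet. Clearly, though, it is a very important area. Before we get into link farms, we need to do some simple things: look for word stuffing (writer's note: embedding certain words in a page many times, even hundreds of times, some of them invisible to the eye, for example white text on a white background; this is one form of spamdexing), white-on-white text, and so on.

I think that in general (spam detection being one sub-problem), the key to search quality is having a supporting mechanism for reliable manual evaluation of query results. With that, we can train a ranking algorithm to produce better results (spam results are one kind of bad result). Commercial search engines usually hire people to do reliable evaluations. Nutch will do this too, but obviously we cannot simply accept every volunteered evaluation, because spammers could easily game those evaluations. So we need a means of establishing trust in volunteer evaluators. I think a peer-review system, somewhat like Slashdot's karma system, should be very helpful here.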




--  Getting Ready to Use CVS

First set the variable CVSROOT to /class/`username`/cvsroot
[Or any other directory you wish]
[For csh/tcsh: setenv CVSROOT ~/cvsroot]
[For bash/ksh: CVSROOT=~/cvsroot;export CVSROOT]

Next run cvsinit. It will create this directory along with the subdirectory CVSROOT and put several files into CVSROOT.

-- How to put a project under CVS

A simple program consisting of multiple files is in /workspaces/project.

To put this program under cvs first

cd to /workspaces/project


cvs import -m "Sample Program" project sample start

CVS should respond with
N project/Makefile
N project/main.c
N project/bar.c
N project/foo.c

No conflicts created by this import

If you were importing your own program, you could now delete the original source.
(Of course, keeping a backup is always a good idea)

-- Basic CVS Usage

Now that you have added 'project' to your CVS repository, you will want to be able to modify the code.

To do this you want to check out the source. You will want to cd to your home directory before you do this.


cvs checkout project

CVS should respond with
cvs checkout: Updating project
U project/Makefile
U project/bar.c
U project/foo.c
U project/main.c

This creates the project directory in your home directory and puts the files: Makefile, bar.c, foo.c, and main.c into the directory along with a CVS directory which stores some information about the files.

You can now make changes to any of the files in the source tree.
Let's say you add a printf("DONE\n"); after the function call to bar()
[Or just cp /class/bfennema/project_other/main2.c to main.c]

Now you have to check in the new copy

cvs commit -m "Added a DONE message." main.c

CVS should respond with
Checking in main.c;
/class/'username'/cvsroot/project/main.c,v <-- main.c
new revision: 1.2; previous revision: 1.1

Note, the -m option lets you define the check-in message on the command line. If you omit it, you will be placed into an editor where you can type in the check-in message.

-- Using CVS with Multiple Developers

To simulate multiple developers, first create a directory for your second developer.
Call it devel2 (Create it in your home directory).
Next check out another copy of project.
  • HINT: cvs checkout project
Next, in the devel2/project directory, add a printf("YOU\n"); after the printf("BAR\n");
[Or copy /class/bfennema/project_other/bar2.c to bar.c]

Next, check in bar.c as developer two.
  • HINT: cvs commit -m "Added a YOU" bar.c
Now, go back to the original developer directory.
[Probably /class/'username'/project]

Now look at bar.c. As you can see, the change made by developer two has not been integrated into your version. For that to happen you must

cvs update bar.c

CVS should respond with
U bar.c

Now look at bar.c. It should now be the same as developer two's.
Next, edit foo.c as the original developer and add printf("YOU\n"); after the printf("FOO\n");
[Or copy /class/bfennema/project_other/foo2.c to foo.c]

Then check in foo.c

  • HINT: cvs commit -m "Added YOU" foo.c
Next, cd back to developer two's directory.
Add printf("TOO\n"); after the printf("FOO\n");
[Or copy /class/bfennema/project_other/foo3.c to foo.c]

Now type

cvs status foo.c

CVS should respond with
File: foo.c             Status: Needs Merge

   Working revision:    1.1     'Some Date'
   Repository revision: 1.2     /class/'username'/cvsroot/project/foo.c,v
   Sticky Tag:          (none)
   Sticky Date:         (none)
   Sticky Options:      (none)
The various statuses of a file are:
Up-to-date
    The file is identical with the latest revision in the repository.
Locally Modified
    You have edited the file, and not yet committed your changes.
Needing Patch
    Someone else has committed a newer revision to the repository.
Needs Merge
    Someone else has committed a newer revision to the repository, and you have also made modifications to the file.

Therefore, this is telling us we need to merge our changes with the changes made by developer one. To do this

cvs update foo.c

CVS should respond with
RCS file: /class/'username'/cvsroot/project/foo.c,v
retrieving revision 1.1
retrieving revision 1.2
Merging differences between 1.1 and 1.2 into foo.c
rcsmerge: warning: conflicts during merge
cvs update: conflicts found in foo.c
C foo.c

Since the changes we made to each version were so close together, we must manually adjust foo.c to look the way we want it to look. Looking at foo.c we see:
void foo()
<<<<<<< foo.c
  printf("TOO\n");
=======
  printf("YOU\n");
>>>>>>> 1.2

We see that the text we added as developer one is between the ======= and the >>>>>>> 1.2.
The text we just added is between the ======= and the <<<<<<< foo.c

To fix this, move the printf("TOO\n"); to after the printf("YOU\n"); line and delete the additional lines that CVS inserted. [Or copy /class/bfennema/project_other/foo4.c to foo.c]
Next, commit foo.c

cvs commit -m "Added TOO" foo.c

Since you issued a cvs update command and integrated the changes made by developer one, the integrated changes are committed to the source tree.

-- Additional CVS Commands

To add a new file to a module:
  • Get a working copy of the module.
  • Create the new file inside your working copy.
  • Use cvs add filename to tell CVS to version control the file.
  • Use cvs commit filename to check in the file to the repository.

Removing files from a module:
  • Make sure you haven't made any uncommitted modifications to the file.
  • Remove the file from the working copy of the module. rm filename.
  • Use cvs remove filename to tell CVS you want to delete the file.
  • Use cvs commit filename to actually perform the removal from the repository.

For more information see the cvs man pages or the file in cvs-1.7/doc.

copy from
When reading the GData source code, you will find a lot of generic-style code in it; generics are one of several language extensions in JDK 1.5. If you are using the Java 1.5 compiler, it is well worth getting some idea of generics. Be aware that Java generics look like C++ templates but are quite different.

1. What is the idea of generics?
Put simply, generics are a way of parameterizing types, including class types and other data types.

2. Examples?
-- We are familiar with container types such as Collection. Here is a typical usage from the old days (Java 1.4 or earlier):
Vector myList = new Vector();
myList.add(new Integer(100));
Integer value = (Integer)myList.get(0);

Now it is better to write it like this for type safety (the Eclipse IDE displays type safety warnings for the code above when the Java 1.5 compiler option is set):
  Vector<Integer> myList = new Vector<Integer>();
  myList.add(new Integer(100));
  Integer value = myList.get(0);

-- The reason we can write code like this is that class Vector is defined as a generic:
public class Vector<E> {
      public boolean add(E e) { ... }
}

-- When we see angle brackets in a declaration, that is a generic. An invocation such as Vector<Integer> is a parameterized type: to use the generic, we need to specify an actual type argument (such as Integer above).

3. Tricky points of generics

-- Generics make types such as containers more flexible in what they accept, but they can also be tricky. Taking containers as an example, one tricky question is: can we assign one container to a variable of another container type? If you want to do it in the following style, the answer is no.
List<String> ls = new ArrayList<String>();
List<Object> lo = ls; //compile time error!

-- We know String is a subtype of Object, so we can assign a String value to an Object variable. But we cannot assign a List of String to a List of Object as a whole. The reason is that through the List<Object> reference we could add elements that are not Strings to what is really a List<String>, which would make the list type-unsafe. So the Java 1.5 compiler will not let you do that.
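
To see why the compiler rejects the assignment, it may help to compare with arrays, which do allow the analogous assignment and pay for it with a runtime check. This is a small sketch; the class name InvarianceDemo is only illustrative.

```java
import java.util.ArrayList;
import java.util.List;

class InvarianceDemo {
    // Arrays are covariant: the assignment compiles, but the bad store fails at run time.
    static boolean arrayStoreFails() {
        String[] strings = new String[1];
        Object[] objects = strings;           // allowed for arrays
        try {
            objects[0] = Integer.valueOf(42); // not a String
            return false;
        } catch (ArrayStoreException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> ls = new ArrayList<String>();
        // List<Object> lo = ls;  // does not compile: generic types are invariant
        // lo.add(new Object());  // ...otherwise this would put a non-String into ls
        List<? extends Object> view = ls; // a wildcard view is allowed instead
        System.out.println(view.size());
        System.out.println(arrayStoreFails());
    }
}
```

With generics the same mistake is caught at compile time, so no runtime check is needed.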

-- Comparing the two styles of code in the examples in section 2, we might say the older style looks more flexible, because myList can accept other data types besides Integer, while the new 1.5 style can only take Integer values. Well, if we need more flexibility, we can apply wildcards.

4. Wildcards and bounded wildcards

-- If we see something like Collection<?> c, with a question mark in the angle brackets, that is a wildcard. It means the element type is unknown; a collection of any element type matches it.
-- If we see something like Collection<? extends Number> c, that is a bounded wildcard. It means the element type has an upper bound: it must be Number or a subtype of Number, so a collection of any other element type does not match.
-- But whether the wildcard is bounded or not, we cannot add a value of a specific type (other than null) to the collection, because the wildcard means the element type is unknown, and you cannot give a value to an unknown type.
-- So what on earth can a wildcard be used for? Go back to the flexibility we mentioned before: we use a wildcard to express that flexibility in a definition or declaration, not to modify the collection.
For example, we can define a method like this:
void printCollection(Collection<?> c) {
      for (Object e : c) { System.out.println(e); }
}
See? That is flexible. You can call this function for any Collection. You can use the elements of a Collection<?>, just don't try to put something into it.
-- So the question is: what if we want that flexibility for our method and we also need to put something into the collection inside the method? Then we need to use a generic method.
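
The read-only use of a bounded wildcard described above can be sketched as follows. WildcardDemo and sum are illustrative names, not part of any library.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

class WildcardDemo {
    // Accepts a collection of Integer, Double, or any other subtype of Number.
    static double sum(Collection<? extends Number> c) {
        double total = 0;
        for (Number n : c) {             // reading elements as Number is safe
            total += n.doubleValue();
        }
        // c.add(Integer.valueOf(1));    // does not compile: the element type is unknown
        return total;
    }

    public static void main(String[] args) {
        List<Integer> ints = Arrays.asList(1, 2, 3);
        List<Double> doubles = Arrays.asList(0.5, 1.5);
        System.out.println(sum(ints));
        System.out.println(sum(doubles));
    }
}
```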

5. Generic method
-- that means method declaration can also be parameterized.
-- example:
    public <T> void addCollection(List<T> objs, T obj) { objs.add(obj); }
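
A possible call site for such a method, as a sketch: the class name GenericMethodDemo is hypothetical, the method is made static only to keep the example short, and the compiler infers T from the arguments.

```java
import java.util.ArrayList;
import java.util.List;

class GenericMethodDemo {
    // T ties the element type of the list to the type of the object being added.
    static <T> void addCollection(List<T> objs, T obj) {
        objs.add(obj);
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<String>();
        addCollection(names, "Chord");              // T inferred as String

        List<Number> nums = new ArrayList<Number>();
        addCollection(nums, Integer.valueOf(7));    // T inferred as Number

        // addCollection(names, Integer.valueOf(7)); // does not compile
        System.out.println(names + " " + nums);
    }
}
```

Unlike the wildcard version, the method can add to the list, because inside the method the element type has a name.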

6. When to use a generic method and when to use a wildcard?
-- If the type parameter is used only once, or has no relationship to the other arguments of the method or to its return type, then a wildcard is the better choice because it states the meaning more clearly and concisely.
-- Otherwise, a generic method should be used.
class Collections {
      public static <T, S extends T> void copy(List<T> dest, List<S> src){...}
}
can be better rewritten as:
class Collections {
      public static <T> void copy(List<T> dest, List<? extends T> src){...}
}
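
A sketch of how the wildcard version might be implemented and called. CopyDemo is a made-up class, used so as not to clash with java.util.Collections, whose real copy method has a slightly different signature and behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CopyDemo {
    // src only has to produce values assignable to T, so a wildcard is enough for it.
    static <T> void copy(List<T> dest, List<? extends T> src) {
        for (int i = 0; i < src.size(); i++) {
            if (i < dest.size()) {
                dest.set(i, src.get(i));  // overwrite an existing slot
            } else {
                dest.add(src.get(i));     // grow dest when it is shorter than src
            }
        }
    }

    public static void main(String[] args) {
        List<Number> dest = new ArrayList<Number>();
        List<Integer> src = Arrays.asList(1, 2, 3);
        copy(dest, src);                  // T is Number, src is a List<? extends Number>
        System.out.println(dest);
    }
}
```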


While debugging my web crawler by crawling the Yahoo website, I found that connecting to certain website URLs throws the following exception:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 12
 at java.lang.String.substring(Unknown Source)
 at Source)
 at Source)
 at Source)
 at Source)
 at Source)

The following is simple test code:
private static final String urlstring = "";

   URL url = new URL(urlstring);
   URLConnection con = url.openConnection();

Since this code is required to catch no exceptions other than MalformedURLException and IOException, I am not sure whether this is a bug in Java's URL parsing...

Does anybody have an idea about that?

P.S. OK, somebody has pointed out that runtime exceptions, like java.lang.StringIndexOutOfBoundsException, do not have to be declared but can still be thrown. So I need to catch StringIndexOutOfBoundsException in my code. But in my understanding, a function should catch all the exceptions from lower-level functions and rethrow the ones it cannot handle, so that callers can catch exceptions coming from deep functions. I am not sure runtime exceptions should be an exception to that...
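
The difference between the declared (checked) exceptions and an undeclared runtime exception can be sketched like this. UrlGuard and tryOpen are hypothetical names, and any URL string passed in is a placeholder; note that openConnection() only creates the connection object and does not touch the network yet.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

class UrlGuard {
    // Returns null instead of letting either kind of exception escape.
    static URLConnection tryOpen(String urlString) {
        try {
            URL url = new URL(urlString);
            return url.openConnection();
        } catch (MalformedURLException e) {  // checked: must be caught or declared
            return null;
        } catch (IOException e) {            // checked: must be caught or declared
            return null;
        } catch (RuntimeException e) {       // unchecked, e.g. StringIndexOutOfBoundsException
            return null;
        }
    }

    // substring() declares no exceptions, yet it can throw at run time.
    static boolean substringThrows() {
        try {
            "short".substring(12);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            return true;
        }
    }
}
```

Catching RuntimeException broadly is a defensive choice for a crawler that must survive bad input; in other code it can hide real bugs.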
Still working on the web crawler part; the URL collection strategies are under consideration. A URL frontier, which stores the list of active URLs to be parsed or downloaded, will be used to handle synchronized I/O operations with the URL collection/inventory. I am stuck on some issues:

1. Duplicate URL elimination:
    a. Host name aliases --> DNS resolver
    b. Omitted port numbers
    c. Alternative paths on the same host
    d. Replication across different hosts
    e. Nonsense links or session IDs embedded in URLs?
2. Reachability of URLs
3. Distributed storage of the URL inventory and the related synchronization problem
4. Fetch strategies for the URL frontier or fetcher to get active links for parsing
5. Scheduler for fetching and updating the URL collection: multi-threaded or single-threaded on each PC, and when to decide to re-parse a page
7. URL-seen test: has the page been parsed already, and should it be re-parsed? This should be done before a URL enters the URL frontier...
8. Extensibility issues for the modules: Fetcher, Extractor/Filters, Collector...
9. Checkpointing for interrupted crawls: how to resume a crawler job, and how to split crawler jobs and distribute them to different machines
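
Items 1b and 7 above (omitted port numbers and the URL-seen test) could be approached by normalizing each URL before remembering it. This is a rough sketch under simple assumptions: only http/https default ports, case, "." and ".." path segments, and fragments are handled; host aliases, mirrors, and session IDs are not. The class UrlSeenTest and its methods are hypothetical names.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Set;

class UrlSeenTest {
    private final Set<String> seen = new HashSet<String>();

    // Reduce equivalent spellings of a URL to one canonical string.
    // Assumes an absolute URL with a host; anything else is out of scope here.
    static String normalize(String raw) throws URISyntaxException {
        URI u = new URI(raw).normalize();          // resolves "." and ".." segments
        String scheme = u.getScheme().toLowerCase();
        String host = u.getHost().toLowerCase();
        int port = u.getPort();
        if ((scheme.equals("http") && port == 80)
                || (scheme.equals("https") && port == 443)) {
            port = -1;                             // drop the default port
        }
        String path = u.getPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        // A null fragment drops "#..." since it never changes the fetched page.
        return new URI(scheme, null, host, port, path, u.getQuery(), null).toString();
    }

    // True only the first time a URL (in any equivalent spelling) is offered.
    boolean firstVisit(String raw) throws URISyntaxException {
        return seen.add(normalize(raw));
    }
}
```

For a large crawl the in-memory HashSet would have to be replaced by something disk-backed or distributed, which is exactly the storage problem listed in item 3.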

It seems that I need a couple of days to refine my system architecture design...
Here is an article on effective I/O programming; I am marking it so that I can recheck the I/O design of my distributed search engine system in the future. My current system uses non-blocking synchronous mode. Later I need to check whether anything can be done to improve its performance and scalability.

