It's a sunny day! MiniSpider v0.1 has launched!
You can visit the page generated by MiniSpider v0.1. It's simple, but it's just a beginning.
First, I think I should explain it.
Every day, I visit many websites to read news about Java: TSS, InfoQ, BlogJava and so on. To avoid missing any important news, I have to visit each of them separately. It costs me so much time! So I want to change that. I want to be a lazy man!
As you can see on the page, what is important? Yeah, just the title! If I am interested in a title, I will follow the link and read its content. If not, the rest is useless to me!
So, the point of MiniSpider is: get the titles from several websites, then create a new page from them.
Then you can visit a single page to get everything. It's so cool, don't you think?!
As for the technology, it is based on the Python language.
How can I get the title?
For example, you will see code snippets like the one below on many websites:
<h3><a href="http://www.blogjava.net/alwayscy/archive/2006/12/03/85161.html">用OpenSSL与JAVA(JSSE)通信</a></h3>
Yes, to get the title, we just need the link and the description.
For now, I do it this way:
1. Get the string between the tags "<h3>" and "</h3>".
2. Get the link and the description from that string.
3. Generate a new string like this:
<a href="http://www.blogjava.net/alwayscy/archive/2006/12/03/85161.html" target="_blank">用OpenSSL与JAVA(JSSE)通信</a>
4. Create the new page using the new strings.
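Step 3 can be sketched in a few lines. This is only an illustration, not MiniSpider's actual code: the function name `make_link` is made up here, and escaping the values with `html.escape` is an extra precaution the steps above do not mention.

```python
from html import escape


def make_link(href, description):
    """Build an anchor that opens in a new window, as in step 3.

    make_link is a hypothetical helper, not a MiniSpider function.
    """
    # Escaping keeps stray quotes, ampersands or angle brackets
    # from breaking the generated markup.
    return '<a href="%s" target="_blank">%s</a>' % (
        escape(href, quote=True),
        escape(description),
    )


link = make_link(
    "http://www.blogjava.net/alwayscy/archive/2006/12/03/85161.html",
    "用OpenSSL与JAVA(JSSE)通信",
)
print(link)
```

For the example title this prints the same anchor shown in step 3.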
Now, MiniSpider contains 3 main modules:
1. Get the string between two tags
Yeah, the choice of tags is very important. First, the tags should be common to all the titles; second, they should be distinct from the rest of the HTML source.
Currently, in MiniSpider you can configure the tags manually in a .ini file.
It uses the Python re library. You may want to ask: why use regular expressions? Why not sgmllib? The tags are HTML tags too! It's a great question. All I can say is: it's one way, and I am still looking for a better one. It needs time.
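A rough sketch of how this module could work, under stated assumptions: the layout of the .ini file is not shown here, so the section name `blogjava` and the keys `start_tag` and `end_tag` are invented for illustration, as are the helper names `load_tags` and `between_tags`. It is written for Python 3, where the config module is called `configparser`.

```python
import configparser
import re

# Hypothetical .ini content; the real MiniSpider file layout may differ.
INI_TEXT = """
[blogjava]
start_tag = <h3>
end_tag = </h3>
"""


def load_tags(ini_text, section):
    """Read the start and end tags for one website from .ini text."""
    # interpolation=None leaves any '%' in a tag untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return parser[section]["start_tag"], parser[section]["end_tag"]


def between_tags(html, start_tag, end_tag):
    """Return every string found between start_tag and end_tag."""
    # re.escape treats the tags as literal text; (.*?) is non-greedy so
    # each tag pair matches on its own; re.S lets '.' cross line breaks.
    pattern = re.compile(
        re.escape(start_tag) + r"(.*?)" + re.escape(end_tag), re.S
    )
    return pattern.findall(html)


start, end = load_tags(INI_TEXT, "blogjava")
page = (
    '<h2>skip</h2>'
    '<h3><a href="http://example.com/1">First</a></h3>'
    '<p>x</p>'
    '<h3><a href="http://example.com/2">Second</a></h3>'
)
fragments = between_tags(page, start, end)
print(fragments)
```

The non-greedy match matters: a greedy `.*` would swallow everything from the first "<h3>" to the last "</h3>" as one fragment.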
2. Get the link and the description
It uses Python's sgmllib.
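As an illustration of this module: sgmllib was removed in Python 3, so the sketch below swaps in `html.parser.HTMLParser` from the standard library, which offers similar `handle_starttag` / `handle_data` / `handle_endtag` callbacks. The class name `LinkExtractor` and the helper `extract_links` are made up for this example and are not MiniSpider's real names.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, description) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, description) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(fragment):
    """Parse an HTML fragment and return its (href, description) pairs."""
    parser = LinkExtractor()
    parser.feed(fragment)
    parser.close()
    return parser.links


fragment = (
    '<a href="http://www.blogjava.net/alwayscy/archive/2006/12/03/85161.html">'
    '用OpenSSL与JAVA(JSSE)通信</a>'
)
links = extract_links(fragment)
print(links)
```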
3. HTML template
It's my first time using an HTML template. It has been a great experience! I like it very much!
It uses the Cheetah library.
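To show the idea of filling a template with the generated links, here is a small sketch that uses `string.Template` from the standard library as a stand-in, so it runs without installing anything. It is not Cheetah code: Cheetah also uses `$name` placeholders, but it additionally has directives such as loops, which `string.Template` lacks. The template text and the function `render_page` are assumptions for illustration only.

```python
from string import Template

# Hypothetical page layout; MiniSpider's real template is not shown here.
PAGE = Template("""<html>
<head><title>$title</title></head>
<body>
<ul>
$items
</ul>
</body>
</html>
""")


def render_page(title, links):
    """Fill the page template with a title and a list of anchor strings."""
    # string.Template has no loop construct, so the list items are
    # joined in Python before substitution.
    items = "\n".join("<li>%s</li>" % link for link in links)
    return PAGE.substitute(title=title, items=items)


html_page = render_page(
    "MiniSpider",
    ['<a href="http://example.com/1" target="_blank">First</a>'],
)
print(html_page)
```

Keeping the markup in a template and the data in Python means the page layout can change without touching the parsing code.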
For now, MiniSpider supports just one website; the others still need testing.
What's more?
As I said before, it's a beginning. There is so much work to do: support for many websites, multi-threaded parsing, and so on. And the generated page is ugly too. :)
But I think the work is interesting, and I can use some new technologies in it, for example Ajax.
So I hope you can join in. You are welcome!
As for the code, I think I should put it in a public place; SVN may be a good way. I will do it as soon as possible. Until then, you will have to wait for a while.
Thanks!