1. Develop a search engine dedicated to weblogs (the main work will be on the
web crawler; the indexer and searcher parts have already been done for
xml-based information retrieval)
Motivation:
a. Weblogs have become more and more popular recently
b. Although there are already some weblog search engines, such as
Technorati and Blogdigger, there still seems to be a lot of work left to do.
c. The weblog feed formats (RSS2.0 & Atom) are xml-based and more
standardized, which is very close to my current work on xml-based
information retrieval
d. Easily extensible to crawling xml-based information websites other than weblogs
HOWTO:
a. Use GData to feed xml-based information
or b. use some open source crawlers + Lucene (a similar idea is described in
this article)
or c. develop my own simple crawler package and merge it into my Shemy
project, which is a cluster-structured search engine design based on Lucene
Likely preference: c > a > b (because most open source crawlers are designed
to deal with much more complex web pages/links, while weblog feeds are
simpler, so the crawler for them should be lighter)
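To give a rough idea of how light the feed-handling part of option c could be, here is a minimal sketch in Python (not Shemy code, and not tied to Lucene). It only parses a feed that has already been downloaded and pulls out the fields an indexer would need. The function name `parse_feed`, the output field names, and the choice of the Atom 1.0 namespace are my own assumptions for illustration; older Atom 0.3 feeds use a different namespace and would need an extra branch.

```python
import xml.etree.ElementTree as ET

# Assumption: Atom 1.0 namespace; Atom 0.3 feeds use a different one.
ATOM_NS = "{http://www.w3.org/2005/Atom}"


def parse_feed(xml_text):
    """Return a list of {title, link, summary} dicts from an RSS 2.0 or Atom feed."""
    root = ET.fromstring(xml_text)
    entries = []
    if root.tag == "rss":
        # RSS 2.0: posts are <item> elements under <channel>.
        for item in root.iter("item"):
            entries.append({
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
                "summary": item.findtext("description", default=""),
            })
    elif root.tag == ATOM_NS + "feed":
        # Atom: posts are <entry> elements; the URL sits in the href attribute of <link>.
        for entry in root.iter(ATOM_NS + "entry"):
            link_el = entry.find(ATOM_NS + "link")
            entries.append({
                "title": entry.findtext(ATOM_NS + "title", default=""),
                "link": link_el.get("href", "") if link_el is not None else "",
                "summary": entry.findtext(ATOM_NS + "summary", default=""),
            })
    # Any other root element is treated as "not a feed" and yields no entries.
    return entries
```

Each returned dict could then be handed to the indexer as one document. Fetching, scheduling and duplicate detection are left out on purpose; they are the parts a real crawler package would still have to add.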
Requirement/Functionality Analysis : (in progress)
Schedule: (in progress)
2. Explore performance tuning of search to improve the Shemy kernel