Still working on the Webcrawler part; the URL collection strategies are still being thought through. A URL frontier, which stores the list of active URLs to be parsed or downloaded, will be used to handle synchronized I/O operations against the URL collection/inventory (a rough sketch of it follows the issue list below). I'm currently stuck on these issues:
1. Duplicate URL elimination (a normalization sketch follows this list):
a. Host name aliases --> DNS Resolver
b. Omitted port numbers
c. Alternative paths on the same host
d. Replication across different hosts
e. Nonsense links or session IDs embedded in URLs?
2. Reachability of URLs
3. Distributed storage of the URL inventory and the related synchronization problems
4. Fetch strategies for the URL frontier or fetcher to get active links for parsing
5. Scheduler for fetching and updating the URL collection: multi-threaded or
single-threaded on each PC, and when to decide to re-parse a page (a toy politeness/re-crawl sketch is below)
6. URL-seen test: has this page already been parsed, and should it be re-parsed? This check should happen before a URL enters the frontier (the frontier sketch below puts it there)...
7. Extensibility issues for those modules: Fetcher, Extractor/Filters, Collector...
8. Checkpointing when crawling is interrupted: how to resume the crawler
job, and how to split crawler jobs and distribute them to different machines (a small checkpoint sketch is below)
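
To make issue 1 more concrete for myself, a minimal URL normalization sketch in Python. It only covers the easy cases: lower-casing the host, dropping an explicit default port, collapsing '.'/'..' path segments, stripping fragments and a few guessed session-ID parameter names. Host aliases still need a real DNS resolver, and SESSION_PARAMS here is just an assumption, not any standard list.

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode
import posixpath

# Parameter names that often carry session state -- a guess, not a standard list.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}
DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize_url(url):
    """Return a canonical form of url so trivially different URLs compare equal."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    scheme = scheme.lower()
    host = netloc.lower()

    # Drop an explicitly written default port (http://host:80/ -> http://host/).
    if ":" in host:
        name, _, port = host.rpartition(":")
        if port.isdigit() and int(port) == DEFAULT_PORTS.get(scheme):
            host = name

    # Collapse '.' and '..' segments so alternative paths on the same host match.
    path = posixpath.normpath(path) if path else "/"

    # Remove likely session-ID parameters and sort the rest for a stable order.
    params = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
              if k.lower() not in SESSION_PARAMS]
    query = urlencode(sorted(params))

    # Fragments never reach the server, so they are dropped entirely.
    return urlunsplit((scheme, host, path, query, ""))

print(normalize_url("HTTP://Example.com:80/a/b/../c?PHPSESSID=abc&x=1#top"))
# -> http://example.com/a/c?x=1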
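
And a rough in-memory sketch of the frontier plus the URL-seen test (issue 6), just to pin down where the check happens: a set plus a FIFO queue for now, though the real thing would need a Bloom filter or a shared store, which is exactly the open question. UrlFrontier, add and next_url are my own placeholder names, and it reuses normalize_url from the sketch above.

from collections import deque

class UrlFrontier:
    """In-memory frontier: the URL-seen test runs before a URL is admitted."""

    def __init__(self):
        self._seen = set()      # would become a Bloom filter / shared store later
        self._queue = deque()   # FIFO of active URLs waiting to be fetched

    def add(self, url):
        canonical = normalize_url(url)   # from the normalization sketch above
        if canonical in self._seen:
            return False                 # duplicate: never enters the queue
        self._seen.add(canonical)
        self._queue.append(canonical)
        return True

    def next_url(self):
        """Hand the fetcher the next active URL, or None if the frontier is empty."""
        return self._queue.popleft() if self._queue else None

frontier = UrlFrontier()
frontier.add("http://example.com/index.html")
frontier.add("http://example.com:80/index.html")   # normalizes to a duplicate, rejected
print(frontier.next_url())                          # http://example.com/index.html

Splitting this across machines would probably mean partitioning URLs by host hash so each machine owns the seen-set for its hosts, but that is part of what I still need to design.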
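
For the scheduler question (issue 5), one option I'm considering is a per-host politeness delay plus a fixed re-crawl interval; a toy version with made-up numbers:

import time

POLITENESS_DELAY = 2.0         # seconds between requests to the same host (made-up value)
RECRAWL_INTERVAL = 7 * 86400   # re-parse a page after a week (also just a placeholder)

last_fetch_per_host = {}   # host -> time of the last request to it
last_parsed = {}           # url  -> time the page was last parsed

def may_fetch_now(host):
    """Respect the per-host delay so one site is not hammered by many threads."""
    return time.time() - last_fetch_per_host.get(host, 0.0) >= POLITENESS_DELAY

def needs_reparse(url):
    """Decide whether a previously parsed page is stale enough to fetch again."""
    return time.time() - last_parsed.get(url, 0.0) >= RECRAWL_INTERVAL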
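
And for checkpointing (issue 8), the simplest idea so far: periodically pickle the frontier's queue and seen-set to disk and reload them on startup. The file name is a placeholder, and reaching into the frontier's internals is only acceptable because this is a sketch.

import os
import pickle

CHECKPOINT_FILE = "crawler_checkpoint.pkl"   # placeholder path

def save_checkpoint(frontier):
    """Persist the queue and seen-set so an interrupted crawl can resume."""
    with open(CHECKPOINT_FILE, "wb") as f:
        pickle.dump({"queue": list(frontier._queue), "seen": frontier._seen}, f)

def load_checkpoint():
    """Rebuild the frontier from the last checkpoint, or start fresh if there is none."""
    frontier = UrlFrontier()
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "rb") as f:
            state = pickle.load(f)
        frontier._queue.extend(state["queue"])
        frontier._seen.update(state["seen"])
    return frontier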
Seems that I need a couple of days to refine my system architecture design...