Post archive for July 10, 2008 - 自己的小屋

July 10, 2008

Nutch-Crawl: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=http

When I ran Nutch, I got the following error:

08/07/07 04:05:41 INFO conf.Configuration: found resource crawl-urlfilter.txt at file:/home/hut/installfiles/nutch-0.9/out/production/nutch-0.9/crawl-urlfilter.txt
08/07/07 04:05:41 INFO conf.Configuration: found resource parse-plugins.xml at file:/home/hut/installfiles/nutch-0.9/out/production/nutch-0.9/parse-plugins.xml
08/07/07 04:05:41 INFO fetcher.Fetcher: fetching http://www.yale.edu/
08/07/07 04:05:41 INFO fetcher.Fetcher: fetching http://www.harvard.edu/
08/07/07 04:05:41 INFO fetcher.Fetcher: fetch of http://www.harvard.edu/ failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=http
08/07/07 04:05:41 INFO fetcher.Fetcher: fetch of http://www.yale.edu/ failed with: org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=http

Solution: check the plugin.includes property in nutch-site.xml:

    <property>
        <name>plugin.includes</name>
        <value>
            nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
        </value>
        <description>Regular expression naming plugin directory names to
            include. Any plugin not matching this expression is excluded.
            In any case you need at least include the nutch-extensionpoints plugin. By
            default Nutch includes crawling just HTML and plain text via HTTP,
            and basic indexing and search plugins. In order to use HTTPS please enable
            protocol-httpclient, but be aware of possible intermittent problems with the
            underlying commons-httpclient library.
        </description>
    </property>

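I had mistakenly deleted "nutch-extensionpoints|" from the value. After restoring it, everything worked normally, which agrees with the property description: nutch-extensionpoints must always be included. By default, a Nutch 0.9 installation does not define plugin.includes in nutch-site.xml, so Nutch loads every plugin named by the plugin.includes property in nutch-default.xml. The reason to edit or add a plugin.includes property in nutch-site.xml is to include our own plugins; the value there overrides the one defined in nutch-default.xml.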
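To see why dropping a single name matters, here is a minimal Java sketch of the matching that the property description implies: each plugin directory name is tested against the whole expression, and anything that does not match is excluded. The class name PluginIncludesCheck and the helper isIncluded are made up for illustration and are not part of Nutch. The sketch also assumes the value is used as a single regular expression with no surrounding whitespace.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: checks plugin directory names
// against a plugin.includes style regular expression.
class PluginIncludesCheck {
    // Value of plugin.includes from the post, written on one line (assumption:
    // no leading or trailing whitespace is part of the expression).
    static final String INCLUDES =
        "nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|js)|"
        + "index-basic|query-(basic|site|url)|summary-basic|scoring-opic|"
        + "urlnormalizer-(pass|regex|basic)";

    // True only when the entire plugin directory name matches the expression.
    static boolean isIncluded(String includes, String pluginId) {
        return Pattern.compile(includes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Simulate the mistake from the post: the first alternative removed.
        String broken = INCLUDES.replace("nutch-extensionpoints|", "");
        String[] ids = {
            "nutch-extensionpoints", "protocol-http", "protocol-httpclient", "parse-html"
        };
        for (String id : ids) {
            System.out.println(id
                + " full=" + isIncluded(INCLUDES, id)
                + " broken=" + isIncluded(broken, id));
        }
    }
}
```

With the full value, nutch-extensionpoints and protocol-http are included while protocol-httpclient is not, because the whole name has to match. With "nutch-extensionpoints|" removed, nutch-extensionpoints itself no longer matches, even though protocol-http still does.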

Posted 2008-07-10 11:38 by 自己的小屋

Nutch-Crawl: ArrayIndexOutOfBoundsException

When running a Nutch 0.9 crawl, the following exception sometimes appears:

java.lang.ArrayIndexOutOfBoundsException: -1
    at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:113)
    at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
    at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
    at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

The solution is described in this comment on the NUTCH-525 issue:

https://issues.apache.org/jira/browse/NUTCH-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515955

Posted 2008-07-10 11:32 by 自己的小屋
