Environment: Nutch 0.9 + Fedora 5 + Tomcat 6 + JDK 6
Tomcat and the JDK are already installed.
2: nutch-0.9.tar.gz
Unpack the downloaded tar.gz under /opt and rename the directory (later steps assume /opt/nutch):
#gunzip -c nutch-0.9.tar.gz | tar xf - -C /opt
#mv /opt/nutch-0.9 /opt/nutch
To verify the setup, run /opt/nutch/bin/nutch and check whether it prints its command usage; if it does, everything is fine.
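From memory, the no-argument output looks roughly like the following; treat it as a sketch rather than the literal text, since the exact command list varies by release:
#/opt/nutch/bin/nutch
Usage: nutch COMMAND
where COMMAND is one of:
  crawl  inject  generate  fetch  updatedb  invertlinks  index  ...
If it instead complains that JAVA_HOME is not set, export JAVA_HOME to point at your JDK before continuing.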
Crawl procedure: #cd /opt/nutch
#mkdir urls
#vi urls/nutch.txt and enter the seed URL, e.g. www.aicent.net (one URL per line)
#vi conf/crawl-urlfilter.txt and add the following; the regular expression restricts which URLs of the site get crawled:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*aicent.net/
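For context, the stock crawl-urlfilter.txt shipped with Nutch 0.9 already contains a series of - rules and ends with a "-." rule that rejects everything else, so the + line for your own domain has to sit before that final rule. Roughly, from memory and trimmed down (only the aicent.net line is specific to this setup, the suffix list here is abbreviated):
# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|css|zip|gz|exe)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*aicent.net/
# skip everything else
-.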
#vi conf/nutch-site.xml (give your spider a name), configured as follows:
<configuration>
<property>
<name>http.agent.name</name>
<value>test/unique</value>
</property>
</configuration>
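Only http.agent.name is strictly required, but the related agent properties defined in conf/nutch-default.xml can be added inside the same <configuration> block so the crawler identifies itself properly. A sketch with placeholder values (the property names are the standard ones; the values below are examples, not taken from this setup):
<property>
  <name>http.agent.description</name>
  <value>test crawler for learning Nutch</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://www.example.com/</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>admin@example.com</value>
</property>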
Start crawling: #bin/nutch crawl urls -dir crawl -depth 5 -threads 10 >& crawl.log
Wait a while; the time depends on the size of the site and the configured crawl depth.
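When the crawl finishes successfully, the crawl directory should contain roughly the following layout (names as produced by the one-step crawl command; the index directories only appear once indexing has completed):
crawl/crawldb    crawl database: the state of every known URL
crawl/linkdb     inverted link database
crawl/segments/  one segment per fetch round (up to 5 here, one per depth level)
crawl/indexes/   per-segment Lucene indexes
crawl/index/     merged index that the search web app reads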
3: apache-tomcat
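(The Nutch release ships a ready-built web application; I assume here that it has already been deployed, roughly like this, with paths as used elsewhere in these notes:
#cp /opt/nutch/nutch-0.9.war /usr/local/tomcat/webapps/ROOT.war    deploy as the ROOT app, or
#cp /opt/nutch/nutch-0.9.war /usr/local/tomcat/webapps/            deploy under /nutch-0.9, matching the URL used in problem 2 below
Tomcat unpacks the war on startup, and the nutch-site.xml edited in the next step lives inside that unpacked web app.)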
If every search here returns 0 pages, you need to fix a setting: the search directory that Nutch uses inside Tomcat points to the wrong place.
#vi /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml
<property>
<name>searcher.dir</name>
<value>/opt/nutch/crawl</value> <!-- the directory containing the crawled data -->
<description>My path to nutch's searcher dir.</description>
</property>
#/usr/local/tomcat/bin/startup.sh
OK, that's it.
Troubleshooting:
Run: sh ./bin/nutch crawl urls -dir crawl -depth 3 -threads 60 -topN 100 >& ./logs/nutch_log.log
1. Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
Searching online, some posts said this was a JDK version problem and that JDK 1.6 could not be used, so I installed 1.5. The same error remained, which was odd.
So I kept googling and found the following possibility:
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
Explanation: this is usually a configuration problem in crawl-urlfilter.txt; for example, a filter rule that should read +^http://www.ihooyo.com was written as http://www.ihooyo.com (the leading + missing), and that is exactly what triggers the error above.
But my own configuration had no such problem.
Besides nutch_log.log, another log file is automatically written to the logs directory: hadoop.log.
It contained the following error:
2009-07-22 22:20:55,501 INFO crawl.Crawl - crawl started in: crawl
2009-07-22 22:20:55,501 INFO crawl.Crawl - rootUrlDir = urls
2009-07-22 22:20:55,502 INFO crawl.Crawl - threads = 60
2009-07-22 22:20:55,502 INFO crawl.Crawl - depth = 3
2009-07-22 22:20:55,502 INFO crawl.Crawl - topN = 100
2009-07-22 22:20:55,603 INFO crawl.Injector - Injector: starting
2009-07-22 22:20:55,604 INFO crawl.Injector - Injector: crawlDb: crawl/crawldb
2009-07-22 22:20:55,604 INFO crawl.Injector - Injector: urlDir: urls
2009-07-22 22:20:55,605 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2009-07-22 22:20:56,574 INFO plugin.PluginRepository - Plugins: looking in: /opt/nutch/plugins
2009-07-22 22:20:56,720 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2009-07-22 22:20:56,720 INFO plugin.PluginRepository - Registered Plugins:
2009-07-22 22:20:56,720 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Site Query Filter (query-site)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - URL Query Filter (query-url)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Registered Extension-Points:
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Nutch Summarizer (org.apache.nutch.searcher.Summarizer)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
2009-07-22 22:20:56,721 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository - Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository - Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
2009-07-22 22:20:56,722 INFO plugin.PluginRepository - Ontology Model Loader (org.apache.nutch.ontology.Ontology)
2009-07-22 22:20:56,786 WARN regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2009-07-22 22:20:56,829 WARN mapred.LocalJobRunner - job_2319eh
java.lang.RuntimeException: java.net.UnknownHostException: jackliu: jackliu
at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:617)
at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:591)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:364)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:390)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.startPartition(MapTask.java:294)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:355)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$100(MapTask.java:231)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:180)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
Caused by: java.net.UnknownHostException: jackliu: jackliu
at java.net.InetAddress.getLocalHost(InetAddress.java:1353)
at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:614)
... 8 more
In other words, the hostname configuration is wrong, so:
Add the following to your /etc/hosts file
127.0.0.1 jackliu
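A quick way to confirm the fix (jackliu is this machine's hostname, taken from the error above; substitute whatever hostname your box reports):
#hostname              should print jackliu
#ping -c 1 jackliu     should now resolve to 127.0.0.1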
Running it again, it succeeded this time!
2. http://127.0.0.1:8080/nutch-0.9
Entered "nutch" as a search query and got an error:
HTTP Status 500 -
type Exception report
message
description The server encountered an internal error () that prevented it from fulfilling this request.
exception
org.apache.jasper.JasperException: /search.jsp(151,22) Attribute value language + "/include/header.html" is quoted with " which must be escaped when used within the value
org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:40)
org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.java:407)
org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.java:198)
org.apache.jasper.compiler.Parser.parseQuoted(Parser.java:299)
org.apache.jasper.compiler.Parser.parseAttributeValue(Parser.java:249)
org.apache.jasper.compiler.Parser.parseAttribute(Parser.java:211)
org.apache.jasper.compiler.Parser.parseAttributes(Parser.java:154)
org.apache.jasper.compiler.Parser.parseInclude(Parser.java:867)
org.apache.jasper.compiler.Parser.parseStandardAction(Parser.java:1134)
org.apache.jasper.compiler.Parser.parseElements(Parser.java:1461)
org.apache.jasper.compiler.Parser.parse(Parser.java:137)
org.apache.jasper.compiler.ParserController.doParse(ParserController.java:255)
org.apache.jasper.compiler.ParserController.parse(ParserController.java:103)
org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:170)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:332)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:312)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:299)
org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:586)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:317)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:342)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:267)
javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
note The full stack trace of the root cause is available in the Apache Tomcat/6.0.20 logs.
Analysis: looking at search.jsp in the root of the Nutch web application, this is a quote-matching problem.
<jsp:include page="<%= language + "/include/header.html"%>"/> //line 152 search.jsp
The first double quote gets paired with the next double quote that appears (the one opening "/include/header.html"), not with the last quote on the line, so the attribute value is cut short and the page fails to compile.
Fix:
Change that line to: <jsp:include page="<%= language + urlsuffix %>"/>
Here we define a String urlsuffix, placed right after the definition of the language String:
String language = // line 116 search.jsp
ResourceBundle.getBundle("org.nutch.jsp.search", request.getLocale())
.getLocale().getLanguage();
String urlsuffix="/include/header.html";
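Putting the two edits together, only the include line changes (urlsuffix is the helper string defined above):
before: <jsp:include page="<%= language + "/include/header.html"%>"/>
after:  <jsp:include page="<%= language + urlsuffix %>"/>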
After making the change, restart Tomcat to be sure it takes effect; searching no longer throws the error.
3. No search results?
Comparing nutch_log.log with what others posted online, the output was different, and the crawl directory contained only two folders, segments and crawldb. I crawled again and this time everything was fine; oddly, I don't know why the first crawl failed.
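If it happens again, two quick checks from /opt/nutch are cheaper than a full re-crawl (readdb is a standard Nutch subcommand; crawl is the -dir used above):
#bin/nutch readdb crawl/crawldb -stats    prints how many URLs are in the db and how many were fetched
#ls crawl                                 a usable crawl should also contain linkdb, indexes and index, not just crawldb and segments
If crawl/index is missing, the search page will return 0 hits no matter what searcher.dir points to.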
4. cached.jsp, explain.jsp and the other JSPs have the same quoting error as in problem 2 above; apply the same fix and they work.
5. It took a whole morning and half an afternoon, but Nutch is finally installed and configured. More to learn tomorrow.