Can't get Apache Nutch to crawl – permissions and JAVA_HOME suspected
I am trying to run a basic crawl following the NutchTutorial:

bin/nutch crawl urls -dir crawl -depth 3 -topN 5
I have Nutch installed, along with Solr. In my .bashrc I set JAVA_HOME to /usr/lib/jvm/java-1.6.0-openjdk-amd64.

I see no problems when I run bin/nutch from the Nutch home directory, but when I try to run the crawl above I get the following error:
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: /usr/share/nutch/logs/hadoop.log (Permission denied)
        at java.io.FileOutputStream.openAppend(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:207)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:131)
        at org.apache.log4j.FileAppender.setFile(FileAppender.java:290)
        at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:164)
        at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:216)
        at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:257)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:133)
        at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:97)
        at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:689)
        at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:647)
        at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:544)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:440)
        at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:476)
        at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:471)
        at org.apache.log4j.LogManager.<clinit>(LogManager.java:125)
        at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:73)
        at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:270)
        at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:281)
        at org.apache.nutch.crawl.Crawl.<clinit>(Crawl.java:43)
log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 5
Injector: starting at 2013-06-28 16:24:53
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:296)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:132)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
I suspect it may have something to do with file permissions, since I have to use sudo for almost everything on this server, but if I run the same crawl command with sudo I get:
Error: JAVA_HOME is not set.
So I feel like I'm in a catch-22 here. Should I be able to run this command with sudo, is there something else I need to do so that I don't have to run it with sudo and it will work, or is something else entirely going on here?
It appears that, as a regular user, you do not have permission to write to /usr/share/nutch/logs/hadoop.log, which makes sense as a security feature.
To work around this, create a simple bash script:

#!/bin/sh
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
Save it as nutch.sh, then run it with sudo:

sudo sh nutch.sh
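The two steps above (write the script, run it with sudo) can also be done non-interactively; in this sketch the wrapper is generated with a heredoc and made executable. The paths are the ones from the question, and it assumes you are in the Nutch home directory so that bin/nutch resolves:

```shell
# Generate the wrapper script. The quoted 'EOF' prevents variable
# expansion, so the script contains the literal export line.
cat > nutch.sh <<'EOF'
#!/bin/sh
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64
exec bin/nutch crawl urls -dir crawl -depth 3 -topN 5
EOF

# Make it executable so it can also be run as: sudo ./nutch.sh
chmod +x nutch.sh
```

Because the export happens inside the script, JAVA_HOME is set in the same process that launches Nutch, so sudo's environment stripping no longer matters.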
The key to solving this is to add the JAVA_HOME variable to your sudo environment. For example, type env and then sudo env, and you will see that JAVA_HOME is not set for sudo. To fix this, you need to add the path.
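You can see the same effect without sudo by running a child shell with an emptied environment via env -i; this is only a stand-in for sudo's default env_reset behaviour, not sudo itself:

```shell
# The variable is visible in the current shell:
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64
echo "current shell: $JAVA_HOME"

# env -i strips the environment, much like sudo's env_reset does,
# so JAVA_HOME is gone in the child shell:
env -i sh -c 'echo "stripped env: ${JAVA_HOME:-<unset>}"'
# → stripped env: <unset>
```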
- Run sudo visudo to edit the /etc/sudoers file. (Do not use a standard text editor; this special vi editor will validate the syntax before allowing you to save.)
- Add this line:

Defaults env_keep+="JAVA_HOME"

at the end of the Defaults env_keep section.
- Reboot.
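To double-check the exact syntax of the line before committing it, you can write it to a scratch file first; this sketch deliberately touches a temporary file rather than /etc/sudoers itself, since on a real system the edit must go through sudo visudo as described above:

```shell
# Write the directive to a scratch file to verify the exact text.
# Never edit /etc/sudoers directly -- always use `sudo visudo`.
echo 'Defaults env_keep+="JAVA_HOME"' > sudoers_fragment

# The single line sudo needs in order to preserve JAVA_HOME:
cat sudoers_fragment
# → Defaults env_keep+="JAVA_HOME"
```

The += form appends to the existing env_keep list instead of replacing it, which is why the line belongs at the end of the Defaults env_keep section.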