Can't get Apache Nutch to crawl – permissions and JAVA_HOME suspected

I'm trying to run a basic crawl following the NutchTutorial:

bin/nutch crawl urls -dir crawl -depth 3 -topN 5 

So I have Nutch installed, and Solr installed. In my .bashrc I set $JAVA_HOME to /usr/lib/jvm/java-1.6.0-openjdk-amd64.
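That is, my .bashrc contains the line:

 export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64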

When I run bin/nutch from the Nutch home directory I don't see any problems, but when I try to run the crawl above I get the following error:

 log4j:ERROR setFile(null,true) call failed.
 java.io.FileNotFoundException: /usr/share/nutch/logs/hadoop.log (Permission denied)
     at java.io.FileOutputStream.openAppend(Native Method)
     at java.io.FileOutputStream.<init>(FileOutputStream.java:207)
     at java.io.FileOutputStream.<init>(FileOutputStream.java:131)
     at org.apache.log4j.FileAppender.setFile(FileAppender.java:290)
     at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:164)
     at org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:216)
     at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:257)
     at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:133)
     at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:97)
     at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:689)
     at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:647)
     at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:544)
     at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:440)
     at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:476)
     at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:471)
     at org.apache.log4j.LogManager.<clinit>(LogManager.java:125)
     at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:73)
     at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:270)
     at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:281)
     at org.apache.nutch.crawl.Crawl.<clinit>(Crawl.java:43)
 log4j:ERROR Either File or DatePattern options are not set for appender [DRFA].
 solrUrl is not set, indexing will be skipped...
 crawl started in: crawl
 rootUrlDir = urls
 threads = 10
 depth = 3
 solrUrl=null
 topN = 5
 Injector: starting at 2013-06-28 16:24:53
 Injector: crawlDb: crawl/crawldb
 Injector: urlDir: urls
 Injector: Converting injected urls to crawl db entries.
 Injector: total number of urls rejected by filters: 0
 Injector: total number of urls injected after normalization and filtering: 1
 Injector: Merging injected urls into crawl db.
 Exception in thread "main" java.io.IOException: Job failed!
     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
     at org.apache.nutch.crawl.Injector.inject(Injector.java:296)
     at org.apache.nutch.crawl.Crawl.run(Crawl.java:132)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

I suspect it may have something to do with file permissions, since I have to use sudo for almost everything on this server, but if I run the same crawl command with sudo I get:

 Error: JAVA_HOME is not set. 

So I feel like I'm caught in a catch-22 here. Should I be able to run this command with sudo, is there something else I need to do so that I don't have to run it with sudo and it will just work, or is there something else entirely going on here?

It appears that, as a regular user, you don't have permission to write to /usr/share/nutch/logs/hadoop.log, which makes sense as a security feature.
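To confirm, you can check who owns the log directory; and if you would rather not run Nutch through sudo at all, giving your own user write access to it is one alternative (a sketch, assuming the default install path shown in the error above):

 # check ownership and permissions of the log directory
 ls -ld /usr/share/nutch/logs
 # optional alternative to sudo: let your own user write the logs
 sudo chown -R $USER /usr/share/nutch/logs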

To get around this, create a simple bash script:

 #!/bin/sh
 export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64
 bin/nutch crawl urls -dir crawl -depth 3 -topN 5

Save it as nutch.sh, then run it with sudo:

 sudo sh nutch.sh 
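Depending on your sudoers policy (on default setups where your user may run any command, sudo accepts environment variables on the command line), you may also be able to skip the script and pass the variable through for a one-off run:

 sudo JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-amd64 bin/nutch crawl urls -dir crawl -depth 3 -topN 5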

The key to solving this is adding the JAVA_HOME variable to your sudo environment. For example, compare the output of env and sudo env, and you will see that JAVA_HOME is not set for sudo. To fix this, you need to add the path.
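A quick way to see the difference (exact output varies by machine):

 env | grep JAVA_HOME        # prints the value set in .bashrc
 sudo env | grep JAVA_HOME   # prints nothing, because sudo resets the environment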

  1. Run sudo visudo to edit the /etc/sudoers file. (Don't use a standard text editor; this special version of vi validates the syntax before letting you save.)
  2. Add this line:

     Defaults env_keep+="JAVA_HOME" 

    at the end of the Defaults env_keep section.

  3. Reboot.
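After rebooting, you can confirm that the variable now survives sudo with a quick check:

 sudo env | grep JAVA_HOME

If it prints the path, the original crawl command should now run under sudo without the JAVA_HOME error.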