ClassNotFoundException when running GiraphRunner on a modified SimpleShortestPathsVertex

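I'm fairly new to Giraph, and I'm trying to get a Giraph edit-compile-deploy loop working for our code. I can run various examples inspired by http://blog.cloudera.com/blog/2014/02/how-to-write-and-run-giraph-jobs-on-hadoop/, but I keep getting a ClassNotFoundException when I run a modified version of the SimpleShortestPathsVertex Giraph example. I've tried various combinations of -libjars and HADOOP_CLASSPATH, but I'm out of ideas and would really appreciate your help. Details follow.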

Versions

  • Hadoop: Hadoop 2.0.0-cdh4.4.0
  • Giraph: giraph-examples-1.0.0-for-hadoop-2.0.0-alpha-jar-with-dependencies.jar

PageRankBenchmark runs fine

    $ hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.0.0-for-hadoop-2.0.0-alpha-jar-with-dependencies.jar \
        org.apache.giraph.benchmark.PageRankBenchmark \
        -Dgiraph.zkList=:2181 \
        -e 1 -s 3 -v -V 50 -w 1
    ...
    14/08/01 11:42:44 INFO mapred.JobClient: Job complete: job_201407291058_0015
    ...
    (full output is below)

GiraphRunner with SimpleShortestPathsVertex also runs fine

    $ hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.0.0-for-hadoop-2.0.0-alpha-jar-with-dependencies.jar \
        org.apache.giraph.GiraphRunner \
        -Dgiraph.zkList=:2181 \
        org.apache.giraph.examples.SimpleShortestPathsVertex \
        -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
        -vip ginput/tiny_graph.txt \
        -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
        -op goutput/shortestpathsC2 \
        -ca SimpleShortestPathsVertex.source=2 \
        -w 1
    ...
    14/08/01 11:47:46 INFO mapred.JobClient: Job complete: job_201407291058_0017
    ...
    (full output is below)

Bonus: the results are correct:

    $ hadoop fs -cat goutput/shortestpathsC2/p*
    0 1.0
    2 2.0
    1 0.0
    3 1.0
    4 5.0

But my modified version of SimpleShortestPathsVertex gets a ClassNotFoundException

The jar containing the modified vertex (KdlSimpleShortestPathsVertex, no package) looks fine:

    $ jar -tf ~/kdl_hadoop_play.jar
    META-INF/MANIFEST.MF
    KdlSimpleShortestPathsVertex.class
    META-INF/

But my run fails:

    $ hadoop jar $GIRAPH_HOME/giraph-core/target/giraph-1.0.0-for-hadoop-2.0.0-alpha-jar-with-dependencies.jar \
        org.apache.giraph.GiraphRunner \
        -Dgiraph.zkList=:2181 \
        -libjars ~/kdl_hadoop_play.jar \
        KdlSimpleShortestPathsVertex \
        -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
        -vip /user/cornell/ginput/tiny_graph.txt \
        -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
        -op /user/cornell/goutput/shortestpathsC2 \
        -ca KdlSimpleShortestPathsVertex.source=2 \
        -w 1
    Exception in thread "main" java.lang.ClassNotFoundException: KdlSimpleShortestPathsVertex
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:190)
        at org.apache.giraph.utils.ConfigurationUtils.populateGiraphConfiguration(ConfigurationUtils.java:210)
        at org.apache.giraph.utils.ConfigurationUtils.parseArgs(ConfigurationUtils.java:147)
        at org.apache.giraph.GiraphRunner.run(GiraphRunner.java:74)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
        at org.apache.giraph.GiraphRunner.main(GiraphRunner.java:124)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

My best guess…

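…after looking around, is that GiraphRunner may not handle -libjars correctly, as hinted at by http://grepalex.com/2013/02/25/hadoop-libjars/ ("make sure your code is using GenericOptionsParser"). Browsing the Giraph source, I can't see that class being used anywhere. I tried setting HADOOP_CLASSPATH to my jar, but that didn't fix the problem.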

Any help would be great!

PageRankBenchmark output

    14/08/01 11:42:27 INFO job.GiraphJob: run: Since checkpointing is disabled (default), do not allow any task retries (setting mapred.map.max.attempts = 0, old value = 4)
    14/08/01 11:42:28 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    14/08/01 11:42:28 WARN bsp.BspOutputFormat: checkOutputSpecs: ImmutableOutputCommiter will not check anything
    14/08/01 11:42:29 INFO mapred.JobClient: Running job: job_201407291058_0015
    14/08/01 11:42:30 INFO mapred.JobClient: map 0% reduce 0%
    14/08/01 11:42:40 INFO mapred.JobClient: map 50% reduce 0%
    14/08/01 11:42:41 INFO mapred.JobClient: map 100% reduce 0%
    14/08/01 11:42:44 INFO mapred.JobClient: Job complete: job_201407291058_0015
    14/08/01 11:42:44 INFO mapred.JobClient: Counters: 39
    14/08/01 11:42:44 INFO mapred.JobClient: File System Counters
    14/08/01 11:42:44 INFO mapred.JobClient: FILE: Number of bytes read=0
    14/08/01 11:42:44 INFO mapred.JobClient: FILE: Number of bytes written=369846
    14/08/01 11:42:44 INFO mapred.JobClient: FILE: Number of read operations=0
    14/08/01 11:42:44 INFO mapred.JobClient: FILE: Number of large read operations=0
    14/08/01 11:42:44 INFO mapred.JobClient: FILE: Number of write operations=0
    14/08/01 11:42:44 INFO mapred.JobClient: HDFS: Number of bytes read=88
    14/08/01 11:42:44 INFO mapred.JobClient: HDFS: Number of bytes written=0
    14/08/01 11:42:44 INFO mapred.JobClient: HDFS: Number of read operations=2
    14/08/01 11:42:44 INFO mapred.JobClient: HDFS: Number of large read operations=0
    14/08/01 11:42:44 INFO mapred.JobClient: HDFS: Number of write operations=1
    14/08/01 11:42:44 INFO mapred.JobClient: Job Counters
    14/08/01 11:42:44 INFO mapred.JobClient: Launched map tasks=2
    14/08/01 11:42:44 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=15772
    14/08/01 11:42:44 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
    14/08/01 11:42:44 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
    14/08/01 11:42:44 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
    14/08/01 11:42:44 INFO mapred.JobClient: Map-Reduce Framework
    14/08/01 11:42:44 INFO mapred.JobClient: Map input records=2
    14/08/01 11:42:44 INFO mapred.JobClient: Map output records=0
    14/08/01 11:42:44 INFO mapred.JobClient: Input split bytes=88
    14/08/01 11:42:44 INFO mapred.JobClient: Spilled Records=0
    14/08/01 11:42:44 INFO mapred.JobClient: CPU time spent (ms)=2230
    14/08/01 11:42:44 INFO mapred.JobClient: Physical memory (bytes) snapshot=411357184
    14/08/01 11:42:44 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2428895232
    14/08/01 11:42:44 INFO mapred.JobClient: Total committed heap usage (bytes)=806027264
    14/08/01 11:42:44 INFO mapred.JobClient: Giraph Stats
    14/08/01 11:42:44 INFO mapred.JobClient: Aggregate edges=50
    14/08/01 11:42:44 INFO mapred.JobClient: Aggregate finished vertices=50
    14/08/01 11:42:44 INFO mapred.JobClient: Aggregate vertices=50
    14/08/01 11:42:44 INFO mapred.JobClient: Current master task partition=0
    14/08/01 11:42:44 INFO mapred.JobClient: Current workers=1
    14/08/01 11:42:44 INFO mapred.JobClient: Last checkpointed superstep=0
    14/08/01 11:42:44 INFO mapred.JobClient: Sent messages=0
    14/08/01 11:42:44 INFO mapred.JobClient: Superstep=4
    14/08/01 11:42:44 INFO mapred.JobClient: Giraph Timers
    14/08/01 11:42:44 INFO mapred.JobClient: Input superstep (milliseconds)=238
    14/08/01 11:42:44 INFO mapred.JobClient: Setup (milliseconds)=2903
    14/08/01 11:42:44 INFO mapred.JobClient: Shutdown (milliseconds)=68
    14/08/01 11:42:44 INFO mapred.JobClient: Superstep 0 (milliseconds)=77
    14/08/01 11:42:44 INFO mapred.JobClient: Superstep 1 (milliseconds)=64
    14/08/01 11:42:44 INFO mapred.JobClient: Superstep 2 (milliseconds)=45
    14/08/01 11:42:44 INFO mapred.JobClient: Superstep 3 (milliseconds)=43
    14/08/01 11:42:44 INFO mapred.JobClient: Total (milliseconds)=3442

SimpleShortestPathsVertex output

    14/08/01 11:47:37 INFO utils.ConfigurationUtils: No edge input format specified. Ensure your InputFormat does not require one.
    14/08/01 11:47:37 INFO utils.ConfigurationUtils: Setting custom argument [SimpleShortestPathsVertex.source] to [2] in GiraphConfiguration
    14/08/01 11:47:37 WARN job.GiraphConfigurationValidator: Output format vertex index type is not known
    14/08/01 11:47:37 WARN job.GiraphConfigurationValidator: Output format vertex value type is not known
    14/08/01 11:47:37 WARN job.GiraphConfigurationValidator: Output format edge value type is not known
    14/08/01 11:47:37 INFO job.GiraphJob: run: Since checkpointing is disabled (default), do not allow any task retries (setting mapred.map.max.attempts = 0, old value = 4)
    14/08/01 11:47:37 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    14/08/01 11:47:38 INFO mapred.JobClient: Running job: job_201407291058_0017
    14/08/01 11:47:39 INFO mapred.JobClient: map 0% reduce 0%
    14/08/01 11:47:44 INFO mapred.JobClient: map 50% reduce 0%
    14/08/01 11:47:45 INFO mapred.JobClient: map 100% reduce 0%
    14/08/01 11:47:46 INFO mapred.JobClient: Job complete: job_201407291058_0017
    14/08/01 11:47:46 INFO mapred.JobClient: Counters: 39
    14/08/01 11:47:46 INFO mapred.JobClient: File System Counters
    14/08/01 11:47:46 INFO mapred.JobClient: FILE: Number of bytes read=0
    14/08/01 11:47:46 INFO mapred.JobClient: FILE: Number of bytes written=367068
    14/08/01 11:47:46 INFO mapred.JobClient: FILE: Number of read operations=0
    14/08/01 11:47:46 INFO mapred.JobClient: FILE: Number of large read operations=0
    14/08/01 11:47:46 INFO mapred.JobClient: FILE: Number of write operations=0
    14/08/01 11:47:46 INFO mapred.JobClient: HDFS: Number of bytes read=200
    14/08/01 11:47:46 INFO mapred.JobClient: HDFS: Number of bytes written=30
    14/08/01 11:47:46 INFO mapred.JobClient: HDFS: Number of read operations=5
    14/08/01 11:47:46 INFO mapred.JobClient: HDFS: Number of large read operations=0
    14/08/01 11:47:46 INFO mapred.JobClient: HDFS: Number of write operations=2
    14/08/01 11:47:46 INFO mapred.JobClient: Job Counters
    14/08/01 11:47:46 INFO mapred.JobClient: Launched map tasks=2
    14/08/01 11:47:46 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=8538
    14/08/01 11:47:46 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
    14/08/01 11:47:46 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
    14/08/01 11:47:46 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
    14/08/01 11:47:46 INFO mapred.JobClient: Map-Reduce Framework
    14/08/01 11:47:46 INFO mapred.JobClient: Map input records=2
    14/08/01 11:47:46 INFO mapred.JobClient: Map output records=0
    14/08/01 11:47:46 INFO mapred.JobClient: Input split bytes=88
    14/08/01 11:47:46 INFO mapred.JobClient: Spilled Records=0
    14/08/01 11:47:46 INFO mapred.JobClient: CPU time spent (ms)=1590
    14/08/01 11:47:46 INFO mapred.JobClient: Physical memory (bytes) snapshot=341344256
    14/08/01 11:47:46 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2363527168
    14/08/01 11:47:46 INFO mapred.JobClient: Total committed heap usage (bytes)=504758272
    14/08/01 11:47:46 INFO mapred.JobClient: Giraph Stats
    14/08/01 11:47:46 INFO mapred.JobClient: Aggregate edges=12
    14/08/01 11:47:46 INFO mapred.JobClient: Aggregate finished vertices=5
    14/08/01 11:47:46 INFO mapred.JobClient: Aggregate vertices=5
    14/08/01 11:47:46 INFO mapred.JobClient: Current master task partition=0
    14/08/01 11:47:46 INFO mapred.JobClient: Current workers=1
    14/08/01 11:47:46 INFO mapred.JobClient: Last checkpointed superstep=0
    14/08/01 11:47:46 INFO mapred.JobClient: Sent messages=0
    14/08/01 11:47:46 INFO mapred.JobClient: Superstep=4
    14/08/01 11:47:46 INFO mapred.JobClient: Giraph Timers
    14/08/01 11:47:46 INFO mapred.JobClient: Input superstep (milliseconds)=181
    14/08/01 11:47:46 INFO mapred.JobClient: Setup (milliseconds)=313
    14/08/01 11:47:46 INFO mapred.JobClient: Shutdown (milliseconds)=128
    14/08/01 11:47:46 INFO mapred.JobClient: Superstep 0 (milliseconds)=57
    14/08/01 11:47:46 INFO mapred.JobClient: Superstep 1 (milliseconds)=54
    14/08/01 11:47:46 INFO mapred.JobClient: Superstep 2 (milliseconds)=36
    14/08/01 11:47:46 INFO mapred.JobClient: Superstep 3 (milliseconds)=35
    14/08/01 11:47:46 INFO mapred.JobClient: Total (milliseconds)=805

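OK, after reading the hadoop script and the Hadoop and Giraph source code, I think I've figured it out. The big hints came from "Using the libjars option with Hadoop" and this line in the output: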

WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

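The cause seems to be that GiraphRunner uses its own ConfigurationUtils.parseArgs() to obtain an org.apache.commons.cli.CommandLine, instead of using the recommended org.apache.hadoop.util.GenericOptionsParser.getCommandLine(), which is what honors the -libjars option. That brought me back to Hadoop's general classpath mechanisms: CLASSPATH and/or HADOOP_CLASSPATH. Here is what works: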

  • Set HADOOP_CLASSPATH, using a colon separator, to include the application jar and the Giraph core jar.
  • Pass -libjars with the same classpath, but using a comma separator.
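
Because the two settings hold the same jars and differ only in the separator, the comma-separated list can be derived from the colon-separated one rather than typed twice. The following is a minimal sketch of that idea; the variable name APP_JARS and both jar paths are hypothetical placeholders, not paths from the original post.

```shell
# Hypothetical jar paths, for illustration only; substitute your own.
APP_JARS="/opt/example/app.jar:/opt/example/giraph-core.jar"

# HADOOP_CLASSPATH takes the colon-separated form, prepended to any existing value.
HADOOP_CLASSPATH="$APP_JARS${HADOOP_CLASSPATH:+:$HADOOP_CLASSPATH}"
export HADOOP_CLASSPATH

# -libjars takes the same jars, but comma-separated.
LIBJARS=$(printf '%s' "$APP_JARS" | tr ':' ',')
echo "$LIBJARS"
```

Deriving one value from the other keeps the two lists from drifting apart when a jar is added or renamed.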

For example, on my machine:

    $ export GIRAPH_HOME=/share/apps/giraph
    $ export HADOOP_CLASSPATH=/home//kdl_hadoop_play.jar:$GIRAPH_HOME/giraph-ex.jar:$HADOOP_CLASSPATH
    $ export LIBJARS=/home//kdl_hadoop_play.jar,$GIRAPH_HOME/giraph-core.jar
    $ hadoop fs -rm -R goutput/shortestpathsC2
    $ hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
        -Dgiraph.zkList=:2181 \
        -libjars ${LIBJARS} \
        KdlSimpleShortestPathsVertex \
        -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
        -vip /user/cornell/ginput/tiny_graph.txt \
        -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
        -op /user/cornell/goutput/shortestpathsC2 \
        -ca SimpleShortestPathsVertex.source=2 \
        -w 1
    ...
    $ hadoop fs -cat goutput/shortestpathsC2/p*

This gives the expected output and results.
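
If a run like this still ends in ClassNotFoundException, one cheap thing to rule out is a classpath entry that simply does not exist on disk (a typo in a path, or an unexpanded variable). The helper below is a sketch for that check only; the function name missing_classpath_entries is made up for this example, and it does not inspect what is inside the jars.

```shell
# Print each entry of a colon-separated classpath that does not exist on disk.
# The function name is hypothetical; it is not part of Hadoop or Giraph.
missing_classpath_entries() {
  printf '%s\n' "$1" | tr ':' '\n' | while IFS= read -r entry; do
    case "$entry" in
      # Skip empty entries and wildcard entries such as /some/lib/*,
      # which the JVM expands by itself.
      '' | *'*') continue ;;
    esac
    [ -e "$entry" ] || printf '%s\n' "$entry"
  done
}

# Usage: check whatever HADOOP_CLASSPATH is set to in the current shell.
missing_classpath_entries "${HADOOP_CLASSPATH:-}"
```

No output means every non-wildcard entry exists; any path that is printed is worth fixing before looking further.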

More generally, it would be helpful if the Giraph team changed the code to use the (apparently) more standard parser.

Hope that helps!

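I don't know why this isn't working, but there is a quick-and-dirty way around it. Try putting your code in the giraph-examples/src/main/java/org/apache/giraph/examples/ directory (where SimpleShortestPathsVertex lives). Then build the giraph-examples jar by running mvn -DskipTests --projects giraph-examples --also-make package. Then just run your program the same way you ran SimpleShortestPathsVertex, replacing SimpleShortestPathsVertex with your own class name. I hope that helps.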
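
One detail to watch with this approach: GiraphRunner is given a class name, not a file name, so a class that lives in that directory is normally referred to by its fully qualified name. The sketch below derives that name from the source path. It assumes the copied file is given a package declaration of org.apache.giraph.examples (the class in the question had no package), and the variable names are hypothetical.

```shell
# Hypothetical location of the copied source file inside a Giraph checkout.
src="giraph-examples/src/main/java/org/apache/giraph/examples/KdlSimpleShortestPathsVertex.java"

# Strip everything up to the Maven source root, then the .java suffix.
rel="${src#*src/main/java/}"
rel="${rel%.java}"

# Turn path separators into package separators.
fqcn=$(printf '%s' "$rel" | tr '/' '.')
echo "$fqcn"
```

The resulting name is what would take the place of org.apache.giraph.examples.SimpleShortestPathsVertex in the GiraphRunner command line.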