如何构建/运行这个简单的Mahout程序而不会出现exception?

我想运行我在Mahout In Action中找到的代码:

package org.help; import java.io.IOException; import java.util.ArrayList; import java.util.List; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.SequenceFile; import org.apache.hadoop.io.Text; import org.apache.mahout.math.DenseVector; import org.apache.mahout.math.NamedVector; import org.apache.mahout.math.VectorWritable; public class SeqPrep { public static void main(String args[]) throws IOException{ List apples = new ArrayList(); NamedVector apple; apple = new NamedVector(new DenseVector(new double[]{0.11, 510, 1}), "small round green apple"); apples.add(apple); Configuration conf = new Configuration(); FileSystem fs = FileSystem.get(conf); Path path = new Path("appledata/apples"); SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class); VectorWritable vec = new VectorWritable(); for(NamedVector vector : apples){ vec.set(vector); writer.append(new Text(vector.getName()), vec); } writer.close(); SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path("appledata/apples"), conf); Text key = new Text(); VectorWritable value = new VectorWritable(); while(reader.next(key, value)){ System.out.println(key.toString() + " , " + value.get().asFormatString()); } reader.close(); } } 

我编译它:

 $ javac -classpath :/usr/local/hadoop-1.0.3/hadoop-core-1.0.3.jar:/home/hduser/mahout/trunk/core/target/mahout-core-0.8-SNAPSHOT.jar:/home/hduser/mahout/trunk/core/target/mahout-core-0.8-SNAPSHOT-job.jar:/home/hduser/mahout/trunk/core/target/mahout-core-0.8-SNAPSHOT-sources.jar -d myjavac/ SeqPrep.java 

我把它:

 $ jar -cvf SeqPrep.jar -C myjavac/ . 

现在我想在我的本地hadoop节点上运行它。 我试过了:

  hadoop jar SeqPrep.jar org.help.SeqPrep 

但我得到:

 Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/mahout/math/Vector at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:247) at org.apache.hadoop.util.RunJar.main(RunJar.java:149) 

所以我尝试使用libjars参数:

 $ hadoop jar SeqPrep.jar org.help.SeqPrep -libjars /home/hduser/mahout/trunk/core/target/mahout-core-0.8-SNAPSHOT.jar -libjars /home/hduser/mahout/trunk/core/target/mahout-core-0.8-SNAPSHOT-job.jar -libjars /home/hduser/mahout/trunk/core/target/mahout-core-0.8-SNAPSHOT-sources.jar -libjars /home/hduser/mahout/trunk/math/target/mahout-math-0.8-SNAPSHOT.jar -libjars /home/hduser/mahout/trunk/math/target/mahout-math-0.8-SNAPSHOT-sources.jar 

并得到了同样的问题。 我不知道还有什么可以尝试的。

我最终的目标是能够将hadoop fs上的.csv文件读入稀疏矩阵,然后将其乘以随机向量。

编辑:看起来像Razvan得到它(注意:请参阅下面的另一种方法,这样做不会弄乱你的hadoop安装)。 以供参考:

 $ find /usr/local/hadoop-1.0.3/. |grep mah /usr/local/hadoop-1.0.3/./lib/mahout-core-0.8-SNAPSHOT-tests.jar /usr/local/hadoop-1.0.3/./lib/mahout-core-0.8-SNAPSHOT.jar /usr/local/hadoop-1.0.3/./lib/mahout-core-0.8-SNAPSHOT-job.jar /usr/local/hadoop-1.0.3/./lib/mahout-core-0.8-SNAPSHOT-sources.jar /usr/local/hadoop-1.0.3/./lib/mahout-math-0.8-SNAPSHOT-sources.jar /usr/local/hadoop-1.0.3/./lib/mahout-math-0.8-SNAPSHOT-tests.jar /usr/local/hadoop-1.0.3/./lib/mahout-math-0.8-SNAPSHOT.jar 

接着:

 $hadoop jar SeqPrep.jar org.help.SeqPrep small round green apple , small round green apple:{0:0.11,1:510.0,2:1.0} 

编辑:我试图这样做而不将mahoutjar子复制到hadoop lib /

 $ rm /usr/local/hadoop-1.0.3/lib/mahout-* 

然后当然:

 hadoop jar SeqPrep.jar org.help.SeqPrep Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/mahout/math/Vector at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:247) at org.apache.hadoop.util.RunJar.main(RunJar.java:149) Caused by: java.lang.ClassNotFoundException: org.apache.mahout.math.Vector at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) 

当我尝试mahout工作文件时:

 $hadoop jar ~/mahout/trunk/core/target/mahout-core-0.8-SNAPSHOT-job.jar org.help.SeqPrep Exception in thread "main" java.lang.ClassNotFoundException: org.help.SeqPrep at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:247) at org.apache.hadoop.util.RunJar.main(RunJar.java:149) 

如果我尝试包含我制作的.jar文件:

 $ hadoop jar ~/mahout/trunk/core/target/mahout-core-0.8-SNAPSHOT-job.jar SeqPrep.jar org.help.SeqPrep Exception in thread "main" java.lang.ClassNotFoundException: SeqPrep.jar 

编辑:显然我一次只能发送一个jar子到hadoop。 这意味着我需要将我制作的类添加到mahout核心作业文件中:

 ~/mahout/trunk/core/target$ cp mahout-core-0.8-SNAPSHOT-job.jar mahout-core-0.8-SNAPSHOT-job.jar_backup ~/mahout/trunk/core/target$ cp ~/workspace/seqprep/bin/org/help/SeqPrep.class . ~/mahout/trunk/core/target$ jar uf mahout-core-0.8-SNAPSHOT-job.jar SeqPrep.class 

接着:

 ~/mahout/trunk/core/target$ hadoop jar mahout-core-0.8-SNAPSHOT-job.jar org.help.SeqPrep Exception in thread "main" java.lang.ClassNotFoundException: org.help.SeqPrep 

编辑:好的,现在我可以做到这一点,而不会搞乱我的hadoop安装。 我在之前的编辑中更新了.jar错误。 它应该是:

 ~/mahout/trunk/core/target$ jar uf mahout-core-0.8-SNAPSHOT-job.jar org/help/SeqPrep.class 

然后:

 ~/mahout/trunk/core/target$ hadoop jar mahout-core-0.8-SNAPSHOT-job.jar org.help.SeqPrep small round green apple , small round green apple:{0:0.11,1:510.0,2:1.0} 

您需要使用Mahout提供的“job”JAR文件。 它打包了所有依赖项。 您还需要将类添加到其中。 这就是所有Mahout示例的工作原理。 你不应该把Mahout jar放在Hadoop lib中,因为那种在Hadoop中“安装”程序太深了。

如果您将从https://github.com/tdunning/MiA存储库中获取示例代码,那么它包含可用于Maven的pom.xml文件。 当您使用mvn package编译代码时,它将在target目录中创建mia-0.1-job.jar – 此存档包含除Hadoop之外的所有依赖项,因此您可以在Hadoop集群上运行它而不会出现问题

  org.apache.mahout mahout-math 0.7   org.apache.mahout mahout-collections 1.0  

我做的是用我的jar和所有mahout jar文件设置HADOOP_CLASSPATH,如下所示。

export HADOOP_CLASSPATH = / home / xxx / my.jar:/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/mahout/mahout-core-0.7-cdh4.3.0.jar :/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/mahout/mahout-core-0.7-cdh4.3.0-job.jar中:/ opt / Cloudera公司/包裹/ CDH -4.3.0-1.cdh4.3.0.p0.22 / LIB /象夫/亨利马乌-例子-0.7-cdh4.3.0.jar:/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0。 P0.22 / LIB /象夫/亨利马乌-例子-0.7-cdh4.3.0-job.jar:/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/mahout/mahout – 整合 – 0.7 cdh4.3.0.jar:/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/mahout/mahout-math-0.7-cdh4.3.0.jar

然后我就能运行hadoop com.mycompany.mahout.CSVtoVector iris / nb / iris1.csv iris / nb / data / iris.seq

因此,您必须在HADOOP_CLASSPATH中包含所有jar子和mahout jar,然后您可以运行您的类
hadoop