Renaming part files in Hadoop MapReduce

I have tried to use the MultipleOutputs class, following the example on the page http://hadoop.apache.org/docs/mapreduce/r0.21.0/api/index.html?org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html

Driver code

    Configuration conf = new Configuration();
    Job job = new Job(conf, "Wordcount");
    job.setJarByClass(WordCount.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Register the named output that the reducer writes to.
    MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class,
            Text.class, IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);

Reducer code

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        private MultipleOutputs<Text, IntWritable> mos;

        public void setup(Context context) {
            mos = new MultipleOutputs<Text, IntWritable>(context);
        }

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            // context.write(key, result);
            mos.write("text", key, result);
        }

        public void cleanup(Context context) {
            try {
                mos.close();
            } catch (IOException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            } catch (InterruptedException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
        }
    }

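With this, the reducer output is renamed to text-r-00000.

The issue is that I am also getting an empty part-r-00000 file. Is this the expected behaviour of MultipleOutputs, or is there a problem with my code? Please advise.

Another alternative I have tried is to iterate through my output folder using the FileSystem class and manually rename all files beginning with part.

What is the best way?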

    FileSystem hdfs = FileSystem.get(configuration);
    FileStatus fs[] = hdfs.listStatus(new Path(outputPath));
    for (FileStatus aFile : fs) {
        if (aFile.isDir()) {
            // delete all directories and sub-directories (if any) in the output directory
            hdfs.delete(aFile.getPath(), true);
        } else {
            if (aFile.getPath().getName().contains("_")) {
                // delete all log files and the _SUCCESS file in the output directory
                hdfs.delete(aFile.getPath(), true);
            } else {
                hdfs.rename(aFile.getPath(), new Path(myCustomName));
            }
        }
    }
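A caveat on the rename loop above: it passes the same myCustomName for every part file, so with more than one reducer all renames would target one path. A minimal sketch of deriving a distinct name per part file, by keeping the task-type and part-number suffix, might look like this. The class PartFileRenamer and the method renamedPartName are hypothetical helpers, not part of Hadoop, and the sketch only computes names; the actual move would still go through hdfs.rename.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PartFileRenamer {
    // Matches default MapReduce output names such as part-r-00000 or part-m-00012,
    // optionally followed by an extension such as .gz.
    private static final Pattern PART_FILE =
            Pattern.compile("^part-([mr])-(\\d+)(\\..+)?$");

    /**
     * Hypothetical helper: returns the new name for a part file,
     * or null if the given name does not look like a part file.
     */
    public static String renamedPartName(String fileName, String customBase) {
        Matcher m = PART_FILE.matcher(fileName);
        if (!m.matches()) {
            return null;
        }
        String extension = m.group(3) == null ? "" : m.group(3);
        // Keep the task type (m or r) and the part number so names stay unique.
        return customBase + "-" + m.group(1) + "-" + m.group(2) + extension;
    }

    public static void main(String[] args) {
        System.out.println(renamedPartName("part-r-00000", "wordcount"));
        System.out.println(renamedPartName("part-r-00001", "wordcount"));
        System.out.println(renamedPartName("_SUCCESS", "wordcount"));
    }
}
```

In the loop, the result could be used as the last path component of the rename target, skipping files for which the helper returns null.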

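Even if you are using MultipleOutputs, the default OutputFormat (I believe it is TextOutputFormat) is still being used, so it will initialize and create the part-r-xxxxx files that you are seeing.

They are empty because you are not calling context.write at all, since you are writing through MultipleOutputs. But that does not prevent them from being created during initialization.

To get rid of them, you need to define your OutputFormat to say that you are not expecting any output. You can do it this way: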

    job.setOutputFormatClass(NullOutputFormat.class);

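With that property set, this should ensure that your part files are never initialized at all, while you still get your output through MultipleOutputs.

You could also use LazyOutputFormat, which ensures that your output files are only created when/if there is some data, instead of initializing empty files. You could do it this way: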

    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;

    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

Note that in your Reducer you are using the prototype MultipleOutputs.write(String namedOutput, K key, V value), which writes to a default output path generated from your namedOutput, of the form {namedOutput}-(m|r)-{part-number}. If you want more control over your output filenames, you should use the prototype MultipleOutputs.write(String namedOutput, K key, V value, String baseOutputPath), which lets you generate filenames at runtime based on your keys/values.
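To make the naming pattern concrete, here is a small plain-Java sketch of how a name of the form {base}-(m|r)-{part-number} is put together, where the base is either the namedOutput or the baseOutputPath you pass in. It does not call Hadoop; the class OutputNameSketch and the method outputFileName are hypothetical names used only for illustration, and the "words/a" base path is an assumed example of a per-key path.

```java
import java.util.Locale;

public class OutputNameSketch {
    /**
     * Illustrative only: builds a name of the form {base}-(m|r)-{part-number},
     * with the part number zero-padded to five digits.
     */
    public static String outputFileName(String base, boolean isReduceTask, int partition) {
        String taskType = isReduceTask ? "r" : "m";
        return String.format(Locale.ROOT, "%s-%s-%05d", base, taskType, partition);
    }

    public static void main(String[] args) {
        // Named output "text" written by reducer 0: text-r-00000
        System.out.println(outputFileName("text", true, 0));
        // Assumed per-key base path, e.g. built from the first letter of the key:
        // words/a-r-00003
        System.out.println(outputFileName("words/a", true, 3));
    }
}
```

With the four-argument write, a base path containing a slash would place the file in a subdirectory of the job output directory, so the base path is where you encode anything derived from the key or value.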

All you need to do in the Driver class to change the base name of the output file is: job.getConfiguration().set("mapreduce.output.basename", "text"); This will result in your files being called "text-r-00000".
