
Question: Word count for a text file

Stack Overflow user

Asked 2015-08-02 10:31:18


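I want to analyze a text file with Hadoop MapReduce.

A CSV file is easy to analyze because its columns are separated by ','.

But a plain text file cannot be split into columns the way a CSV file can.

This is the format of the text file: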

2015-8-02

error2014 blahblahblahblah

2015-8-02

blahblahbalh error2014

I want output like this:

date      contents  sum of errors

2015-8-02  error2014  2

That is the analysis I want. How should I write the MapReduce program?

hadoop

mapreduce

1 Answer

Stack Overflow user

Answered 2015-08-02 15:37:31

Assuming you have a text file in the following format:

2015-8-02

error2014 blahblahblahblah

2015-8-02

blahblahbalh error2014

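You can use NLineInputFormat.

With NLineInputFormat, you can specify exactly how many lines of input go to each mapper.

In your case, you can give each mapper 2 lines: the date line and the message line that follows it.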
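To make "2 lines per mapper" concrete, here is a small Hadoop-free sketch of the grouping. It only mimics the effect and is not Hadoop's implementation; the class name `NLineSplitSketch` and the method `splitEveryNLines` are made up for illustration. It assumes the date and message lines alternate with no blank lines between them.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: groups the lines of a file into chunks of n lines,
// which is the effect NLineInputFormat has on how input is handed to map tasks.
class NLineSplitSketch {

    static List<List<String>> splitEveryNLines(List<String> lines, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += n) {
            // The last chunk may hold fewer than n lines.
            int end = Math.min(start + n, lines.size());
            splits.add(new ArrayList<>(lines.subList(start, end)));
        }
        return splits;
    }
}
```

Note that within each split, Hadoop still calls map() once per line, so a mapper that needs both lines of a pair has to remember the date line (for example in an instance field) until the message line arrives.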

Edit

Here is an example that uses NLineInputFormat:

Mapper class:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

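// Identity mapper: writes every input line back out unchanged.
public class MapperNLine extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key is the line's byte offset in the file, value is the line's text.
        context.write(key, value);
    }

}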

Driver class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Driver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {

        if (args.length != 2) {
            // Report usage errors on stderr and name the class that is actually run.
            System.err.println("Usage: Driver <input dir> <output dir>");
            return -1;
        }

        // Job.getInstance replaces the deprecated Job constructor.
        Job job = Job.getInstance(getConf());
        job.setJobName("NLineInputFormat example");
        job.setJarByClass(Driver.class);

        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        // Put 2 lines in each input split (sets mapreduce.input.lineinputformat.linespermap).
        NLineInputFormat.setNumLinesPerSplit(job, 2);

        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MapperNLine.class);
        // Map-only job: with zero reducers, mapper output is written straight to the output dir.
        job.setNumReduceTasks(0);

        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new Driver(), args);
        System.exit(exitCode);
    }
}

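You can then extract the date and the error from these lines. Once you have them, emit them as the key, either as a composite key or as a single concatenated string, with an IntWritable value of 1, just as the WordCount example does. The reducer class then simply sums the values for each key, again as in WordCount.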
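The extraction and summing steps could look roughly like the sketch below. It is plain Java with no Hadoop classes, so it only simulates the map and reduce logic; in a real job the body of `mapPair` would sit in Mapper.map and the summing in Reducer.reduce. The class name, method names, and the two regular expressions are assumptions for illustration: dates are assumed to look like 2015-8-02, error codes like error2014, and lines are assumed to alternate date, message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ErrorCountSketch {

    // Assumed formats, taken from the sample in the question.
    private static final Pattern DATE = Pattern.compile("\\d{4}-\\d{1,2}-\\d{1,2}");
    private static final Pattern ERROR = Pattern.compile("\\berror\\d+\\b");

    // "Map" step: for one date/message pair, emit a "date<TAB>error" key
    // for every error code found on the message line.
    static List<String> mapPair(String dateLine, String messageLine) {
        List<String> keys = new ArrayList<>();
        String date = dateLine.trim();
        if (!DATE.matcher(date).matches()) {
            return keys; // not a date line, so skip this pair
        }
        Matcher m = ERROR.matcher(messageLine);
        while (m.find()) {
            keys.add(date + "\t" + m.group());
        }
        return keys;
    }

    // "Reduce" step: add 1 per emitted key, as WordCount does per word.
    static Map<String, Integer> countErrors(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i + 1 < lines.size(); i += 2) {
            for (String key : mapPair(lines.get(i), lines.get(i + 1))) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Calling `ErrorCountSketch.countErrors` on the four sample lines yields a single entry whose key is the date and `error2014` joined by a tab, with a count of 2. Using a concatenated string key keeps the sketch short; a custom composite key type would be the alternative if the date and error need to be sorted or grouped separately.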

I hope this answers your question.


Original content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/31768048
