Friday, 27 February 2015

How can I split words by the second tab for MapReduce?

I'm running MapReduce jobs on some web data. (I'm new to MapReduce, so think classic WordCount-type stuff.) The input file looks like this: numbers, each followed by a tab:


3 2 2 4 2 2 2 3 3


While I understand how to get a classic 'word count' of the numbers, what I really want to do is evaluate the numbers in pairs, so the above would be read by the mapper as '3 2', '2 2', '2 4', '2 2', and so on. How do I do this? I suppose all that's necessary is to tweak the StringTokenizer to split on every second tab or something similar, but how would I do that? Is that even possible?


Here's the Java code I'm working with, which, as of now, is just the classic WordCount example in MapReduce:



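import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Nested inside the job's driver class, as in the standard WordCount example.
public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emits (token, 1) for every whitespace-separated token in the input line.
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}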
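
For reference, here is a minimal sketch of one way the pairing could be done in plain Java, outside Hadoop. It does not make StringTokenizer split on every second tab; it splits the line into tokens first and then joins neighbouring tokens. The class name PairSplitter and the method pairs are hypothetical helpers invented for this illustration, not part of Hadoop or the JDK. Because the example pairs in the question could be read either way, the step parameter is an assumption that covers both readings: step 2 gives non-overlapping pairs (a split on every second tab), and step 1 gives a sliding window of adjacent pairs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Hadoop or the JDK.
final class PairSplitter {

    // Groups whitespace-separated tokens into "a b" pairs.
    // step = 2: non-overlapping pairs (split on every second tab).
    // step = 1: sliding window over adjacent tokens.
    // A trailing token with no partner is dropped.
    static List<String> pairs(String line, int step) {
        if (step < 1) {
            throw new IllegalArgumentException("step must be at least 1");
        }
        List<String> out = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        // Same delimiters as the default StringTokenizer: runs of whitespace.
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i + 1 < tokens.length; i += step) {
            out.add(tokens[i] + " " + tokens[i + 1]);
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "3\t2\t2\t4\t2\t2\t2\t3\t3";
        System.out.println(pairs(line, 2)); // [3 2, 2 4, 2 2, 2 3]
        System.out.println(pairs(line, 1)); // [3 2, 2 2, 2 4, 4 2, 2 2, 2 2, 2 3, 3 3]
    }
}
```

Inside the mapper above, the while loop over the tokenizer would presumably be replaced by a loop over pairs(value.toString(), 2), calling word.set(pair) and context.write(word, one) for each pair. Note that this only pairs tokens within a single input line; pairs that straddle a line boundary would need separate handling.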
