public interface Mapper<K1,V1,K2,V2> extends JobConfigurable, Closeable
Maps are the individual tasks which transform input records into a intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.
The Hadoop Map-Reduce framework spawns one map task for each
InputSplit
generated by the InputFormat
for the job.
Mapper
implementations can access the JobConf
for the
job via the JobConfigurable.configure(JobConf)
and initialize
themselves. Similarly they can use the Closeable.close()
method for
de-initialization.
The framework then calls
map(Object, Object, OutputCollector, Reporter)
for each key/value pair in the InputSplit
for that task.
All intermediate values associated with a given output key are
subsequently grouped by the framework, and passed to a Reducer
to
determine the final output. Users can control the grouping by specifying
a Comparator
via
JobConf.setOutputKeyComparatorClass(Class)
.
The grouped Mapper
outputs are partitioned per
Reducer
. Users can control which keys (and hence records) go to
which Reducer
by implementing a custom Partitioner
.
Users can optionally specify a combiner
, via
JobConf.setCombinerClass(Class)
, to perform local aggregation of the
intermediate outputs, which helps to cut down the amount of data transferred
from the Mapper
to the Reducer
.
The intermediate, grouped outputs are always stored in
SequenceFile
s. Applications can specify if and how the intermediate
outputs are to be compressed and which CompressionCodec
s are to be
used via the JobConf
.
If the job has
zero
reduces then the output of the Mapper
is directly written
to the FileSystem
without grouping by keys.
Example:
public class MyMapper<K extends WritableComparable, V extends Writable> extends MapReduceBase implements Mapper<K, V, K, V> { static enum MyCounters { NUM_RECORDS } private String mapTaskId; private String inputFile; private int noRecords = 0; public void configure(JobConf job) { mapTaskId = job.get("mapred.task.id"); inputFile = job.get("map.input.file"); } public void map(K key, V val, OutputCollector<K, V> output, Reporter reporter) throws IOException { // Process the <key, value> pair (assume this takes a while) // ... // ... // Let the framework know that we are alive, and kicking! // reporter.progress(); // Process some more // ... // ... // Increment the no. of <key, value> pairs processed ++noRecords; // Increment counters reporter.incrCounter(NUM_RECORDS, 1); // Every 100 records update application-level status if ((noRecords%100) == 0) { reporter.setStatus(mapTaskId + " processed " + noRecords + " from input-file: " + inputFile); } // Output the result output.collect(key, val); } }
Applications may write a custom MapRunnable
to exert greater
control on map processing e.g. multi-threaded Mapper
s etc.
void map(K1 key, V1 value, OutputCollector<K2,V2> output, Reporter reporter) throws IOException
Output pairs need not be of the same types as input pairs. A given
input pair may map to zero or many output pairs. Output pairs are
collected with calls to
OutputCollector.collect(Object,Object)
.
Applications can use the Reporter
provided to report progress
or just indicate that they are alive. In scenarios where the application
takes an insignificant amount of time to process individual key/value
pairs, this is crucial since the framework might assume that the task has
timed-out and kill that task. The other way of avoiding this is to set
mapred.task.timeout to a high-enough value (or even zero for no
time-outs).
key
- the input key.value
- the input value.output
- collects mapped keys and values.reporter
- facility to report progress.IOException
Copyright © 2009 The Apache Software Foundation