org.apache.hadoop.mapred (Hadoop 1.2.2-SNAPSHOT API)

接口概要
接口	说明
AdminOperationsProtocol	Protocol for admin operations.
InputFormat<K,V>	`InputFormat` describes the input-specification for a Map-Reduce job.
InputSplit	`InputSplit` represents the data to be processed by an individual `Mapper`.
JobConfigurable	That what may be configured.
JobHistory.Listener	Callback interface for reading back log events from JobHistory.
JobTrackerMXBean	The MXBean interface for JobTrackerInfo
Mapper<K1,V1,K2,V2>	Maps input key/value pairs to a set of intermediate key/value pairs.
MapRunnable<K1,V1,K2,V2>	Expert: Generic interface for `Mapper`s.
OutputCollector<K,V>	Collects the `<key, value>` pairs output by `Mapper`s and `Reducer`s.
OutputFormat<K,V>	`OutputFormat` describes the output-specification for a Map-Reduce job.
Partitioner<K2,V2>	Partitions the key space.
RawKeyValueIterator	`RawKeyValueIterator` is an iterator used to iterate over the raw keys and values during sort/merge of intermediate data.
RecordReader<K,V>	`RecordReader` reads <key, value> pairs from an `InputSplit`.
RecordWriter<K,V>	`RecordWriter` writes the output <key, value> pairs to an output file.
Reducer<K2,V2,K3,V3>	Reduces a set of intermediate values which share a key to a smaller set of values.
Reporter	A facility for Map-Reduce applications to report progress and update counters, status information etc.
RunningJob	`RunningJob` is the user-interface to query for details on a running Map-Reduce job.
SequenceFileInputFilter.Filter	filter interface
TaskTrackerMXBean	MXBean interface for TaskTracker
TaskUmbilicalProtocol	Protocol that task child process uses to contact its parent process.

类概要
类	说明
CleanupQueue
ClusterStatus	Status information on the current state of the Map-Reduce cluster.
Counters	A set of named counters.
Counters.Counter	A counter record, comprising its name and value.
DefaultJobHistoryParser	Default parser for job history files.
DefaultTaskController	The default implementation for controlling tasks.
FileInputFormat<K,V>	A base class for file-based `InputFormat`.
FileOutputCommitter	An `OutputCommitter` that commits files specified in job output directory i.e.
FileOutputFormat<K,V>	A base class for `OutputFormat`.
FileSplit	A section of an input file.
HDFSMonitorThread
ID	A general identifier, which internally stores the id as an integer.
IsolationRunner	IsolationRunner is intended to facilitate debugging by re-running a specific task, given left-over task files for a (typically failed) past job.
JobClient	`JobClient` is the primary interface for the user-job to interact with the `JobTracker`.
JobClient.Renewer
JobConf	A map/reduce job configuration.
JobContext
JobEndNotifier
JobHistory	Provides methods for writing to and reading from job history.
JobHistory.HistoryCleaner	Delete history files older than one month.
JobHistory.JobInfo	Helper class for logging or reading back events related to job start, finish or failure.
JobHistory.MapAttempt	Helper class for logging or reading back events related to start, finish or failure of a Map Attempt on a node.
JobHistory.ReduceAttempt	Helper class for logging or reading back events related to start, finish or failure of a Map Attempt on a node.
JobHistory.Task	Helper class for logging or reading back events related to Task's start, finish or failure.
JobHistory.TaskAttempt	Base class for Map and Reduce TaskAttempts.
JobHistoryServer	`JobHistoryServer` is responsible for servicing all job history related requests from client.
JobID	JobID represents the immutable and unique identifier for the job.
JobInProgress	JobInProgress maintains all the info for keeping a Job on the straight and narrow.
JobLocalizer	Internal class responsible for initializing the job, not intended for users.
JobProfile	A JobProfile is a MapReduce primitive.
JobQueueInfo	Class that contains the information regarding the Job Queues which are maintained by the Hadoop Map/Reduce framework.
JobStatus	Describes the current status of a job.
JobTracker	JobTracker is the central location for submitting and tracking MR jobs in a network environment.
JvmTask
KeyValueLineRecordReader	This class treats a line in the input as a key/value pair separated by a separator character.
KeyValueTextInputFormat	An `InputFormat` for plain text files.
LineRecordReader	Treats keys as offset in file and value as line.
LineRecordReader.LineReader	已过时 Use `LineReader` instead.
LocalJobRunner	Implements MapReduce locally, in-process, for debugging.
MapFileOutputFormat	An `OutputFormat` that writes `MapFile`s.
MapReduceBase	Base class for `Mapper` and `Reducer` implementations.
MapReducePolicyProvider	`PolicyProvider` for Map-Reduce protocols.
MapRunner<K1,V1,K2,V2>	Default `MapRunnable` implementation.
MapTaskCompletionEventsUpdate	A class that represents the communication between the tasktracker and child tasks w.r.t the map task completion events.
MultiFileInputFormat<K,V>	已过时 Use `CombineFileInputFormat` instead
MultiFileSplit	已过时 Use `CombineFileSplit` instead
OutputCommitter	`OutputCommitter` describes the commit of task output for a Map-Reduce job.
OutputLogFilter	已过时 Use `Utils.OutputFileUtils.OutputLogFilter` instead.
QueueAclsInfo	Class to encapsulate Queue ACLs for a particular user.
RawHistoryFileServlet
SequenceFileAsBinaryInputFormat	InputFormat reading keys, values from SequenceFiles in binary (raw) format.
SequenceFileAsBinaryInputFormat.SequenceFileAsBinaryRecordReader	Read records from a SequenceFile as binary (raw) bytes.
SequenceFileAsBinaryOutputFormat	An `OutputFormat` that writes keys, values to `SequenceFile`s in binary(raw) format
SequenceFileAsBinaryOutputFormat.WritableValueBytes	Inner class used for appendRaw
SequenceFileAsTextInputFormat	This class is similar to SequenceFileInputFormat, except it generates SequenceFileAsTextRecordReader which converts the input keys and values to their String forms by calling toString() method.
SequenceFileAsTextRecordReader	This class converts the input keys and values to their String forms by calling toString() method.
SequenceFileInputFilter<K,V>	A class that allows a map/red job to work on a sample of sequence files.
SequenceFileInputFilter.FilterBase	base class for Filters
SequenceFileInputFilter.MD5Filter	This class returns a set of records by examing the MD5 digest of its key against a filtering frequency f.
SequenceFileInputFilter.PercentFilter	This class returns a percentage of records The percentage is determined by a filtering frequency f using the criteria record# % f == 0.
SequenceFileInputFilter.RegexFilter	Records filter by matching key to regex
SequenceFileInputFormat<K,V>	An `InputFormat` for `SequenceFile`s.
SequenceFileOutputFormat<K,V>	An `OutputFormat` that writes `SequenceFile`s.
SequenceFileRecordReader<K,V>	An `RecordReader` for `SequenceFile`s.
ShuffleExceptionTracker	This class is used to track shuffle exceptions.
SkipBadRecords	Utility class for skip bad records functionality.
Task	Base class for tasks.
Task.CombineOutputCollector<K,V>	OutputCollector for the combiner.
Task.CombinerRunner<K,V>
Task.CombineValuesIterator<KEY,VALUE>
Task.NewCombinerRunner<K,V>
Task.OldCombinerRunner<K,V>
TaskAttemptContext
TaskAttemptID	TaskAttemptID represents the immutable and unique identifier for a task attempt.
TaskCompletionEvent	This is used to track task completion events on job tracker.
TaskController	Controls initialization, finalization and clean up of tasks, and also the launching and killing of task JVMs.
TaskGraphServlet	The servlet that outputs svg graphics for map / reduce task statuses
TaskID	TaskID represents the immutable and unique identifier for a Map or Reduce Task.
TaskLog	A simple logger to handle the task-specific user logs.
TaskLogAppender	A simple log4j-appender for the task child's map-reduce system logs.
TaskLogServlet	A servlet that is run by the TaskTrackers to provide the task logs via http.
TaskLogsTruncater	The class for truncating the user logs.
TaskReport	A report on the state of a task.
TaskStatus	Describes the current status of a task.
TaskTracker	TaskTracker is a process that starts and tracks MR Tasks in a networked environment.
TaskTracker.MapOutputServlet	This class is used in TaskTracker's Jetty to serve the map outputs to other nodes.
TaskTrackerMetricsSource	Instrumentation for metrics v2
TaskTrackerStatus	A TaskTrackerStatus is a MapReduce primitive.
TextInputFormat	An `InputFormat` for plain text files.
TextOutputFormat<K,V>	An `OutputFormat` that writes plain text files.
TextOutputFormat.LineRecordWriter<K,V>
UserLogCleaner	This is used only in UserLogManager, to manage cleanup of user logs.
Utils	A utility class.
Utils.OutputFileUtils
Utils.OutputFileUtils.OutputFilesFilter	This class filters output(part) files from the given directory It does not accept files with filenames _logs and _SUCCESS.
Utils.OutputFileUtils.OutputLogFilter	This class filters log files from directory given It doesnt accept paths having _logs.

枚举概要
枚举	说明
FileInputFormat.Counter
FileOutputFormat.Counter
JobClient.TaskStatusFilter
JobHistory.Keys	Job history files contain key="value" pairs, where keys belong to this enum.
JobHistory.RecordTypes	Record types are identifiers for each line of log in history files.
JobHistory.Values	This enum contains some of the values commonly used by history log events.
JobInProgress.Counter
JobPriority	Used to describe the priority of the running job.
JobTracker.SafeModeAction	JobTracker SafeMode
JobTracker.State
Operation	Generic operation that maps to the dependent set of ACLs that drive the authorization of the operation.
Task.Counter
TaskCompletionEvent.Status
TaskLog.LogName	The filter for userlogs.
TaskStatus.Phase
TaskStatus.State
TIPStatus	The states of a `TaskInProgress` as seen by the JobTracker.

异常错误概要
异常错误	说明
Counters.CountersExceededException	Counter exception thrown when the number of counters exceed the limit
FileAlreadyExistsException	Used when target file already exists for any operation and is not configured to be overwritten.
InvalidFileTypeException	Used when file type differs from the desired file type. like getting a file when a directory is expected.
InvalidInputException	This class wraps a list of problems with the input, so that the user can get a list of problems together instead of finding and fixing them one by one.
InvalidJobConfException	This exception is thrown when jobconf misses some mendatory attributes or value of some attributes is invalid.
JobTracker.IllegalStateException	A client tried to submit a job before the Job Tracker was ready.
JobTrackerNotYetInitializedException	This exception is thrown when the JobTracker is still initializing and not yet operational.
SafeModeException	This exception is thrown when the JobTracker is in safe mode.

程序包org.apache.hadoop.mapred的说明

A software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) parallelly on large clusters (thousands of nodes) built of commodity hardware in a reliable, fault-tolerant manner.

A Map-Reduce job usually splits the input data-set into independent chunks which processed by map tasks in completely parallel manner, followed by reduce tasks which aggregating their output. Typically both the input and the output of the job are stored in a FileSystem. The framework takes care of monitoring tasks and re-executing failed ones. Since, usually, the compute nodes and the storage nodes are the same i.e. Hadoop's Map-Reduce framework and Distributed FileSystem are running on the same set of nodes, tasks are effectively scheduled on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

The Map-Reduce framework operates exclusively on <key, value> pairs i.e. the input to the job is viewed as a set of <key, value> pairs and the output as another, possibly different, set of <key, value> pairs. The keys and values have to be serializable as Writables and additionally the keys have to be WritableComparables in order to facilitate grouping by the framework.

Data flow:

Applications typically implement Mapper.map(Object, Object, OutputCollector, Reporter) and Reducer.reduce(Object, Iterator, OutputCollector, Reporter) methods. The application-writer also specifies various facets of the job such as input and output locations, the Partitioner, InputFormat & OutputFormat implementations to be used etc. as a JobConf. The client program, JobClient, then submits the job to the framework and optionally monitors it.

The framework spawns one map task per InputSplit generated by the InputFormat of the job and calls Mapper.map(Object, Object, OutputCollector, Reporter) with each <key, value> pair read by the RecordReader from the InputSplit for the task. The intermediate outputs of the maps are then grouped by keys and optionally aggregated by combiner. The key space of intermediate outputs are paritioned by the Partitioner, where the number of partitions is exactly the number of reduce tasks for the job.

The reduce tasks fetch the sorted intermediate outputs of the maps, via http, merge the <key, value> pairs and call Reducer.reduce(Object, Iterator, OutputCollector, Reporter) for each <key, list of values> pair. The output of the reduce tasks' is stored on the FileSystem by the RecordWriter provided by the OutputFormat of the job.

Map-Reduce application to perform a distributed grep:


public class Grep extends Configured implements Tool {

  // map: Search for the pattern specified by 'grep.mapper.regex' &
  //      'grep.mapper.regex.group'

  class GrepMapper<K, Text> 
  extends MapReduceBase  implements Mapper<K, Text, Text, LongWritable> {

    private Pattern pattern;
    private int group;

    public void configure(JobConf job) {
      pattern = Pattern.compile(job.get("grep.mapper.regex"));
      group = job.getInt("grep.mapper.regex.group", 0);
    }

    public void map(K key, Text value,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter)
    throws IOException {
      String text = value.toString();
      Matcher matcher = pattern.matcher(text);
      while (matcher.find()) {
        output.collect(new Text(matcher.group(group)), new LongWritable(1));
      }
    }
  }

  // reduce: Count the number of occurrences of the pattern

  class GrepReducer<K> extends MapReduceBase
  implements Reducer<K, LongWritable, K, LongWritable> {

    public void reduce(K key, Iterator<LongWritable> values,
                       OutputCollector<K, LongWritable> output,
                       Reporter reporter)
    throws IOException {

      // sum all values for this key
      long sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }

      // output sum
      output.collect(key, new LongWritable(sum));
    }
  }
  
  public int run(String[] args) throws Exception {
    if (args.length < 3) {
      System.out.println("Grep <inDir> <outDir> <regex> [<group>]");
      ToolRunner.printGenericCommandUsage(System.out);
      return -1;
    }

    JobConf grepJob = new JobConf(getConf(), Grep.class);
    
    grepJob.setJobName("grep");

    FileInputFormat.setInputPaths(grepJob, new Path(args[0]));
    FileOutputFormat.setOutputPath(grepJob, args[1]);

    grepJob.setMapperClass(GrepMapper.class);
    grepJob.setCombinerClass(GrepReducer.class);
    grepJob.setReducerClass(GrepReducer.class);

    grepJob.set("mapred.mapper.regex", args[2]);
    if (args.length == 4)
      grepJob.set("mapred.mapper.regex.group", args[3]);

    grepJob.setOutputFormat(SequenceFileOutputFormat.class);
    grepJob.setOutputKeyClass(Text.class);
    grepJob.setOutputValueClass(LongWritable.class);

    JobClient.runJob(grepJob);

    return 0;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new Grep(), args);
    System.exit(res);
  }

}

Notice how the data-flow of the above grep job is very similar to doing the same via the unix pipeline:

cat input/*   |   grep   |   sort    |   uniq -c   >   out

      input   |    map   |  shuffle  |   reduce    >   out

Hadoop Map-Reduce applications need not be written in Java^TM only. Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer. Hadoop Pipes is a SWIG-compatible C++ API to implement Map-Reduce applications (non JNI^TM based).

See Google's original Map/Reduce paper for background information.

Java and JNI are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.

程序包 org.apache.hadoop.mapred

程序包org.apache.hadoop.mapred的说明