程序包 org.apache.hadoop.mapred.pipes

Hadoop Pipes allows C++ code to use Hadoop DFS and map/reduce.

请参阅: 说明

程序包org.apache.hadoop.mapred.pipes的说明

Hadoop Pipes allows C++ code to use Hadoop DFS and map/reduce. The primary approach is to split the C++ code into a separate process that does the application specific code. In many ways, the approach will be similar to Hadoop streaming, but using Writable serialization to convert the types into bytes that are sent to the process via a socket.

The class org.apache.hadoop.mapred.pipes.Submitter has a public static method to submit a job as a JobConf and a main method that takes an application and optional configuration file, input directories, and output directory. The cli for the main looks like:

bin/hadoop pipes \
  [-input inputDir] \
  [-output outputDir] \
  [-jar applicationJarFile] \
  [-inputformat class] \
  [-map class] \
  [-partitioner class] \
  [-reduce class] \
  [-writer class] \
  [-program program url] \ 
  [-conf configuration file] \
  [-D property=value] \
  [-fs local|namenode:port] \
  [-jt local|jobtracker:port] \
  [-files comma separated list of files] \ 
  [-libjars comma separated list of jars] \
  [-archives comma separated list of archives] 

The application programs link against a thin C++ wrapper library that handles the communication with the rest of the Hadoop system. The C++ interface is "swigable" so that interfaces can be generated for python and other scripting languages. All of the C++ functions and classes are in the HadoopPipes namespace. The job may consist of any combination of Java and C++ RecordReaders, Mappers, Paritioner, Combiner, Reducer, and RecordWriter.

Hadoop Pipes has a generic Java class for handling the mapper and reducer (PipesMapRunner and PipesReducer). They fork off the application program and communicate with it over a socket. The communication is handled by the C++ wrapper library and the PipesMapRunner and PipesReducer.

The application program passes in a factory object that can create the various objects needed by the framework to the runTask function. The framework creates the Mapper or Reducer as appropriate and calls the map or reduce method to invoke the application's code. The JobConf is available to the application.

The Mapper and Reducer objects get all of their inputs, outputs, and context via context objects. The advantage of using the context objects is that their interface can be extended with additional methods without breaking clients. Although this interface is different from the current Java interface, the plan is to migrate the Java interface in this direction.

Although the Java implementation is typed, the C++ interfaces of keys and values is just a byte buffer. Since STL strings provide precisely the right functionality and are standard, they will be used. The decision to not use stronger types was to simplify the interface.

The application can also define combiner functions. The combiner will be run locally by the framework in the application process to avoid the round trip to the Java process and back. Because the compare function is not available in C++, the combiner will use memcmp to sort the inputs to the combiner. This is not as general as the Java equivalent, which uses the user's comparator, but should cover the majority of the use cases. As the map function outputs key/value pairs, they will be buffered. When the buffer is full, it will be sorted and passed to the combiner. The output of the combiner will be sent to the Java process.

The application can also set a partition function to control which key is given to a particular reduce. If a partition function is not defined, the Java one will be used. The partition function will be called by the C++ framework before the key/value pair is sent back to Java.

The application programs can also register counters with a group and a name and also increment the counters and get the counter values. Word-count example illustrating pipes usage with counters is available at wordcount-simple.cc

Copyright © 2009 The Apache Software Foundation