Chapter 7. HBase and MapReduce

Table of Contents

7.1. HBase, MapReduce, and the CLASSPATH
7.2. MapReduce Scan Caching
7.3. Bundled HBase MapReduce Jobs
7.4. HBase as a MapReduce Job Data Source and Data Sink
7.5. Writing HFiles Directly During Bulk Import
7.6. RowCounter Example
7.7. Map-Task Splitting
7.7.1. The Default HBase MapReduce Splitter
7.7.2. Custom Splitters
7.8. HBase MapReduce Examples
7.8.1. HBase MapReduce Read Example
7.8.2. HBase MapReduce Read/Write Example
7.8.3. HBase MapReduce Read/Write Example With Multi-Table Output
7.8.4. HBase MapReduce Summary to HBase Example
7.8.5. HBase MapReduce Summary to File Example
7.8.6. HBase MapReduce Summary to HBase Without Reducer
7.8.7. HBase MapReduce Summary to RDBMS
7.9. Accessing Other HBase Tables in a MapReduce Job
7.10. Speculative Execution

Apache MapReduce is a software framework used to analyze large amounts of data, and is the framework used most often with Apache Hadoop. MapReduce itself is out of the scope of this document. A good place to get started with MapReduce is http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html. MapReduce version 2 (MR2) is now part of YARN.

This chapter discusses specific configuration steps you need to take to use MapReduce on data within HBase. In addition, it discusses other interactions and issues between HBase and MapReduce jobs.

mapred and mapreduce

There are two mapreduce packages in HBase, as in MapReduce itself: org.apache.hadoop.hbase.mapred and org.apache.hadoop.hbase.mapreduce. The former uses the old-style API and the latter the new style. The latter has more facilities, though you can usually find an equivalent in the older package. Pick the package that matches your MapReduce deployment. When in doubt or starting over, pick org.apache.hadoop.hbase.mapreduce. In the notes below, we refer to o.a.h.h.mapreduce, but replace it with o.a.h.h.mapred if that is what you are using.
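
For example, here is a minimal sketch of a read-only job driver written against the new-style API. The table name usertable and the do-nothing MyMapper class are placeholders; complete, working examples appear in Section 7.8, “HBase MapReduce Examples”.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class MyReadJob {

  // A do-nothing mapper; a real job would process each Result here.
  public static class MyMapper extends TableMapper<Text, Text> {
    public void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // process data for the row from the Result instance.
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration config = HBaseConfiguration.create();
    Job job = new Job(config, "ExampleRead");
    job.setJarByClass(MyReadJob.class);  // class that contains the mapper

    Scan scan = new Scan();
    scan.setCaching(500);        // the default, 1, would be bad for MapReduce
    scan.setCacheBlocks(false);  // always set this to false for MR jobs

    // Everything here comes from o.a.h.h.mapreduce, the new-style package.
    TableMapReduceUtil.initTableMapperJob(
        "usertable",     // input table
        scan,            // Scan instance to control CF and attribute selection
        MyMapper.class,  // mapper
        null,            // mapper output key (none; we emit nothing)
        null,            // mapper output value
        job);
    job.setOutputFormatClass(NullOutputFormat.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}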

7.1. HBase, MapReduce, and the CLASSPATH

By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either the HBase configuration under $HBASE_CONF_DIR or the HBase classes.

To give the MapReduce jobs the access they need, you could add hbase-site.xml to $HADOOP_HOME/conf and add the HBase JARs to $HADOOP_HOME/lib, then copy these changes across your cluster. Alternatively, you could edit $HADOOP_HOME/conf/hadoop-env.sh and add the HBase dependencies to the HADOOP_CLASSPATH variable. However, neither approach is recommended because it will pollute your Hadoop install with HBase references. It also requires you to restart the Hadoop cluster before Hadoop can use the HBase data.

Since HBase 0.90.x, HBase adds its dependency JARs to the job configuration itself. The dependencies only need to be available on the local CLASSPATH. The following example runs the bundled HBase RowCounter MapReduce job against a table named usertable. If you have not set the environment variables expected in the command (the parts prefixed by a $ sign and curly braces), you can use the actual system paths instead. Be sure to use the correct version of the HBase JAR for your system. The backticks (` symbols) cause the shell to execute the sub-commands, setting the CLASSPATH as part of the command. This example assumes you use a BASH-compatible shell.

$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounter usertable

When the command runs, internally, the HBase JAR finds its dependencies, such as ZooKeeper and Guava, on the passed HADOOP_CLASSPATH and adds the JARs to the MapReduce job configuration. See the source at TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) for how this is done.
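
If you build your job with the TableMapReduceUtil.initTable* helper methods, this shipping of dependency JARs happens for you. If you assemble the Job by hand, you can invoke the same mechanism directly; the following is a minimal sketch, assuming a Job you have already configured (the class and method names are placeholders):

import java.io.IOException;

import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class ShipDependencies {
  // Find the JARs that provide HBase's runtime dependencies (ZooKeeper,
  // Guava, and the rest) on the local CLASSPATH and attach them to the
  // job configuration so the tasks can load them on the cluster.
  static void shipHBaseDependencies(Job job) throws IOException {
    TableMapReduceUtil.addDependencyJars(job);
  }
}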

Note

The example may not work if you are running HBase from its build directory rather than an installed location. You may see an error like the following:

java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper

If this occurs, try modifying the command as follows, so that it uses the HBase JARs from the target/ directory within the build environment.

$ HADOOP_CLASSPATH=${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar rowcounter usertable

Notice to MapReduce users of HBase 0.96.1 and above

Some MapReduce jobs that use HBase fail to launch. The symptom is an exception similar to the following:

Exception in thread "main" java.lang.IllegalAccessError: class
    com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass
    com.google.protobuf.LiteralByteString
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:792)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.hadoop.hbase.protobuf.ProtobufUtil.toScan(ProtobufUtil.java:818)
    at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.convertScanToString(TableMapReduceUtil.java:433)
    at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:186)
    at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:147)
    at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:270)
    at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:100)
...

This is caused by an optimization introduced in HBASE-9867 that inadvertently added a classloader dependency.

This affects both jobs that use the -libjars option and "fat jar" jobs, which package their runtime dependencies in a nested lib folder.

In order to satisfy the new classloader requirements, hbase-protocol.jar must be included in Hadoop's classpath. See Section 7.1, “HBase, MapReduce, and the CLASSPATH” for current recommendations for resolving classpath errors. The following is included for historical purposes.

This can be resolved system-wide by including a reference to the hbase-protocol.jar in Hadoop's lib directory, via a symlink or by copying the JAR into the new location.

This can also be achieved on a per-job launch basis by including it in the HADOOP_CLASSPATH environment variable at job submission time. When launching jobs that package their dependencies, all three of the following job launching commands satisfy this requirement:

$ HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass
$ HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass
$ HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass

For jars that do not package their dependencies, the following command structure is necessary:

$ HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',') ...

See also HBASE-10304 for further discussion of this issue.
