Apache MapReduce is a software framework used to analyze large amounts of data, and is the framework used most often with Apache Hadoop. MapReduce itself is beyond the scope of this document. A good place to get started with MapReduce is http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html. MapReduce version 2 (MR2) is now part of YARN.
This chapter discusses specific configuration steps you need to take to use MapReduce on data within HBase. In addition, it discusses other interactions and issues between HBase and MapReduce jobs.
There are two mapreduce packages in HBase, as in MapReduce itself: org.apache.hadoop.hbase.mapred and org.apache.hadoop.hbase.mapreduce. The former uses the old-style API and the latter the new style. The latter has more facilities, though you can usually find an equivalent in the older package. Pick the package that goes with your MapReduce deployment. When in doubt or starting over, pick org.apache.hadoop.hbase.mapreduce. In the notes below, we refer to o.a.h.h.mapreduce, but replace it with o.a.h.h.mapred if that is what you are using.
By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either the HBase configuration under $HBASE_CONF_DIR or the HBase classes. To give the MapReduce jobs the access they need, you could add hbase-site.xml to $HADOOP_HOME/conf and add the HBase JARs to $HADOOP_HOME/lib, then copy these changes across your cluster. Alternatively, you could edit $HADOOP_HOME/conf/hadoop-env.sh and add the HBase dependencies to the HADOOP_CLASSPATH variable. Neither of these approaches is recommended, because they pollute your Hadoop install with HBase references. They also require you to restart the Hadoop cluster before Hadoop can use the HBase data.
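For reference, a minimal sketch of the discouraged hadoop-env.sh edit follows; the /usr/local/hbase install path is an assumption for illustration only:
# In $HADOOP_HOME/conf/hadoop-env.sh (not recommended): put the HBase
# configuration and JARs on Hadoop's classpath.
# /usr/local/hbase is an assumed install location.
export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:/usr/local/hbase/conf:/usr/local/hbase/lib/*"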
Since HBase 0.90.x, HBase adds its dependency JARs to the job configuration itself. The dependencies only need to be available on the local CLASSPATH. The following example runs the bundled HBase RowCounter MapReduce job against a table named usertable. If you have not set the environment variables expected in the command (the parts prefixed by a $ sign and curly braces), you can use the actual system paths instead.
Be sure to use the correct version of the HBase JAR for your system. The backticks (` symbols) cause the shell to execute the sub-commands, setting the CLASSPATH as part of the command. This example assumes you use a BASH-compatible shell.
$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounter usertable
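If you are curious what the backticked sub-command contributes, you can run it on its own; hbase classpath prints the colon-separated list of configuration directories and JARs that HBase itself uses:
$ ${HBASE_HOME}/bin/hbase classpath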
When the command runs, the HBase JAR locates the dependencies it needs (ZooKeeper, Guava, and so on) on the passed HADOOP_CLASSPATH and adds the JARs to the MapReduce job configuration. See the source at TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) for how this is done.
The example may not work if you are running HBase from its build directory rather than an installed location. You may see an error like the following:
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper
If this occurs, try modifying the command as follows, so that it uses the HBase JARs from the target/ directory within the build environment.
$ HADOOP_CLASSPATH=${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar rowcounter usertable
Some MapReduce jobs that use HBase fail to launch. The symptom is an exception similar to the following:
Exception in thread "main" java.lang.IllegalAccessError: class com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass com.google.protobuf.LiteralByteString
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:792)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.hadoop.hbase.protobuf.ProtobufUtil.toScan(ProtobufUtil.java:818)
    at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.convertScanToString(TableMapReduceUtil.java:433)
    at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:186)
    at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:147)
    at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:270)
    at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:100)
    ...
This is caused by an optimization introduced in HBASE-9867 that inadvertently introduced a classloader dependency.
This affects both jobs using the -libjars option and "fat jar" jobs, which package their runtime dependencies in a nested lib folder.
In order to satisfy the new classloader requirements, hbase-protocol.jar must be included in Hadoop's classpath. See Section 7.1, “HBase, MapReduce, and the CLASSPATH” for current recommendations for resolving classpath errors. The following is included for historical purposes.
This can be resolved system-wide by including a reference to hbase-protocol.jar in Hadoop's lib directory, via a symlink or by copying the JAR into the new location.
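For example, the symlink approach might look like the following; the install paths are assumptions for illustration, and VERSION is a placeholder as elsewhere in this chapter:
# Link hbase-protocol into Hadoop's lib directory (illustrative paths)
$ ln -s /usr/local/hbase/lib/hbase-protocol-VERSION.jar /usr/local/hadoop/lib/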
This can also be achieved on a per-job launch basis by including it in the HADOOP_CLASSPATH environment variable at job submission time. When launching jobs that package their dependencies, all three of the following job launching commands satisfy this requirement:
$ HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass
$ HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass
$ HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass
For JARs that do not package their dependencies, the following command structure is necessary:
$ HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',')
...
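The tr sub-command converts the colon-separated classpath printed by hbase mapredcp into the comma-separated list of JARs that the -libjars option expects. You can see the conversion by running the pipeline on its own:
$ hbase mapredcp | tr ':' ','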
See also HBASE-10304 for further discussion of this issue.