Interface | Description |
---|---|
TableMap<K extends org.apache.hadoop.io.WritableComparable<? super K>,V> | Deprecated |
TableReduce<K extends org.apache.hadoop.io.WritableComparable,V> | Deprecated |
Class | Description |
---|---|
Driver | Deprecated |
GroupingTableMap | Deprecated |
HRegionPartitioner<K2,V2> | Deprecated |
IdentityTableMap | Deprecated |
IdentityTableReduce | Deprecated |
RowCounter | Deprecated |
TableInputFormat | Deprecated |
TableInputFormatBase | Deprecated |
TableMapReduceUtil | Deprecated |
TableOutputFormat | Deprecated |
TableRecordReader | Deprecated |
TableRecordReaderImpl | Deprecated |
TableSnapshotInputFormat | TableSnapshotInputFormat allows a MapReduce job to run over a table snapshot. |
TableSplit | Deprecated |
MapReduce jobs deployed to a MapReduce cluster do not by default have access to the HBase configuration under $HBASE_CONF_DIR nor to HBase classes. You could add hbase-site.xml to $HADOOP_HOME/conf and the hbase-X.X.X.jar to $HADOOP_HOME/lib and copy these changes across your cluster, but the cleanest means of adding the hbase configuration and classes to the cluster CLASSPATH is by uncommenting HADOOP_CLASSPATH in $HADOOP_HOME/conf/hadoop-env.sh and adding the hbase dependencies there. For example, here is how you would amend hadoop-env.sh to add the built hbase jar, zookeeper (needed by the hbase client), the hbase conf, and the PerformanceEvaluation class from the built hbase test jar to the hadoop CLASSPATH:

    # Extra Java CLASSPATH elements. Optional.
    # export HADOOP_CLASSPATH=
    export HADOOP_CLASSPATH=$HBASE_HOME/build/hbase-X.X.X.jar:$HBASE_HOME/build/hbase-X.X.X-test.jar:$HBASE_HOME/conf:${HBASE_HOME}/lib/zookeeper-X.X.X.jar

Expand $HBASE_HOME in the above appropriately to suit your local environment.

After copying the above change around your cluster (and restarting), this is how you would run the PerformanceEvaluation MR job to put up 4 clients (presumes a ready mapreduce cluster):

    $HADOOP_HOME/bin/hadoop org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 4

The PerformanceEvaluation class will be found on the CLASSPATH because you added $HBASE_HOME/build/test to HADOOP_CLASSPATH.

Another possibility, if for example you do not have access to hadoop-env.sh or are unable to restart the hadoop cluster, is bundling the hbase jar into a mapreduce job jar, adding it and its dependencies under the job jar's lib/ directory and the hbase conf into a job jar conf/ directory.

HBase can be used as a data source, TableInputFormat, and as a data sink, TableOutputFormat, for MapReduce jobs. Writing MapReduce jobs that read or write HBase, you'll probably want to subclass TableMap and/or TableReduce. See the do-nothing pass-through classes IdentityTableMap and IdentityTableReduce for basic usage. For a more involved example, see BuildTableIndex or review the org.apache.hadoop.hbase.mapred.TestTableMapReduce unit test.
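
For a concrete starting point, here is a minimal, map-only sketch (not code shipped with this package) that subclasses TableMap and wires the job together with TableMapReduceUtil. The table name "mytable", the column spec "mycf:", the output path argument, and the 0.94-era client calls it leans on are assumptions for illustration, not anything mandated above.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapred.TableMap;
import org.apache.hadoop.hbase.mapred.TableMapReduceUtil;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/** Map-only sketch: scans a table and emits (row, cell count) pairs to HDFS. */
public class CellCountExample {

  /** A TableMap subclass in the spirit of IdentityTableMap, but emitting counts. */
  public static class CellCountMap extends MapReduceBase
      implements TableMap<ImmutableBytesWritable, IntWritable> {
    private final IntWritable count = new IntWritable();

    public void map(ImmutableBytesWritable row, Result values,
        OutputCollector<ImmutableBytesWritable, IntWritable> output,
        Reporter reporter) throws IOException {
      count.set(values.size());            // number of cells returned for this row
      output.collect(row, count);
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(HBaseConfiguration.create(), CellCountExample.class);
    job.setJobName("cell-count");
    // Wires up TableInputFormat, the source table/columns, the mapper, and the
    // map output key/value classes in one call.
    TableMapReduceUtil.initTableMapJob("mytable", "mycf:", CellCountMap.class,
        ImmutableBytesWritable.class, IntWritable.class, job);
    job.setNumReduceTasks(0);                                // map-only; no sort/shuffle needed
    FileOutputFormat.setOutputPath(job, new Path(args[0]));  // plain-text output dir
    JobClient.runJob(job);
  }
}
```

Run it like any other job jar (for example via $HADOOP_HOME/bin/hadoop jar), with the hbase configuration and classes on the CLASSPATH as described above.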

Running mapreduce jobs that have hbase as source or sink, you'll need to specify the source/sink table and column names in your configuration.

Reading from hbase, the TableInputFormat asks hbase for the list of regions and makes a map-per-region or mapred.map.tasks maps, whichever is smaller (if your job only has two maps, up mapred.map.tasks to a number greater than the number of regions). Maps will run on the adjacent TaskTracker if you are running a TaskTracker and RegionServer per node.
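
Concretely, the source-side configuration boils down to a handful of JobConf settings. The sketch below spells them out with placeholder table ("mytable") and column ("mycf:") names, under the assumption that the old-API TableInputFormat reads the table name from the job's input path and the scanned columns from TableInputFormat.COLUMN_LIST, which is roughly what TableMapReduceUtil.initTableMapJob sets for you; the map-task figure is arbitrary.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapred.TableInputFormat;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

/** Configuration-only sketch of an hbase-as-source job; nothing is submitted. */
public class SourceConfSketch {
  public static void main(String[] args) {
    JobConf job = new JobConf(HBaseConfiguration.create(), SourceConfSketch.class);
    job.setInputFormat(TableInputFormat.class);
    // The source table rides in as the job's input "path"; the columns to scan
    // go in under TableInputFormat.COLUMN_LIST.
    FileInputFormat.addInputPaths(job, "mytable");   // placeholder table name
    job.set(TableInputFormat.COLUMN_LIST, "mycf:");  // placeholder column spec
    // TableInputFormat makes min(number of regions, mapred.map.tasks) maps, so keep
    // mapred.map.tasks at or above the region count to get a map per region.
    job.setNumMapTasks(500);                         // arbitrary example value
  }
}
```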

Writing, it may make sense to avoid the reduce step and write yourself back into hbase from inside your map. You'd do this when your job does not need the sort and collation that mapreduce does on the map-emitted data; on insert, hbase 'sorts' so there is no point double-sorting (and shuffling data around your mapreduce cluster) unless you need to. If you do not need the reduce, you might just have your map emit counts of records processed so the framework's report at the end of your job has meaning, or set the number of reduces to zero and use TableOutputFormat. See the example code below. If running the reduce step makes sense in your case, it's usually better to have lots of reducers so load is spread across the hbase cluster.

There is also a new hbase partitioner that will run as many reducers as there are currently existing regions. The HRegionPartitioner is suitable when your table is large and your upload is not such that it will greatly alter the number of existing regions when done; otherwise use the default partitioner.
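
To make the zero-reduce pattern concrete, here is a rough sketch (again, not taken from this package's own examples) that scans a source table and writes a marker cell straight back through TableOutputFormat with the reduce count set to zero. The table name "mytable", family "mycf", qualifier "seen", and the 0.94-era Put.add call are placeholder assumptions.

```java
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapred.IdentityTableReduce;
import org.apache.hadoop.hbase.mapred.TableMap;
import org.apache.hadoop.hbase.mapred.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/** Map-only job: reads every row and writes a marker cell back into the table. */
public class WriteBackExample {

  public static class MarkRowMap extends MapReduceBase
      implements TableMap<ImmutableBytesWritable, Put> {
    public void map(ImmutableBytesWritable row, Result values,
        OutputCollector<ImmutableBytesWritable, Put> output, Reporter reporter)
        throws IOException {
      Put put = new Put(row.get());
      // Hypothetical marker cell; family/qualifier/value are placeholders.
      put.add(Bytes.toBytes("mycf"), Bytes.toBytes("seen"), Bytes.toBytes("true"));
      output.collect(row, put);
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(HBaseConfiguration.create(), WriteBackExample.class);
    job.setJobName("mark-rows");
    TableMapReduceUtil.initTableMapJob("mytable", "mycf:", MarkRowMap.class,
        ImmutableBytesWritable.class, Put.class, job);
    // Sets TableOutputFormat as the sink; IdentityTableReduce would just pass Puts
    // through, but with zero reduces the map output goes straight to the sink,
    // skipping the sort/shuffle this job does not need.
    TableMapReduceUtil.initTableReduceJob("mytable", IdentityTableReduce.class, job);
    job.setNumReduceTasks(0);
    JobClient.runJob(job);
  }
}
```

If the reduce step is kept, initTableReduceJob also has an overload that accepts a partitioner class, which appears to be the intended hook for plugging in HRegionPartitioner.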

See RowCounter. You should be able to run it by doing:

    % ./bin/hadoop jar hbase-X.X.X.jar

This will invoke the hbase MapReduce Driver class. Select 'rowcounter' from the choice of jobs offered. You may need to add the hbase conf directory to HADOOP_CLASSPATH in $HADOOP_HOME/conf/hadoop-env.sh so the rowcounter gets pointed at the right hbase cluster (or build a new jar with an appropriate hbase-site.xml built into your job jar).

See org.apache.hadoop.hbase.PerformanceEvaluation from hbase src/test. It runs a mapreduce job that runs concurrent clients reading and writing hbase.
Copyright © 2014 The Apache Software Foundation. All rights reserved.