HBase can be used as a data source (TableInputFormat) and as a data sink (TableOutputFormat or MultiTableOutputFormat) for MapReduce jobs. When writing MapReduce jobs that read or write HBase, it is advisable to subclass TableMapper and/or TableReducer.
See the do-nothing pass-through classes IdentityTableMapper
and IdentityTableReducer
for basic usage. For a more involved example, see RowCounter
or review the org.apache.hadoop.hbase.mapreduce.TestTableMapReduce
unit test.
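As a minimal sketch (not one of the shipped examples), a mapper that reads from HBase subclasses TableMapper and receives each row as a Result keyed by an ImmutableBytesWritable row key; the output key/value types below are illustrative assumptions.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// A minimal TableMapper sketch: emits each row key with a count of 1.
public class MyRowMapper extends TableMapper<Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text outKey = new Text();

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    // TableInputFormat hands the mapper one HBase row (a Result) per call.
    outKey.set(row.get(), row.getOffset(), row.getLength());
    context.write(outKey, ONE);
  }
}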
If you run MapReduce jobs that use HBase as a source or sink, you need to specify the source and sink table and column names in your configuration.
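For example, the table to read and the columns to scan are typically wired into the job configuration through a Scan passed to TableMapReduceUtil (in org.apache.hadoop.hbase.mapreduce); the table, family, and qualifier names below are placeholders, not values prescribed by this guide.

Configuration conf = HBaseConfiguration.create();
Job job = Job.getInstance(conf, "ExampleRead");
job.setJarByClass(MyDriver.class);                 // MyDriver: placeholder driver class

Scan scan = new Scan();
scan.setCaching(500);                              // larger caching cuts scanner round-trips
scan.setCacheBlocks(false);                        // leave the block cache alone on full scans
scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"));  // assumed family:qualifier

TableMapReduceUtil.initTableMapperJob(
    "mytable",            // assumed source table name
    scan,                 // what to read
    MyRowMapper.class,    // the TableMapper sketch above
    Text.class,           // mapper output key class
    IntWritable.class,    // mapper output value class
    job);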
When you read from HBase, the TableInputFormat requests the list of regions from HBase and creates either one map per region or mapreduce.job.maps maps, whichever is smaller. If your job only has two maps, raise mapreduce.job.maps to a number greater than the number of regions. Maps will run on the adjacent TaskTracker if you are running a TaskTracker and RegionServer per
node. When writing to HBase, it may make sense to avoid the Reduce step and write back into
HBase from within your map. This approach works when your job does not need the sort and
collation that MapReduce does on the map-emitted data. On insert, HBase 'sorts' so there is
no point double-sorting (and shuffling data around your MapReduce cluster) unless you need
to. If you do not need the Reduce, your map might emit counts of records processed for
reporting at the end of the job; set the number of Reduces to zero and use
TableOutputFormat. If running the Reduce step makes sense in your case, you should typically
use multiple reducers so that load is spread across the HBase cluster.
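Returning to the zero-Reduce case, here is a sketch (the sink table name is an assumption): passing a null reducer to TableMapReduceUtil and setting the reduce count to zero lets the map emit Puts straight through TableOutputFormat.

// In this map-only pattern the mapper's output types are ImmutableBytesWritable / Put.
TableMapReduceUtil.initTableReducerJob(
    "target_table",       // assumed sink table name
    null,                 // no reducer: Puts emitted by the map go straight to the table
    job);
job.setNumReduceTasks(0); // skip the sort/shuffle entirely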
A new HBase partitioner, the HRegionPartitioner, can run as many reducers as there are existing regions. The HRegionPartitioner is suitable when your table is large and your upload will not greatly alter the number of existing regions upon completion. Otherwise use the default partitioner.
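A sketch of wiring it in (the table and reducer class names are assumptions); when HRegionPartitioner is passed, initTableReducerJob also keeps the configured reducer count from exceeding the current number of regions:

TableMapReduceUtil.initTableReducerJob(
    "mytable",                  // assumed sink table name
    MyTableReducer.class,       // an assumed TableReducer subclass
    job,
    HRegionPartitioner.class);  // each reducer's output is routed to a single region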