Before HBase can use a given compressor, its libraries need to be available. Due to licensing issues, only GZ compression is available to HBase (via native Java libraries) in a default installation. Other compression libraries are available via the shared library bundled with your Hadoop. The Hadoop native library needs to be findable when HBase starts. See the Hadoop documentation on native libraries for details.
A new configuration setting was introduced in HBase 0.95 to check the Master to determine which data block encoders are installed and configured on it, and to assume that the entire cluster is configured the same. This option, hbase.master.check.compression, defaults to true. This prevents the situation described in HBASE-6370, where a table is created or modified to support a codec that a region server does not support, leading to failures that take a long time to occur and are difficult to debug.
If hbase.master.check.compression is enabled, libraries for all desired compressors need to be installed and configured on the Master, even if the Master does not run a region server.
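If you need to override the default (for example, on a test cluster), the option can be set in hbase-site.xml. The following is only a minimal sketch; the property name comes from the text above, and disabling the check re-exposes the HBASE-6370 failure mode:
<!-- hbase-site.xml: disable the Master-side codec check (default is true) -->
<property>
  <name>hbase.master.check.compression</name>
  <value>false</value>
</property>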
HBase uses Java's built-in GZip support unless the native Hadoop libraries are available on the CLASSPATH. The recommended way to add libraries to the CLASSPATH is to set the environment variable HBASE_LIBRARY_PATH for the user running HBase. If native libraries are not available and Java's GZIP is used, Got brand-new compressor reports will be present in the logs. See Section 15.9.2.10, “Logs flooded with '2011-01-10 12:40:48,407 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor' messages”.
HBase cannot ship with LZO because of incompatibility between HBase, which uses an Apache Software License (ASL), and LZO, which uses a GPL license. See the Using LZO Compression wiki page for information on configuring LZO support for HBase.
If you depend upon LZO compression, consider configuring your RegionServers to fail to start if LZO is not available. See Section E.3.1.7, “Enforce Compression Settings On a RegionServer”.
LZ4 support is bundled with Hadoop. Make sure the Hadoop shared library (libhadoop.so) is accessible when you start HBase. After configuring your platform (see ???), you can make a symbolic link from HBase to the native Hadoop libraries. This assumes the two software installs are colocated. For example, if your platform is Linux-amd64-64:
$ cd $HBASE_HOME
$ mkdir lib/native
$ ln -s $HADOOP_HOME/lib/native lib/native/Linux-amd64-64
Use the compression tool to check that LZ4 is installed on all nodes. Start up (or restart) HBase. Afterward, you can create and alter tables to enable LZ4 as a compression codec:
hbase(main):003:0> alter 'TestTable', {NAME => 'info', COMPRESSION => 'LZ4'}
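To confirm that LZ4 is actually usable on a node before relying on it, one hedged check is to run the CompressionTest tool described in Section E.3.1.6, “CompressionTest” against a throwaway path (the file path here is only illustrative):
$ hbase org.apache.hadoop.hbase.util.CompressionTest file:///tmp/lz4-test lz4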
HBase does not ship with Snappy support because of licensing issues. You can install Snappy binaries (for instance, by using yum install snappy on CentOS) or build Snappy from source. After installing Snappy, search for the shared library, which will be called libsnappy.so.X, where X is a number. If you built from source, copy the shared library to a known location on your system, such as /opt/snappy/lib/.
In addition to the Snappy library, HBase also needs access to the Hadoop shared library, which will be called something like libhadoop.so.X.Y, where X and Y are both numbers. Make note of the location of the Hadoop library, or copy it to the same location as the Snappy library.
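The exact filenames and locations vary by system, so a hedged sketch of locating both libraries (and optionally colocating them) might look like the following; the copy destination matches the /opt/snappy/lib/ example above, and the Hadoop library path and version are only illustrative:
# locate the Snappy and Hadoop shared libraries (names and versions vary)
$ find / -name 'libsnappy.so*' 2>/dev/null
$ find / -name 'libhadoop.so*' 2>/dev/null
# optionally copy the Hadoop library next to the Snappy library
$ cp /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0 /opt/snappy/lib/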
The Snappy and Hadoop libraries need to be available on each node of your cluster. See Section E.3.1.6, “CompressionTest” to find out how to test that this is the case.
See Section E.3.1.7, “Enforce Compression Settings On a RegionServer” to configure your RegionServers to fail to start if a given compressor is not available.
Each of these library locations needs to be added to the environment variable HBASE_LIBRARY_PATH for the operating system user that runs HBase. You need to restart the RegionServer for the changes to take effect.
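For example, assuming Snappy was installed to /opt/snappy/lib/ as above and the Hadoop shared library lives in the usual Hadoop native directory, the relevant hbase-env.sh line could look roughly like this (both paths are illustrative):
# conf/hbase-env.sh -- both native library directories, colon-separated
export HBASE_LIBRARY_PATH=/opt/snappy/lib:$HADOOP_HOME/lib/native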
You can use the CompressionTest tool to verify that your compressor is available to HBase:
$ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://host/path/to/hbase snappy
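If the codec is usable, the tool should finish by printing SUCCESS; if the native library cannot be loaded, it typically exits with an exception naming the missing codec. The same check works against a local path, for example (the path is illustrative):
$ hbase org.apache.hadoop.hbase.util.CompressionTest file:///tmp/compressiontest.txt gz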
You can configure a RegionServer so that it will fail to start if compression is configured incorrectly, by adding the option hbase.regionserver.codecs to hbase-site.xml and setting its value to a comma-separated list of codecs that need to be available. For example, if you set this property to lzo,gz, the RegionServer would fail to start if either of those compressors was unavailable. This prevents a new server from being added to the cluster without having codecs configured properly.
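A minimal hbase-site.xml sketch matching the lzo,gz example above:
<!-- hbase-site.xml: refuse to start the RegionServer unless these codecs load -->
<property>
  <name>hbase.regionserver.codecs</name>
  <value>lzo,gz</value>
</property>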
To enable compression for a ColumnFamily, use an alter command. You do not need to re-create the table or copy data. If you are changing codecs, be sure the old codec is still available until all the old StoreFiles have been compacted.
Example E.1. Enabling Compression on a ColumnFamily of an Existing Table using HBase Shell
hbase> disable 'test'
hbase> alter 'test', {NAME => 'cf', COMPRESSION => 'GZ'}
hbase> enable 'test'
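Existing StoreFiles keep their old codec until they are rewritten by compaction. If you want them rewritten with the new codec promptly rather than waiting for normal compactions, you can trigger a major compaction from the shell, as in this sketch for the table above:
hbase> major_compact 'test'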
Example E.2. Creating a New Table with Compression On a ColumnFamily
hbase> create 'test2', { NAME => 'cf2', COMPRESSION => 'SNAPPY' }
Example E.3. Verifying a ColumnFamily's Compression Settings
hbase> describe 'test'
DESCRIPTION                                          ENABLED
 'test', {NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE false
 ', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0',
 VERSIONS => '1', COMPRESSION => 'GZ', MIN_VERSIONS
 => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'fa
 lse', BLOCKSIZE => '65536', IN_MEMORY => 'false', B
 LOCKCACHE => 'true'}
1 row(s) in 0.1070 seconds
HBase includes a tool called LoadTestTool which provides mechanisms to test your compression performance. You must specify either -write or -update-read as your first parameter, and if you do not specify another parameter, usage advice is printed for each option.
Example E.4. LoadTestTool Usage
$ bin/hbase org.apache.hadoop.hbase.util.LoadTestTool -h
usage: bin/hbase org.apache.hadoop.hbase.util.LoadTestTool <options>
Options:
 -batchupdate                 Whether to use batch as opposed to separate
                              updates for every column in a row
 -bloom <arg>                 Bloom filter type, one of [NONE, ROW, ROWCOL]
 -compression <arg>           Compression type, one of [LZO, GZ, NONE, SNAPPY,
                              LZ4]
 -data_block_encoding <arg>   Encoding algorithm (e.g. prefix compression) to
                              use for data blocks in the test column family,
                              one of [NONE, PREFIX, DIFF, FAST_DIFF, PREFIX_TREE].
 -encryption <arg>            Enables transparent encryption on the test table,
                              one of [AES]
 -generator <arg>             The class which generates load for the tool. Any
                              args for this class can be passed as colon
                              separated after class name
 -h,--help                    Show usage
 -in_memory                   Tries to keep the HFiles of the CF inmemory as far
                              as possible. Not guaranteed that reads are always
                              served from inmemory
 -init_only                   Initialize the test table only, don't do any
                              loading
 -key_window <arg>            The 'key window' to maintain between reads and
                              writes for concurrent write/read workload. The
                              default is 0.
 -max_read_errors <arg>       The maximum number of read errors to tolerate
                              before terminating all reader threads. The default
                              is 10.
 -multiput                    Whether to use multi-puts as opposed to separate
                              puts for every column in a row
 -num_keys <arg>              The number of keys to read/write
 -num_tables <arg>            A positive integer number. When a number n is
                              speicfied, load test tool will load n table
                              parallely. -tn parameter value becomes table name
                              prefix. Each table name is in format
                              <tn>_1...<tn>_n
 -read <arg>                  <verify_percent>[:<#threads=20>]
 -regions_per_server <arg>    A positive integer number. When a number n is
                              specified, load test tool will create the test
                              table with n regions per server
 -skip_init                   Skip the initialization; assume test table already
                              exists
 -start_key <arg>             The first key to read/write (a 0-based index). The
                              default value is 0.
 -tn <arg>                    The name of the table to read or write
 -update <arg>                <update_percent>[:<#threads=20>][:<#whether to
                              ignore nonce collisions=0>]
 -write <arg>                 <avg_cols_per_key>:<avg_data_size>[:<#threads=20>]
 -zk <arg>                    ZK quorum as comma-separated host names without
                              port numbers
 -zk_root <arg>               name of parent znode in zookeeper
Example E.5. Example Usage of LoadTestTool
$ hbase org.apache.hadoop.hbase.util.LoadTestTool -write 1:10:100 -num_keys 1000000 -read 100:30 -num_tables 1 -data_block_encoding NONE -tn load_test_tool_NONE
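Per the usage text in Example E.4, -write 1:10:100 here requests an average of 1 column per key with 10-byte values using 100 writer threads, and -read 100:30 verifies 100% of reads using 30 reader threads.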