Appendix E. Compression and Data Block Encoding In HBase

Table of Contents

E.1. Which Compressor or Data Block Encoder To Use
E.2. Making use of Hadoop Native Libraries in HBase
E.3. Compressor Configuration, Installation, and Use
E.3.1. Configure HBase For Compressors
E.3.2. Enable Compression On a ColumnFamily
E.3.3. Testing Compression Performance
E.4. Enable Data Block Encoding

Note

Codecs mentioned in this section are for encoding and decoding data blocks or row keys. For information about replication codecs, see Preserving Tags During Replication.

Some of the information in this section is pulled from a discussion on the HBase Development mailing list.

HBase supports several different compression algorithms which can be enabled on a ColumnFamily. Data block encoding attempts to limit duplication of information in keys, taking advantage of some of the fundamental designs and patterns of HBase, such as sorted row keys and the schema of a given table. Compressors reduce the size of large, opaque byte arrays in cells, and can significantly reduce the storage space that would otherwise be needed to store the data uncompressed.

Compressors and data block encoding can be used together on the same ColumnFamily.
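
For example, the following HBase shell command (a minimal sketch; the table name 'test' and ColumnFamily name 'cf' are illustrative) enables both Snappy compression and FAST_DIFF data block encoding on the same ColumnFamily:

  hbase> alter 'test', { NAME => 'cf', COMPRESSION => 'SNAPPY', DATA_BLOCK_ENCODING => 'FAST_DIFF' }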

Changes Take Effect Upon Compaction. If you change compression or encoding for a ColumnFamily, the change applies to new StoreFiles as they are written; existing data is rewritten with the new settings during compaction.
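
To rewrite existing StoreFiles with the new settings without waiting for a scheduled compaction, you can trigger a major compaction from the HBase shell (the table name is illustrative):

  hbase> major_compact 'test'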

Some codecs take advantage of capabilities built into Java, such as GZip compression. Others rely on native libraries. Native libraries may be available as part of Hadoop, such as LZ4; in this case, HBase only needs access to the appropriate shared library. Other codecs, such as Google Snappy, need to be installed first. Some codecs, such as LZO, are licensed in ways that conflict with HBase's license and cannot be shipped as part of HBase.
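
If you are relying on Hadoop's native libraries, one way to check which codec libraries your Hadoop installation can actually load (available in Hadoop 2.x and later) is:

  $ hadoop checknative -a

The output lists each library (zlib, snappy, lz4, and so on) along with whether a native implementation was found on that node.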

This section discusses common codecs that are used and tested with HBase. No matter what codec you use, be sure to test that it is installed correctly and is available on all nodes in your cluster. Extra operational steps may be necessary to ensure that codecs are available on newly-deployed nodes. You can use the CompressionTest utility (Section E.3.1.6, “CompressionTest”) to check that a given codec is correctly installed.
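
For example, to verify on a single node that the Snappy codec is correctly installed, you might run the following (the file path is illustrative; the utility writes and re-reads a small test file at that location):

  $ hbase org.apache.hadoop.hbase.util.CompressionTest file:///tmp/compressiontest.txt snappy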

To configure HBase to use a compressor, see Section E.3.1, “Configure HBase For Compressors”. To enable a compressor for a ColumnFamily, see Section E.3.2, “Enable Compression On a ColumnFamily”. To enable data block encoding for a ColumnFamily, see Section E.4, “Enable Data Block Encoding”.

Block Compressors

  • none
  • Snappy
  • LZO
  • LZ4
  • GZ

Data Block Encoding Types

  • Prefix - Stores the length of the prefix shared with the previous key, followed by only the part of the key that differs.
  • Diff - Expands on Prefix encoding by splitting the key into fields and compressing each field separately; disabled by default.
  • Fast Diff - Works similarly to Diff but uses a faster implementation; recommended if you have long keys or many columns.
  • Prefix Tree - Introduced as an experimental feature in HBase 0.96; provides similar savings to Prefix, Diff, and Fast Diff, with faster random seeks at the cost of slower encoding speed.

E.1. Which Compressor or Data Block Encoder To Use

The compression or codec type to use depends on the characteristics of your data. Choosing the wrong type could cause your data to take more space rather than less, and can have performance implications. In general, you need to weigh your options between smaller size and faster compression/decompression. Following are some general guidelines, expanded from a discussion at Documenting Guidance on compression and codecs.

  • If you have long keys (compared to the values) or many columns, use a prefix encoder. FAST_DIFF is recommended, as Prefix Tree encoding still needs more testing (see the shell example following this list).

  • If the values are large (and not already compressed, as images are), use a data block compressor.

  • Use GZIP for cold data, which is accessed infrequently. GZIP compression uses more CPU resources than Snappy or LZO, but provides a higher compression ratio.

  • Use Snappy or LZO for hot data, which is accessed frequently. Snappy and LZO use fewer CPU resources than GZIP, but do not provide as high of a compression ratio.

  • In most cases, enabling Snappy or LZO by default is a good choice, because they have a low performance overhead and provide space savings.

  • Before Google made Snappy available in 2011, LZO was the default. Snappy has similar qualities to LZO but has been shown to perform better.
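
As a concrete sketch of these guidelines (table and ColumnFamily names are illustrative), the following HBase shell commands apply FAST_DIFF encoding to a table with long keys and GZIP compression to an infrequently-accessed archive table:

  hbase> alter 'mytable', { NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST_DIFF' }
  hbase> alter 'archive', { NAME => 'cf', COMPRESSION => 'GZ' }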
