14.6. Schema Design

14.6.1. Number of Column Families

See Section 6.2, “On the number of column families”.

14.6.2. Key and Attribute Lengths

See Section 6.3.3, “Try to minimize row and column sizes”. See also Section 14.6.7.1, “However...” for compression caveats.

14.6.3. Table RegionSize

The regionsize can be set on a per-table basis via setMaxFileSize on HTableDescriptor, for tables that require a different regionsize than the configured default.

See Section 17.9.2, “Determining region count and size” for more information.
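As a rough sketch, assuming a 0.96-era Java client (the table name, column family name, and the 10 GB figure below are illustrative only), the per-table setting might look like this:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class RegionSizeExample {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    HTableDescriptor table = new HTableDescriptor(TableName.valueOf("mytable"));
    table.addFamily(new HColumnDescriptor("colfam1"));
    // Split regions of this table at roughly 10 GB, regardless of the
    // cluster-wide hbase.hregion.max.filesize default.
    table.setMaxFileSize(10L * 1024 * 1024 * 1024);
    admin.createTable(table);
    admin.close();
  }
}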

14.6.4. Bloom Filters

A Bloom filter, named for its creator, Burton Howard Bloom, is a data structure which is designed to predict whether a given element is a member of a set of data. A positive result from a Bloom filter is not always accurate, but a negative result is guaranteed to be accurate. Bloom filters are designed to be "accurate enough" for sets of data which are so large that conventional hashing mechanisms would be impractical. For more information about Bloom filters in general, refer to http://en.wikipedia.org/wiki/Bloom_filter.
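To make that asymmetry concrete, the following toy sketch (illustrative only; this is not the implementation HBase uses) sets a few bit positions per added key and reports "possibly present" only if all of those bits are set:

import java.util.BitSet;

// Toy Bloom filter for illustration only.
public class ToyBloomFilter {
  private final BitSet bits;
  private final int size;
  private final int hashes;

  public ToyBloomFilter(int size, int hashes) {
    this.bits = new BitSet(size);
    this.size = size;
    this.hashes = hashes;
  }

  // Simple hash mixing; a real implementation uses better hash functions.
  private int index(String key, int seed) {
    return Math.abs((key.hashCode() * 31 + seed * 17) % size);
  }

  public void add(String key) {
    for (int i = 0; i < hashes; i++) bits.set(index(key, i));
  }

  // true  => possibly present (may be a false positive)
  // false => definitely absent (never a false negative)
  public boolean mightContain(String key) {
    for (int i = 0; i < hashes; i++) {
      if (!bits.get(index(key, i))) return false;
    }
    return true;
  }
}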

In terms of HBase, Bloom filters provide a lightweight in-memory structure to reduce the number of disk reads for a given Get operation (Bloom filters do not work with Scans) to only the StoreFiles likely to contain the desired Row. The potential performance gain increases with the number of parallel reads.

The Bloom filters themselves are stored in the metadata of each HFile and never need to be updated. When an HFile is opened because a region is deployed to a RegionServer, the Bloom filter is loaded into memory.

HBase includes some tuning mechanisms for folding the Bloom filter to reduce the size and keep the false positive rate within a desired range.

Bloom filters were introduced in HBASE-1200. Since HBase 0.96, row-based Bloom filters are enabled by default. (HBASE-)

For more information on Bloom filters in relation to HBase, see Section 14.9.9, “Bloom Filters”, or the following Quora discussion: How are bloom filters used in HBase?

14.6.4.1. When To Use Bloom Filters

Since HBase 0.96, row-based Bloom filters are enabled by default. You may choose to disable them or to change some tables to use row+column Bloom filters, depending on the characteristics of your data and how it is loaded into HBase.

To determine whether Bloom filters could have a positive impact, check the value of blockCacheHitRatio in the RegionServer metrics. If Bloom filters are enabled, the value of blockCacheHitRatio should increase, because the Bloom filter is filtering out blocks that are definitely not needed.

You can choose to enable Bloom filters for a row or for a row+column combination. If you generally scan entire rows, the row+column combination will not provide any benefit. A row-based Bloom filter can operate on a row+column Get, but not the other way around. However, if you have a large number of column-level Puts, such that a row may be present in every StoreFile, a row-based filter will always return a positive result and provide no benefit. Unless you have one column per row, row+column Bloom filters require more space, in order to store more keys. Bloom filters work best when each data entry is at least a few kilobytes in size.

Overhead will be reduced when your data is stored in a few larger StoreFiles, to avoid extra disk IO during low-level scans to find a specific row.

Bloom filters need to be rebuilt upon deletion, so may not be appropriate in environments with a large number of deletions.

14.6.4.2. Enabling Bloom Filters

Bloom filters are enabled on a Column Family. You can do this in the HBase Shell or by using the setBloomFilterType method of HColumnDescriptor in the HBase API. Valid values are NONE, ROW (the default since HBase 0.96), or ROWCOL. See Section 14.6.4.1, “When To Use Bloom Filters” for more information on ROW versus ROWCOL. See also the API documentation for HColumnDescriptor.

The following example creates a table and enables a ROWCOL Bloom filter on the colfam1 column family.

hbase> create 'mytable',{NAME => 'colfam1', BLOOMFILTER => 'ROWCOL'}
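The same setting via the Java API might look roughly like the following sketch (the family name and helper method are illustrative; the BloomType package shown assumes a 0.96-era client):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.regionserver.BloomType;

public class BloomFilterExample {
  // Returns a column family descriptor with a ROWCOL Bloom filter enabled.
  // Add the result to an HTableDescriptor when creating or altering the table.
  static HColumnDescriptor rowColBloomFamily(String familyName) {
    HColumnDescriptor family = new HColumnDescriptor(familyName);
    family.setBloomFilterType(BloomType.ROWCOL);
    return family;
  }
}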

14.6.4.3. Configuring Server-Wide Behavior of Bloom Filters

You can configure the following settings in hbase-site.xml.

io.hfile.bloom.enabled (default: yes)
    Set to no to kill bloom filters server-wide if something goes wrong.

io.hfile.bloom.error.rate (default: .01)
    The average false positive rate for bloom filters. Folding is used to maintain the false positive rate. Expressed as a decimal representation of a percentage.

io.hfile.bloom.max.fold (default: 7)
    The guaranteed maximum fold rate. Changing this setting should not be necessary and is not recommended.

io.storefile.bloom.max.keys (default: 128000000)
    For default (single-block) Bloom filters, this specifies the maximum number of keys.

io.storefile.delete.family.bloom.enabled (default: true)
    Master switch to enable Delete Family Bloom filters and store them in the StoreFile.

io.storefile.bloom.block.size (default: 65536)
    Target Bloom block size. Bloom filter blocks of approximately this size are interleaved with data blocks.

hfile.block.bloom.cacheonwrite (default: false)
    Enables cache-on-write for inline blocks of a compound Bloom filter.

14.6.5. ColumnFamily BlockSize

The blocksize can be configured for each ColumnFamily in a table, and this defaults to 64k. Larger cell values require larger blocksizes. There is an inverse relationship between blocksize and the resulting StoreFile indexes (i.e., if the blocksize is doubled then the resulting indexes should be roughly halved).

See HColumnDescriptor and Section 9.7.7, “Store” for more information.
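As a hedged sketch of the API route (the 128 KB value and the names below are illustrative; larger cells generally call for a larger blocksize):

import org.apache.hadoop.hbase.HColumnDescriptor;

public class BlockSizeExample {
  // Returns a column family descriptor using a 128 KB block size instead of
  // the 64 KB default. Add the result to an HTableDescriptor as usual.
  static HColumnDescriptor largeBlockFamily(String familyName) {
    HColumnDescriptor family = new HColumnDescriptor(familyName);
    family.setBlocksize(128 * 1024);
    return family;
  }
}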

14.6.6. In-Memory ColumnFamilies

ColumnFamilies can optionally be defined as in-memory. Data is still persisted to disk, just like any other ColumnFamily. In-memory blocks have the highest priority in the block cache (see Section 9.6.4, “Block Cache”), but there is no guarantee that the entire table will be in memory.

See HColumnDescriptor for more information.
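A minimal sketch of flagging a family as in-memory via the Java API (the names below are illustrative):

import org.apache.hadoop.hbase.HColumnDescriptor;

public class InMemoryExample {
  // Returns a column family descriptor whose blocks get the highest
  // block-cache priority; data is still written to disk as usual.
  static HColumnDescriptor inMemoryFamily(String familyName) {
    HColumnDescriptor family = new HColumnDescriptor(familyName);
    family.setInMemory(true);
    return family;
  }
}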

14.6.7. Compression

Production systems should use compression with their ColumnFamily definitions. See Appendix E, Compression and Data Block Encoding In HBase for more information.
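As an illustrative sketch (assuming a 0.96-era client and that the Snappy codec is installed on the cluster; substitute another algorithm as appropriate):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.io.compress.Compression;

public class CompressionExample {
  // Returns a column family descriptor with Snappy compression enabled.
  static HColumnDescriptor snappyFamily(String familyName) {
    HColumnDescriptor family = new HColumnDescriptor(familyName);
    family.setCompressionType(Compression.Algorithm.SNAPPY);
    return family;
  }
}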

14.6.7.1. However...

Compression deflates data on disk. When it's in-memory (e.g., in the MemStore) or on the wire (e.g., transferring between RegionServer and Client) it's inflated. So while using ColumnFamily compression is a best practice, it's not going to completely eliminate the impact of over-sized Keys, over-sized ColumnFamily names, or over-sized Column names.

See Section 6.3.3, “Try to minimize row and column sizes” for schema design tips, and Section 9.7.7.6, “KeyValue” for more information on how HBase stores data internally.
