H.3. HBase File Format with Security Enhancements (version 3)

Note: this feature was introduced in HBase 0.98.

H.3.1. Motivation

Version 3 of HFile makes changes needed to ease management of encryption at rest and cell-level metadata (which in turn is needed for cell-level ACLs and cell-level visibility labels). For more information see Section 8.3.4, “Transparent Encryption of Data At Rest”, Section 8.3.1, “Tags”, Section 8.3.2, “Access Control Labels (ACLs)”, and ???.

H.3.2. Overview

The version of HBase introducing the above features reads HFiles in versions 1, 2, and 3 but only writes version 3 HFiles. Version 3 HFiles are structured the same as version 2 HFiles. For more information see Section H.2.2, “Overview of Version 2”.

H.3.3. File Info Block in Version 3

Version 3 adds two entries to the reserved keys in the file info block.

hfile.MAX_TAGS_LEN

The maximum number of bytes needed to store the serialized tags for any single cell in this HFile (int).

hfile.TAGS_COMPRESSED

Whether the block encoder for this HFile compresses tags (boolean).

This key should only be present if hfile.MAX_TAGS_LEN is also present.

When reading a Version 3 HFile, the presence of MAX_TAGS_LEN determines how the cells within a data block are deserialized. Therefore, consumers must read the file's info block before reading any data blocks.
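As a rough illustration of this read-side check, the sketch below inspects an already-decoded file info map and reports whether cells carry a tags length. The function name read_tag_settings and the dictionary representation are hypothetical. The value encodings (a 4-byte big-endian int, and a single byte where zero means false) are assumptions made for the example, not something this section specifies.

```python
import struct

MAX_TAGS_LEN = b"hfile.MAX_TAGS_LEN"
TAGS_COMPRESSED = b"hfile.TAGS_COMPRESSED"


def read_tag_settings(file_info):
    """Return (cells_have_tags_length, max_tags_len, tags_compressed).

    file_info is a hypothetical dict mapping reserved key bytes to raw
    value bytes, as a reader might build it from the file info block.
    """
    raw_max = file_info.get(MAX_TAGS_LEN)
    raw_compressed = file_info.get(TAGS_COMPRESSED)

    if raw_max is None:
        # TAGS_COMPRESSED should only appear alongside MAX_TAGS_LEN.
        if raw_compressed is not None:
            raise ValueError(
                "hfile.TAGS_COMPRESSED present without hfile.MAX_TAGS_LEN")
        return False, None, False

    # Assumption: the int is stored as 4 bytes, big-endian, signed.
    (max_tags_len,) = struct.unpack(">i", raw_max)
    # Assumption: the boolean is one byte, where zero means false.
    tags_compressed = raw_compressed is not None and raw_compressed != b"\x00"
    return True, max_tags_len, tags_compressed
```

The first element of the result is the flag a reader would carry into data block parsing. Note that it is true whenever the key is present, even when the stored maximum is zero.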

When writing a Version 3 HFile, HBase always includes MAX_TAGS_LEN when flushing the memstore to the underlying filesystem and when using prefix tree encoding for data blocks, as described in Appendix E, Compression and Data Block Encoding In HBase. When compacting existing files, the default writer omits MAX_TAGS_LEN if none of the selected files contain any cells with tags. See Section 9.7.7.7, “Compaction” for details on the compaction file selection algorithm.
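The write-side rule above can be summarized as a small decision function. This is only a sketch of the rule as stated in this section, not HBase's writer code: the function name, the operation strings, and the per-file boolean list are all hypothetical, and the sketch assumes that either a flush or prefix tree encoding is sufficient on its own to force the key.

```python
def writes_max_tags_len(operation, prefix_tree_encoding, inputs_have_tags=()):
    """Decide whether a new version 3 HFile gets hfile.MAX_TAGS_LEN.

    operation            -- "flush" or "compaction" (hypothetical labels)
    prefix_tree_encoding -- True if data blocks use prefix tree encoding
    inputs_have_tags     -- for compactions, one boolean per selected
                            file: does that file contain tagged cells?
    """
    if operation not in ("flush", "compaction"):
        raise ValueError("unknown operation: %r" % (operation,))

    # Flushes and prefix-tree-encoded files always include the key.
    if operation == "flush" or prefix_tree_encoding:
        return True

    # Compaction: omit the key only if no selected file has tagged cells.
    return any(inputs_have_tags)
```

For example, a compaction over three files where only the last contains tagged cells would still write the key, because any(...) is true for [False, False, True].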

H.3.4. Data Blocks in Version 3

Within an HFile, HBase cells are stored in data blocks as a sequence of KeyValues (see Section H.1.1, “Overview of Version 1”, or Lars George's excellent introduction to HBase Storage). In version 3, each KeyValue may optionally include a set of zero or more tags:

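+---------------+--------------------------------+-----------------------------+
| Version 1 & 2 | Version 3 without MAX_TAGS_LEN | Version 3 with MAX_TAGS_LEN |
+---------------+--------------------------------+-----------------------------+
| Key Length (4 bytes)                                                         |
| Value Length (4 bytes)                                                       |
| Key bytes (variable)                                                         |
| Value bytes (variable)                                                       |
+------------------------------------------------+-----------------------------+
|                                                | Tags Length (2 bytes)       |
|                                                | Tags bytes (variable)       |
+------------------------------------------------+-----------------------------+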

If the info block for a given HFile contains an entry for MAX_TAGS_LEN, each cell will have the length of that cell's tags included, even if that length is zero. The actual tags are stored as a sequence of tag length (2 bytes), tag type (1 byte), tag bytes (variable). The format of an individual tag's bytes depends on the tag type.
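To make the layout concrete, here is a minimal sketch that decodes one cell laid out as in the table above and splits its tags. It is illustrative only: parse_cell and parse_tags are hypothetical names, all integers are assumed to be big-endian, and each tag's 2-byte length is assumed to count the type byte plus the tag bytes. The sketch also ignores block headers, data block encoding, and any per-cell fields not listed in the table.

```python
import struct


def parse_tags(tags_bytes):
    """Split a serialized tags region into (tag_type, tag_payload) pairs."""
    tags = []
    pos = 0
    while pos < len(tags_bytes):
        (tag_len,) = struct.unpack_from(">H", tags_bytes, pos)
        # Assumption: tag_len covers the 1-byte type plus the payload.
        end = pos + 2 + tag_len
        if tag_len < 1 or end > len(tags_bytes):
            raise ValueError("malformed tag at offset %d" % pos)
        tag_type = tags_bytes[pos + 2]
        payload = bytes(tags_bytes[pos + 3:end])
        tags.append((tag_type, payload))
        pos = end
    return tags


def parse_cell(buf, offset, has_tags_length):
    """Decode one cell starting at offset.

    has_tags_length should be True only when the file info block
    contains MAX_TAGS_LEN. Returns (key, value, tags, next_offset).
    """
    key_len, value_len = struct.unpack_from(">ii", buf, offset)
    pos = offset + 8
    key = bytes(buf[pos:pos + key_len])
    pos += key_len
    value = bytes(buf[pos:pos + value_len])
    pos += value_len

    tags = []
    if has_tags_length:
        # The 2-byte tags length is present even when it is zero.
        (tags_len,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        tags = parse_tags(buf[pos:pos + tags_len])
        pos += tags_len
    return key, value, tags, pos
```

Passing the wrong has_tags_length value would misalign every subsequent cell, which is why the flag has to come from the file info block rather than being guessed per cell.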

Note that this dependence on the contents of the info block means that you must process a file's info block before reading any data blocks. It also means that before writing a data block you must know whether the file's info block will include MAX_TAGS_LEN.

H.3.5. Fixed File Trailer in Version 3

The fixed file trailers written with HFile version 3 are always serialized with protocol buffers. Version 3 also adds an optional field named encryption_key to the version 2 protocol buffer. If HBase is configured to encrypt HFiles, this field stores a data encryption key for this particular HFile, encrypted with the current cluster master key using AES. For more information see Section 8.3.4, “Transparent Encryption of Data At Rest”.
