Note: this feature was introduced in HBase 0.98.
Version 3 of HFile makes changes needed to ease management of encryption at rest and cell-level metadata (which in turn is needed for cell-level ACLs and cell-level visibility labels). For more information see Section 8.3.4, “Transparent Encryption of Data At Rest”, Section 8.3.1, “Tags”, Section 8.3.2, “Access Control Labels (ACLs)”, and ???.
The version of HBase introducing the above features reads HFiles in versions 1, 2, and 3 but only writes version 3 HFiles. Version 3 HFiles are structured the same as version 2 HFiles. For more information see Section H.2.2, “Overview of Version 2”.
Version 3 adds two entries to the reserved keys in the file info block:
hfile.MAX_TAGS_LEN | The maximum number of bytes needed to store the serialized tags for any single cell in this hfile (int)
hfile.TAGS_COMPRESSED | Does the block encoder for this hfile compress tags? (boolean) Should only be present if
When reading a Version 3 HFile the presence of MAX_TAGS_LEN is used to determine how to deserialize the cells within a data block. Therefore, consumers must read the file's info block prior to reading any data blocks.
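The check a reader makes before touching any data block can be sketched as follows. This is a minimal illustration, not HBase's reader code: the class `FileInfoTagsCheck` and its methods are hypothetical, the file info block is modelled as a plain `Map<String, byte[]>`, and the (int) value is assumed to be stored as 4 big-endian bytes.

```java
import java.nio.ByteBuffer;
import java.util.Map;

/** Hypothetical helper for illustration; not an HBase class. */
final class FileInfoTagsCheck {
    static final String MAX_TAGS_LEN = "hfile.MAX_TAGS_LEN";

    /** True when cells in the file's data blocks carry a tags-length field. */
    static boolean cellsCarryTagsLength(Map<String, byte[]> fileInfo) {
        // Only the presence of the key matters for choosing the cell layout.
        return fileInfo.containsKey(MAX_TAGS_LEN);
    }

    /** Decodes the (int) value; assumes a 4-byte big-endian encoding. */
    static int maxTagsLength(Map<String, byte[]> fileInfo) {
        byte[] raw = fileInfo.get(MAX_TAGS_LEN);
        if (raw == null || raw.length != Integer.BYTES) {
            throw new IllegalArgumentException("MAX_TAGS_LEN missing or malformed");
        }
        return ByteBuffer.wrap(raw).getInt();
    }
}
```

A reader would call `cellsCarryTagsLength` once per file and reuse the result for every data block in that file.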
When writing a Version 3 HFile, HBase will always include MAX_TAGS_LEN when flushing the memstore to the underlying filesystem and when using prefix tree encoding for data blocks, as described in Appendix E, Compression and Data Block Encoding In HBase. When compacting extant files, the default writer will omit MAX_TAGS_LEN if none of the selected files contain any cells with tags. See Section 9.7.7.7, “Compaction” for details on the compaction file selection algorithm.
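The writer-side rule described above can be summarised as a small decision function. This is a sketch of the rule as stated in the text, not the actual HBase writer: `TagsWriterPolicy`, `WriteReason`, and the boolean-per-input-file list are all hypothetical names introduced for illustration.

```java
import java.util.List;

/** Hypothetical restatement of the rule in the text; not an HBase class. */
final class TagsWriterPolicy {
    enum WriteReason { MEMSTORE_FLUSH, COMPACTION }

    /**
     * @param reason               why the new HFile is being written
     * @param prefixTreeEncoding   whether data blocks use prefix tree encoding
     * @param selectedFilesHaveTags for a compaction, one flag per selected input
     *                             file: does it contain any cell with tags?
     */
    static boolean shouldWriteMaxTagsLen(WriteReason reason,
                                         boolean prefixTreeEncoding,
                                         List<Boolean> selectedFilesHaveTags) {
        // Flushes and prefix tree encoding always include MAX_TAGS_LEN.
        if (reason == WriteReason.MEMSTORE_FLUSH || prefixTreeEncoding) {
            return true;
        }
        // Compaction: omit the key only when no selected file has tagged cells.
        return selectedFilesHaveTags.contains(Boolean.TRUE);
    }
}
```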
Within an HFile, HBase cells are stored in data blocks as a sequence of KeyValues (see Section H.1.1, “Overview of Version 1”, or Lars George's excellent introduction to HBase Storage). In version 3, these KeyValues may optionally include a set of zero or more tags:
Version 1 & 2, Version 3 without MAX_TAGS_LEN | Version 3 with MAX_TAGS_LEN
Key Length (4 bytes) | Key Length (4 bytes)
Value Length (4 bytes) | Value Length (4 bytes)
Key bytes (variable) | Key bytes (variable)
Value bytes (variable) | Value bytes (variable)
(not present) | Tags Length (2 bytes)
(not present) | Tags bytes (variable)
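The two layouts in the table can be read with a single routine that branches on whether the file's info block contained MAX_TAGS_LEN. The sketch below follows only the fields listed in the table; it ignores block headers, data block encoding, and any other per-cell data a real reader handles. `CellLayoutSketch` and `Cell` are hypothetical names, and lengths are assumed to be big-endian.

```java
import java.nio.ByteBuffer;

/** Hypothetical reader for the layouts in the table; not HBase's KeyValue. */
final class CellLayoutSketch {
    static final class Cell {
        final byte[] key;
        final byte[] value;
        final byte[] tags; // empty when the file has no MAX_TAGS_LEN entry

        Cell(byte[] key, byte[] value, byte[] tags) {
            this.key = key;
            this.value = value;
            this.tags = tags;
        }
    }

    /** Reads one cell at the buffer's current position and advances past it. */
    static Cell readCell(ByteBuffer in, boolean fileHasMaxTagsLen) {
        int keyLength = in.getInt();   // Key Length (4 bytes)
        int valueLength = in.getInt(); // Value Length (4 bytes)
        byte[] key = new byte[keyLength];
        in.get(key);
        byte[] value = new byte[valueLength];
        in.get(value);

        byte[] tags = new byte[0];
        if (fileHasMaxTagsLen) {
            // Tags Length (2 bytes), read as unsigned; present even when zero.
            int tagsLength = in.getShort() & 0xFFFF;
            tags = new byte[tagsLength];
            in.get(tags);
        }
        return new Cell(key, value, tags);
    }
}
```

Passing the wrong flag misaligns every subsequent cell in the block, which is why the flag has to come from the info block rather than from inspection of the data.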
If the info block for a given HFile contains an entry for MAX_TAGS_LEN, each cell will have the length of that cell's tags included, even if that length is zero. The actual tags are stored as a sequence of tag length (2 bytes), tag type (1 byte), tag bytes (variable). The format of an individual tag's bytes depends on the tag type.
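Splitting a cell's tags bytes into individual tags might look like the sketch below. One detail is an assumption not stated in the text: the 2-byte tag length is taken to cover the type byte plus the tag bytes, so the payload is `tagLength - 1` bytes long. `TagListSketch` and `Tag` are hypothetical names, and the payload is left uninterpreted because its format depends on the tag type.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tag splitter for illustration; not HBase's Tag class. */
final class TagListSketch {
    static final class Tag {
        final byte type;
        final byte[] value; // opaque here; its format depends on the type

        Tag(byte type, byte[] value) {
            this.type = type;
            this.value = value;
        }
    }

    /** Splits the "Tags bytes" of one cell into individual tags. */
    static List<Tag> parseTags(byte[] tagsBytes) {
        ByteBuffer in = ByteBuffer.wrap(tagsBytes);
        List<Tag> tags = new ArrayList<>();
        while (in.hasRemaining()) {
            // ASSUMPTION: the length counts the type byte plus the payload.
            int tagLength = in.getShort() & 0xFFFF;
            if (tagLength < 1 || tagLength > in.remaining()) {
                throw new IllegalArgumentException("corrupt tag length: " + tagLength);
            }
            byte type = in.get();
            byte[] value = new byte[tagLength - 1];
            in.get(value);
            tags.add(new Tag(type, value));
        }
        return tags;
    }
}
```

An empty input yields an empty list, which matches a cell whose tags length is zero.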
Note that the dependence on the contents of the info block implies that prior to reading any data blocks you must first process a file's info block. It also implies that prior to writing a data block you must know if the file's info block will include MAX_TAGS_LEN.
The fixed file trailers written with HFile version 3 are always serialized with protocol buffers. Additionally, version 3 adds an optional field named encryption_key to the version 2 protocol buffer. If HBase is configured to encrypt HFiles, this field will store a data encryption key for this particular HFile, encrypted with the current cluster master key using AES. For more information see Section 8.3.4, “Transparent Encryption of Data At Rest”.