This appendix describes the evolution of the HFile format.
As we will be discussing changes to the HFile format, it is useful to give a short overview of the original (HFile version 1) format.
An HFile in version 1 format is structured as follows:
The block index in version 1 is very straightforward. For each entry, it contains:
Offset (long)
Uncompressed size (int)
Key (a serialized byte array written using Bytes.writeByteArray), which consists of the key length as a variable-length integer (VInt) followed by the key bytes
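To make the entry layout concrete, here is a minimal sketch of decoding a single version 1 block index entry from a byte buffer. It is not HBase's own reader: the class `V1IndexEntrySketch` and its methods are hypothetical names, and the sketch assumes that the VInt uses the same encoding as Hadoop's `WritableUtils.writeVInt` and that fixed-width fields are big-endian, as written by `java.io.DataOutput`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical holder for one version 1 block index entry (illustration only). */
final class V1IndexEntrySketch {
    final long offset;          // file offset of the block
    final int uncompressedSize; // size of the block after decompression
    final byte[] key;           // first key of the block

    V1IndexEntrySketch(long offset, int uncompressedSize, byte[] key) {
        this.offset = offset;
        this.uncompressedSize = uncompressedSize;
        this.key = key;
    }

    /** Decodes a VInt, assuming the Hadoop WritableUtils encoding. */
    static int readVInt(ByteBuffer in) {
        byte first = in.get();
        if (first >= -112) {
            return first; // values in [-112, 127] occupy a single byte
        }
        // Otherwise the first byte carries the sign and the count of following bytes.
        boolean negative = first < -120;
        int extraBytes = negative ? -(first + 120) : -(first + 112);
        long value = 0;
        for (int i = 0; i < extraBytes; i++) {
            value = (value << 8) | (in.get() & 0xFF);
        }
        return (int) (negative ? ~value : value);
    }

    /** Reads offset (long), uncompressed size (int), then the VInt-prefixed key. */
    static V1IndexEntrySketch read(ByteBuffer in) {
        long offset = in.getLong();
        int uncompressedSize = in.getInt();
        int keyLength = readVInt(in);
        byte[] key = new byte[keyLength];
        in.get(key);
        return new V1IndexEntrySketch(offset, uncompressedSize, key);
    }

    public static void main(String[] args) {
        // Build one entry by hand; the key is short enough for a one-byte VInt.
        byte[] key = "row-0001".getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(8 + 4 + 1 + key.length);
        buf.putLong(4096L).putInt(65536).put((byte) key.length).put(key);
        buf.flip();

        V1IndexEntrySketch entry = read(buf);
        System.out.println(entry.offset + " " + entry.uncompressedSize + " "
            + new String(entry.key, StandardCharsets.UTF_8));
    }
}
```

Note that nothing in the entry marks the end of the index, which is why the reader needs the entry count from elsewhere.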
The number of entries in the block index is stored in the fixed file trailer, and must be passed to the method that reads the block index. One limitation of the version 1 block index is that it does not record the compressed size of a block, which turns out to be necessary for decompression. The HFile reader therefore has to infer the compressed size from the difference between the offsets of consecutive blocks. Version 2 fixes this limitation: the block index stores the on-disk block size instead of the uncompressed size, and the uncompressed size is read from the block header.
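The offset-difference inference can be sketched as follows. This is an illustration rather than the actual reader code: `V1CompressedSizeSketch`, `inferOnDiskSizes`, and the `endOfDataOffset` parameter are hypothetical, and the sketch assumes that blocks are laid out contiguously in offset order and that the offset at which the last block ends is known from elsewhere in the file.

```java
import java.util.Arrays;

/** Hypothetical helper showing how on-disk sizes can be derived from offsets. */
final class V1CompressedSizeSketch {

    /**
     * Returns the inferred on-disk (compressed) size of each block.
     * Assumes blockOffsets is sorted ascending and blocks are contiguous;
     * endOfDataOffset is the assumed offset just past the last block.
     */
    static long[] inferOnDiskSizes(long[] blockOffsets, long endOfDataOffset) {
        long[] sizes = new long[blockOffsets.length];
        for (int i = 0; i < blockOffsets.length; i++) {
            // A block ends where the next block begins; the last one ends at endOfDataOffset.
            long next = (i + 1 < blockOffsets.length) ? blockOffsets[i + 1] : endOfDataOffset;
            sizes[i] = next - blockOffsets[i];
        }
        return sizes;
    }

    public static void main(String[] args) {
        long[] offsets = {0L, 1200L, 2950L};
        long[] sizes = inferOnDiskSizes(offsets, 4000L);
        System.out.println(Arrays.toString(sizes)); // prints [1200, 1750, 1050]
    }
}
```

Storing the on-disk size directly in each index entry removes the dependency on the neighbouring entry, so a block can be read and decompressed using its own index entry alone.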