A {row, column, version} tuple exactly specifies a cell in HBase. It's possible to have an unbounded number of cells where the row and column are the same but the cell address differs only in its version dimension.
While row and column keys are expressed as bytes, the version is specified using a long integer. Typically this long contains time instances such as those returned by java.util.Date.getTime() or System.currentTimeMillis(), that is: “the difference, measured in milliseconds, between the current time and midnight, January 1, 1970 UTC”.
The HBase version dimension is stored in decreasing order, so that when reading from a store file, the most recent values are found first.
There is a lot of confusion over the semantics of cell versions in HBase. In particular:
If multiple writes to a cell have the same version, only the last written is fetchable.
It is OK to write cells in a non-increasing version order.
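The two points above can be illustrated with a small in-memory model. This is only a sketch: the class `VersionedCell` and its methods are hypothetical names invented for illustration, and the code uses a plain `TreeMap` rather than any HBase API. It assumes a cell address behaves like a map from version to value, sorted in decreasing version order.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of a single cell address (row + column); not HBase client code.
class VersionedCell {
    // Versions are kept in decreasing order, so the largest version comes first.
    private final NavigableMap<Long, String> versions =
            new TreeMap<>(Comparator.reverseOrder());

    // A write at an existing version replaces the value stored there.
    void put(long version, String value) {
        versions.put(version, value);
    }

    // Value at the largest version, or null if nothing has been written.
    String latest() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    // Number of distinct versions currently held.
    int versionCount() {
        return versions.size();
    }

    public static void main(String[] args) {
        VersionedCell cell = new VersionedCell();
        cell.put(30L, "c");
        cell.put(10L, "a");   // written later, but with a smaller version: allowed
        cell.put(30L, "c2");  // same version as the first write: replaces "c"
        System.out.println(cell.latest());        // value stored at version 30
        System.out.println(cell.versionCount());  // only two distinct versions remain
    }
}
```

In this model the write at version 10 does not change what `latest()` returns, because the largest version wins regardless of write order.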
Below we describe how the version dimension in HBase currently works. See HBASE-2406 for discussion of HBase versions. "Bending time in HBase" makes for a good read on the version, or time, dimension in HBase. It has more detail on versioning than is provided here. As of this writing, the limitation "Overwriting values at existing timestamps" mentioned in the article no longer holds in HBase. This section is basically a synopsis of this article by Bruno Dumon.
The maximum number of versions to store for a given column is part of the column schema. It is specified at table creation, or later with an alter command, and defaults to HColumnDescriptor.DEFAULT_VERSIONS. Prior to HBase 0.96, the default number of versions kept was 3, but in 0.96 and newer it has been changed to 1.
Example 5.3. Modify the Maximum Number of Versions for a Column
This example uses HBase Shell to keep a maximum of 5 versions of column f1. You could also use HColumnDescriptor.
hbase> alter 't1', NAME => 'f1', VERSIONS => 5
Example 5.4. Modify the Minimum Number of Versions for a Column
You can also specify the minimum number of versions to store. By default, this is set to 0, which means the feature is disabled. The following example sets the minimum number of versions on column f1 to 2, via HBase Shell. You could also use HColumnDescriptor.
hbase> alter 't1', NAME => 'f1', MIN_VERSIONS => 2
Starting with HBase 0.98.2, you can specify a global default for the maximum number of versions kept for all newly-created columns, by setting hbase.column.max.version in hbase-site.xml. See hbase.column.max.version.
In this section we look at the behavior of the version dimension for each of the core HBase operations.
Gets are implemented on top of Scans. The below discussion of Get applies equally to Scans.
By default, i.e. if you specify no explicit version, when doing a get, the cell whose version has the largest value is returned (which may or may not be the latest one written, see later). The default behavior can be modified in the following ways:
to return more than one version, see Get.setMaxVersions()
to return versions other than the latest, see Get.setTimeRange()
To retrieve the latest version that is less than or equal to a given value, thus giving the 'latest' state of the record at a certain point in time, just use a range from 0 to the desired version and set the max versions to 1.
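The "state at a point in time" read described above can be sketched with a sorted map standing in for one cell's versions. This is an illustrative model, not HBase client code: the class `TimeRangeRead` and its `read` method are hypothetical. The upper bound is modeled as exclusive, which is how we understand `Get.setTimeRange()` to behave; check the Javadoc for your HBase version before relying on it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of a time-range read over one cell's versions; not HBase client code.
class TimeRangeRead {
    // Returns up to maxVersions values with minInclusive <= version < maxExclusive,
    // newest first.
    static List<String> read(NavigableMap<Long, String> versions,
                             long minInclusive, long maxExclusive, int maxVersions) {
        List<String> out = new ArrayList<>();
        for (String value : versions
                .subMap(minInclusive, true, maxExclusive, false)
                .descendingMap()
                .values()) {
            if (out.size() == maxVersions) {
                break;
            }
            out.add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> versions = new TreeMap<>();
        versions.put(100L, "v100");
        versions.put(200L, "v200");
        versions.put(300L, "v300");
        // State of the cell as of version 250: the newest version <= 250.
        System.out.println(read(versions, 0L, 251L, 1)); // prints [v200]
    }
}
```

Under the exclusive-upper-bound assumption, a caller who wants the state at version T inclusive passes T + 1 as the upper bound, as the example does.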
The following Get will only retrieve the current version of the row.
public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR = "attr".getBytes();
...
Get get = new Get(Bytes.toBytes("row1"));
Result r = htable.get(get);
byte[] b = r.getValue(CF, ATTR);  // returns current version of value
The following Get will return the last 3 versions of the row.
public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR = "attr".getBytes();
...
Get get = new Get(Bytes.toBytes("row1"));
get.setMaxVersions(3);  // will return last 3 versions of row
Result r = htable.get(get);
byte[] b = r.getValue(CF, ATTR);  // returns current version of value
List<KeyValue> kv = r.getColumn(CF, ATTR);  // returns all versions of this column
Doing a put always creates a new version of a cell, at a certain timestamp. By default the system uses the server's currentTimeMillis, but you can specify the version (= the long integer) yourself, on a per-column level. This means you could assign a time in the past or the future, or use the long value for non-time purposes.
To overwrite an existing value, do a put at exactly the same row, column, and version as that of the cell you would overshadow.
The following Put will be implicitly versioned by HBase with the current time.
public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR = "attr".getBytes();
...
Put put = new Put(Bytes.toBytes(row));
put.add(CF, ATTR, Bytes.toBytes(data));
htable.put(put);
The following Put has the version timestamp explicitly set.
public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR = "attr".getBytes();
...
Put put = new Put(Bytes.toBytes(row));
long explicitTimeInMs = 555;  // just an example
put.add(CF, ATTR, explicitTimeInMs, Bytes.toBytes(data));
htable.put(put);
Caution: the version timestamp is used internally by HBase for things like time-to-live calculations. It's usually best to avoid setting this timestamp yourself. Prefer using a separate timestamp attribute of the row, or making the timestamp a part of the rowkey, or both.
There are three different types of internal delete markers. See Lars Hofhansl's blog post "Scanning in HBase: Prefix Delete Marker" for a discussion of his attempt at adding another.
Delete: for a specific version of a column.
Delete column: for all versions of a column.
Delete family: for all columns of a particular ColumnFamily.
When deleting an entire row, HBase will internally create a tombstone for each ColumnFamily (i.e., not each individual column).
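The scope of the three marker types can be written down as a predicate over a cell's coordinates. This is a sketch under assumptions, not a description of HBase internals: the class `DeleteMarkers`, its `Type` enum, and the `masks` method are hypothetical, and the rule that the column and family markers cover versions less than or equal to the marker's version is our reading of the delete semantics described in this section.

```java
// Toy model of the three delete marker types; not HBase internals.
class DeleteMarkers {
    enum Type { DELETE, DELETE_COLUMN, DELETE_FAMILY }

    // Returns true if a cell at (family, qualifier, version) is masked by a marker
    // of the given type placed at (markerFamily, markerQualifier, markerVersion).
    // markerQualifier is ignored for DELETE_FAMILY.
    static boolean masks(Type type,
                         String markerFamily, String markerQualifier, long markerVersion,
                         String family, String qualifier, long version) {
        if (!markerFamily.equals(family)) {
            return false; // a marker never reaches outside its own ColumnFamily
        }
        switch (type) {
            case DELETE:
                // One specific version of one column.
                return markerQualifier.equals(qualifier) && version == markerVersion;
            case DELETE_COLUMN:
                // Assumed: all versions of one column up to the marker's version.
                return markerQualifier.equals(qualifier) && version <= markerVersion;
            case DELETE_FAMILY:
                // Assumed: all columns of the family, up to the marker's version.
                return version <= markerVersion;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        // A column marker at version 5 masks version 4 of the same column only.
        System.out.println(masks(Type.DELETE_COLUMN, "cf", "a", 5L, "cf", "a", 4L));
        System.out.println(masks(Type.DELETE_COLUMN, "cf", "a", 5L, "cf", "b", 4L));
    }
}
```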
Deletes work by creating tombstone markers. For example, let's suppose we want to delete a row. For this you can specify a version, or else by default the currentTimeMillis is used. What this means is “delete all cells where the version is less than or equal to this version”. HBase never modifies data in place, so for example a delete will not immediately delete (or mark as deleted) the entries in the storage file that correspond to the delete condition. Rather, a so-called tombstone is written, which will mask the deleted values. When HBase does a major compaction, the tombstones are processed to actually remove the dead values, together with the tombstones themselves. If the version you specified when deleting a row is larger than the version of any value in the row, then you can consider the complete row to be deleted.
For an informative discussion on how deletes and versioning interact, see the thread "Put w/ timestamp -> Deleteall -> Put w/ timestamp fails" on the user mailing list.
Also see Section 9.7.7.6, “KeyValue” for more information on the internal KeyValue format.
Delete markers are purged during the next major compaction of the store, unless the KEEP_DELETED_CELLS option is set in the column family. To keep the deletes for a configurable amount of time, you can set the delete TTL via the hbase.hstore.time.to.purge.deletes property in hbase-site.xml. If hbase.hstore.time.to.purge.deletes is not set, or set to 0, all delete markers, including those with timestamps in the future, are purged during the next major compaction. Otherwise, a delete marker with a timestamp in the future is kept until the major compaction which occurs after the time represented by the marker's timestamp plus the value of hbase.hstore.time.to.purge.deletes, in milliseconds.
This behavior represents a fix for an unexpected change that was introduced in HBase 0.94, and was fixed in HBASE-10118. The change has been backported to HBase 0.94 and newer branches.
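The purge rule described above can be restated as a single predicate. This is a simplified sketch, not HBase code: the class `DeleteMarkerPurge` and the method `purgeable` are hypothetical, the parameter `purgeDelayMs` stands in for the value of hbase.hstore.time.to.purge.deletes, and the model ignores KEEP_DELETED_CELLS. The strict comparison at the boundary is an assumption.

```java
// Toy model of the purge rule for delete markers; not HBase internals.
class DeleteMarkerPurge {
    // Returns true if a major compaction running at compactionTimeMs may drop
    // a delete marker whose timestamp is markerTimestampMs.
    static boolean purgeable(long markerTimestampMs, long purgeDelayMs, long compactionTimeMs) {
        if (purgeDelayMs <= 0) {
            // Property unset or 0: purge regardless of the marker's timestamp.
            return true;
        }
        // Otherwise the marker survives until a compaction that runs after
        // (marker timestamp + configured delay). Strictness is assumed here.
        return compactionTimeMs > markerTimestampMs + purgeDelayMs;
    }

    public static void main(String[] args) {
        long now = 1_000L;
        long futureMarker = 2_000L; // marker timestamp lies in the future
        System.out.println(purgeable(futureMarker, 0L, now));      // delay disabled
        System.out.println(purgeable(futureMarker, 500L, now));    // still inside the window
        System.out.println(purgeable(futureMarker, 500L, 2_501L)); // window has passed
    }
}
```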
Deletes mask puts, even puts that happened after the delete was entered. See HBASE-2256. Remember that a delete writes a tombstone, which only disappears after the next major compaction has run. Suppose you do a delete of everything <= T. After this you do a new put with a timestamp <= T. This put, even if it happened after the delete, will be masked by the delete tombstone. Performing the put will not fail, but when you do a get you will notice the put had no effect. Puts will start working again after the major compaction has run. These issues should not be a problem if you use always-increasing versions for new puts to a row. But they can occur even if you do not care about time: just do a delete and a put immediately after each other, and there is some chance they happen within the same millisecond.
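The masking effect can be reproduced with a small in-memory model of one column. This is a sketch under assumptions rather than HBase code: the class `MaskedPutDemo` and its methods are hypothetical, and a single "delete everything <= T" tombstone is the only marker type modeled.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of one column with a "delete everything <= T" tombstone; not HBase internals.
class MaskedPutDemo {
    private final TreeMap<Long, String> cells = new TreeMap<>();
    private Long tombstone = null; // masks every cell whose version is <= this value

    // A put always succeeds, even when the new cell is immediately masked.
    void put(long version, String value) {
        cells.put(version, value);
    }

    // A delete only records a marker; stored cells are left untouched.
    void delete(long version) {
        tombstone = (tombstone == null) ? version : Math.max(tombstone, version);
    }

    // Newest value that is not masked, or null if nothing is visible.
    String get() {
        Map.Entry<Long, String> newest = cells.lastEntry();
        if (newest == null) {
            return null;
        }
        if (tombstone != null && newest.getKey() <= tombstone) {
            return null;
        }
        return newest.getValue();
    }

    // Drops the masked cells together with the tombstone.
    void majorCompact() {
        if (tombstone != null) {
            cells.headMap(tombstone, true).clear();
            tombstone = null;
        }
    }

    public static void main(String[] args) {
        MaskedPutDemo column = new MaskedPutDemo();
        column.put(100L, "old");
        column.delete(200L);              // delete everything <= 200
        column.put(150L, "new");          // issued after the delete, but version <= 200
        System.out.println(column.get()); // prints null: the put is masked
        column.majorCompact();            // tombstone and masked cells are removed
        column.put(150L, "new");
        System.out.println(column.get()); // prints new
    }
}
```

Note that in this model the masked put is discarded by the compaction along with the other dead cells, so the value has to be written again afterwards.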
“...create three cell versions at t1, t2 and t3, with a maximum-versions setting of 2. So when getting all versions, only the values at t2 and t3 will be returned. But if you delete the version at t2 or t3, the one at t1 will appear again. Obviously, once a major compaction has run, such behavior will not be the case anymore...” (See Garbage Collection in Bending time in HBase.)
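The scenario in the quote can be walked through with a toy store that keeps every written cell until a compaction runs. This is only a sketch of the quoted behavior, not HBase code: the class `VersionLimitDemo` and its methods are hypothetical, and the order in which the model's compaction applies deletes and trims versions is an assumption.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

// Toy model of a column with a maximum-versions setting; not HBase internals.
class VersionLimitDemo {
    private final TreeMap<Long, String> cells = new TreeMap<>();
    private final Set<Long> deletedVersions = new HashSet<>();
    private final int maxVersions;

    VersionLimitDemo(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    void put(long version, String value) {
        cells.put(version, value);
    }

    // Marks one specific version as deleted without removing the stored cell.
    void deleteVersion(long version) {
        deletedVersions.add(version);
    }

    // Newest-first values, skipping deleted versions, capped at maxVersions.
    List<String> getAllVersions() {
        List<String> out = new ArrayList<>();
        for (Long version : cells.descendingKeySet()) {
            if (out.size() == maxVersions) {
                break;
            }
            if (!deletedVersions.contains(version)) {
                out.add(cells.get(version));
            }
        }
        return out;
    }

    // Assumed order: apply delete markers first, then keep the newest maxVersions cells.
    void majorCompact() {
        cells.keySet().removeAll(deletedVersions);
        deletedVersions.clear();
        while (cells.size() > maxVersions) {
            cells.pollFirstEntry();
        }
    }

    public static void main(String[] args) {
        VersionLimitDemo column = new VersionLimitDemo(2);
        column.put(1L, "t1");
        column.put(2L, "t2");
        column.put(3L, "t3");
        System.out.println(column.getAllVersions()); // prints [t3, t2]
        column.deleteVersion(3L);
        System.out.println(column.getAllVersions()); // prints [t2, t1]
    }
}
```

If `majorCompact()` is called before the delete in this model, the cell at t1 is physically gone and a later delete of t3 leaves only t2 visible.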