In HBase, data is stored in tables, which have rows and columns. This is a terminology overlap with relational databases (RDBMSs), but this is not a helpful analogy. Instead, it can be helpful to think of an HBase table as a multi-dimensional map.
HBase Data Model Terminology
An HBase table consists of multiple rows.
A row in HBase consists of a row key and one or more columns with values associated with them. Rows are stored sorted lexicographically by row key. For this reason, the design of the row key is very important. The goal is to store data in such a way that related rows are near each other. A common row key pattern is a website domain. If your row keys are domains, you should probably store them in reverse (org.apache.www, org.apache.mail, org.apache.jira). This way, all of the Apache domains are near each other in the table, rather than being spread out based on the first letter of the subdomain.
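The effect of reversing domain row keys can be sketched in a few lines of Python. This is only an illustration of the sort order, not HBase client code: the helper `reverse_domain` and the sample domains are assumptions made for the example, and sorting the encoded keys stands in for the order in which rows are stored.

```python
def reverse_domain(domain: str) -> str:
    """Reverse the dot-separated labels: www.apache.org -> org.apache.www."""
    return ".".join(reversed(domain.split(".")))


# Hypothetical sample data for the illustration.
domains = [
    "www.apache.org",
    "mail.apache.org",
    "jira.apache.org",
    "www.example.com",
    "mail.example.com",
]

# Sorting the raw bytes stands in for the stored row order.
plain_keys = sorted(d.encode("utf-8") for d in domains)
reversed_keys = sorted(reverse_domain(d).encode("utf-8") for d in domains)

# With plain keys, rows are grouped by subdomain (jira, mail, www).
# With reversed keys, all org.apache.* rows are adjacent.
for key in reversed_keys:
    print(key.decode("utf-8"))
```

With the plain keys, the apache.org and example.com rows are interleaved. With the reversed keys, a scan over the prefix org.apache. would touch one contiguous range.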
A column in HBase consists of a column family and a column qualifier, which are delimited by a : (colon) character.
Column families physically colocate a set of columns and their values, often for performance reasons. Each column family has a set of storage properties, such as whether its values should be cached in memory, how its data is compressed or its row keys are encoded, and others. Each row in a table has the same column families, though a given row might not store anything in a given column family.
Column families are specified when you create your table, and influence the way your data is stored in the underlying filesystem. Therefore, the column families should be considered carefully during schema design.
A column qualifier is added to a column family to provide the index for a given piece of data. Given a column family content, a column qualifier might be content:html, and another might be content:pdf. Though column families are fixed at table creation, column qualifiers are mutable and may differ greatly between rows.
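The difference between fixed column families and free-form qualifiers can be modelled with plain dictionaries. The sketch below is not the HBase API: `TABLE_FAMILIES`, `put`, and the row keys are hypothetical names chosen for the illustration, and the check inside `put` merely mimics the rule that a write must target a family declared when the table was created.

```python
# Hypothetical schema: the column families declared at table creation.
TABLE_FAMILIES = {"content"}


def put(table: dict, row_key: str, family: str, qualifier: str, value: bytes) -> None:
    """Store a value; the family must already exist, the qualifier may be anything."""
    if family not in TABLE_FAMILIES:
        raise KeyError(f"unknown column family: {family}")
    table.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value


table = {}
put(table, "row1", "content", "html", b"<html>...")
put(table, "row2", "content", "pdf", b"%PDF...")

# Both rows share the family "content" but carry different qualifiers.
print(sorted(table["row1"]["content"]))
print(sorted(table["row2"]["content"]))
```

Writing to a family that was never declared fails in this model, while a new qualifier needs no schema change.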
A cell is a combination of row, column family, and column qualifier, and contains a value and a timestamp, which represents the value's version.
A cell's value is an uninterpreted array of bytes.
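Because the value is just bytes, the application decides how to serialize and deserialize it. The following sketch shows one possible convention; the 8-byte big-endian integer layout and the helper names `encode_count` and `decode_count` are assumptions of this example, not something HBase prescribes.

```python
import struct


def encode_count(n: int) -> bytes:
    """Encode an integer as 8 big-endian bytes (a convention assumed for this sketch)."""
    return struct.pack(">q", n)


def decode_count(raw: bytes) -> int:
    """Decode a value written by encode_count."""
    return struct.unpack(">q", raw)[0]


# Text and numbers both end up as opaque byte arrays in the cell.
html_value = "<html>...".encode("utf-8")
count_value = encode_count(42)

print(len(count_value), decode_count(count_value))
print(html_value.decode("utf-8"))
```

A reader must know which encoding the writer used, since nothing in the stored bytes records it.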
A timestamp is written alongside each value, and is the identifier for a given version of a value. By default, the timestamp represents the time on the RegionServer when the data was written, but you can specify a different timestamp value when you put data into the cell.
Direct manipulation of timestamps is an advanced feature which is only exposed for special cases that are deeply integrated with HBase, and is discouraged in general. Encoding a timestamp at the application level is the preferred pattern.
You can specify the maximum number of versions of a value that HBase retains, per column family. When the maximum number of versions is reached, the oldest versions are eventually deleted. By default, only the newest version is kept.
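The interaction between timestamps and the version limit can be modelled as follows. This is a simplified sketch, not HBase code: the class `VersionedCell` is hypothetical, it uses the local clock in milliseconds where HBase would use the RegionServer's time, and it prunes old versions immediately, whereas HBase removes them eventually.

```python
import time


class VersionedCell:
    """Toy model of one cell that keeps at most max_versions values."""

    def __init__(self, max_versions: int = 1):
        self.max_versions = max_versions
        self.versions = {}  # timestamp -> value

    def put(self, value: bytes, timestamp=None) -> None:
        # Default to the current time in milliseconds when no timestamp is given.
        if timestamp is None:
            timestamp = int(time.time() * 1000)
        self.versions[timestamp] = value
        # Drop everything older than the newest max_versions timestamps.
        for ts in sorted(self.versions, reverse=True)[self.max_versions:]:
            del self.versions[ts]

    def get(self, n: int = 1):
        """Return up to n (timestamp, value) pairs, newest first."""
        newest = sorted(self.versions, reverse=True)[:n]
        return [(ts, self.versions[ts]) for ts in newest]


cell = VersionedCell(max_versions=3)
for ts in (3, 5, 6, 8):
    cell.put(b"v%d" % ts, timestamp=ts)

print(cell.get(n=3))
```

After the fourth write, the version with timestamp 3 is gone and the three newest remain. With the default of one version, only the value with the highest timestamp survives.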
You can read a very understandable explanation of the HBase data model in the blog post Understanding HBase and BigTable by Jim R. Wilson. Another good explanation is available in the PDF Introduction to Basic Schema Design by Amandeep Khurana. It may help to read different perspectives to get a solid understanding of HBase schema design. The linked articles cover the same ground as the information in this section.
The following example is a slightly modified form of the one on page 2 of the BigTable paper. There is a table called webtable that contains two rows (com.cnn.www and com.example.www) and three column families named contents, anchor, and people. In this example, for the first row (com.cnn.www), anchor contains two columns (anchor:cnnsi.com, anchor:my.look.ca) and contents contains one column (contents:html). This example contains 5 versions of the row with the row key com.cnn.www, and one version of the row with the row key com.example.www. The contents:html column qualifier contains the entire HTML of a given website. Qualifiers of the anchor column family each contain the external site which links to the site represented by the row, along with the text it used in the anchor of its link. The people column family represents people associated with the site.
By convention, a column name is made of its column family prefix and a qualifier. For example, the column contents:html is made up of the column family contents and the html qualifier. The colon character (:) delimits the column family from the column qualifier.
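The naming convention can be illustrated by splitting a column name at its delimiter. The helper `split_column` is hypothetical, and splitting at the first colon is an assumption of this sketch (it treats everything after the first colon as the qualifier).

```python
def split_column(name: str):
    """Split 'family:qualifier' at the first colon into (family, qualifier)."""
    family, sep, qualifier = name.partition(":")
    if not sep:
        raise ValueError(f"no ':' delimiter in column name: {name!r}")
    return family, qualifier


print(split_column("contents:html"))
print(split_column("anchor:my.look.ca"))
```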
Table 5.1. Table webtable
Row Key | Time Stamp | ColumnFamily contents | ColumnFamily anchor | ColumnFamily people |
---|---|---|---|---|
"com.cnn.www" | t9 | | anchor:cnnsi.com = "CNN" | |
"com.cnn.www" | t8 | | anchor:my.look.ca = "CNN.com" | |
"com.cnn.www" | t6 | contents:html = "<html>..." | | |
"com.cnn.www" | t5 | contents:html = "<html>..." | | |
"com.cnn.www" | t3 | contents:html = "<html>..." | | |
"com.example.www" | t5 | contents:html = "<html>..." | | people:author = "John Doe" |
Cells in this table that appear to be empty do not take space, or in fact exist, in HBase. This is what makes HBase "sparse." A tabular view is not the only possible way to look at data in HBase, or even the most accurate. The following represents the same information as a multi-dimensional map. This is only a mock-up for illustrative purposes and may not be strictly accurate.
    {
      "com.cnn.www": {
        contents: {
          t6: contents:html: "<html>..."
          t5: contents:html: "<html>..."
          t3: contents:html: "<html>..."
        }
        anchor: {
          t9: anchor:cnnsi.com = "CNN"
          t8: anchor:my.look.ca = "CNN.com"
        }
        people: {}
      }
      "com.example.www": {
        contents: {
          t5: contents:html: "<html>..."
        }
        anchor: {}
        people: {
          t5: people:author: "John Doe"
        }
      }
    }
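The same webtable data can also be written as a runnable nested dictionary, which makes the "multi-dimensional map" view concrete. This is a sketch under several assumptions: the nesting order here is row, then column family, then qualifier, then timestamp (slightly different from the mock-up), the symbolic timestamps t3 to t9 are replaced by the integers 3 to 9, values are plain strings rather than bytes, and `get_latest` is a hypothetical helper, not an HBase call.

```python
# row key -> column family -> qualifier -> timestamp -> value
webtable = {
    "com.cnn.www": {
        "contents": {"html": {6: "<html>...", 5: "<html>...", 3: "<html>..."}},
        "anchor": {"cnnsi.com": {9: "CNN"}, "my.look.ca": {8: "CNN.com"}},
    },
    "com.example.www": {
        "contents": {"html": {5: "<html>..."}},
        "people": {"author": {5: "John Doe"}},
    },
}


def get_latest(table: dict, row: str, family: str, qualifier: str):
    """Return (timestamp, value) for the newest version, or None if the cell is absent."""
    versions = table.get(row, {}).get(family, {}).get(qualifier)
    if not versions:
        return None
    newest = max(versions)
    return newest, versions[newest]


print(get_latest(webtable, "com.cnn.www", "contents", "html"))
print(get_latest(webtable, "com.cnn.www", "people", "author"))
```

Cells that are empty in the tabular view simply have no entry in the map, which is why the second lookup finds nothing.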