Chapter 5. Data Model

Table of Contents

5.1. Conceptual View
5.2. Physical View
5.3. Namespace
5.3.1. Namespace management
5.3.2. Predefined namespaces
5.4. Table
5.5. Row
5.6. Column Family
5.7. Cells
5.8. Data Model Operations
5.8.1. Get
5.8.2. Put
5.8.3. Scans
5.8.4. Delete
5.9. Versions
5.9.1. Specifying the Number of Versions to Store
5.9.2. Versions and HBase Operations
5.9.3. Current Limitations
5.10. Sort Order
5.11. Column Metadata
5.12. Joins
5.13. ACID

In HBase, data is stored in tables, which have rows and columns. This terminology overlaps with that of relational databases (RDBMSs), but the analogy is not a helpful one. Instead, it can be helpful to think of an HBase table as a multi-dimensional map.

HBase Data Model Terminology

Table

An HBase table consists of multiple rows.

Row

A row in HBase consists of a row key and one or more columns with values associated with them. Rows are sorted lexicographically by row key as they are stored. For this reason, the design of the row key is very important. The goal is to store data in such a way that related rows are near each other. A common row key pattern is a website domain. If your row keys are domains, you should probably store them in reverse (org.apache.www, org.apache.mail, org.apache.jira). This way, all of the Apache domains are near each other in the table, rather than being spread out based on the first letter of the subdomain.
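The effect of reversing domain names on sort order can be illustrated with a small sketch. It uses plain Python lists rather than HBase, the helper reverse_domain and the sample hostnames are hypothetical, and it assumes keys are compared as byte strings.

```python
def reverse_domain(host: str) -> str:
    """Reverse the dot-separated labels of a hostname (hypothetical helper)."""
    return ".".join(reversed(host.split(".")))


hosts = [
    "www.apache.org",
    "mail.apache.org",
    "jira.apache.org",
    "www.example.com",
    "mail.example.com",
]

# Sorting the raw hostnames interleaves hosts from different sites,
# because the leading label (www, mail, jira) dominates the comparison.
plain_keys = sorted(h.encode("utf-8") for h in hosts)

# Sorting the reversed form keeps every org.apache.* key adjacent.
reversed_keys = sorted(reverse_domain(h).encode("utf-8") for h in hosts)

for key in reversed_keys:
    print(key.decode("utf-8"))
```

With the reversed keys, a range of consecutive rows covers one site; with the plain keys, the same hosts are separated by rows that belong to other sites.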

Column

A column in HBase consists of a column family and a column qualifier, which are delimited by a : (colon) character.

Column Family

Column families physically colocate a set of columns and their values, often for performance reasons. Each column family has a set of storage properties, such as whether its values should be cached in memory, how its data is compressed or its row keys are encoded, and others. Each row in a table has the same column families, though a given row might not store anything in a given column family.

Column families are specified when you create your table, and influence the way your data is stored in the underlying filesystem. Therefore, the column families should be considered carefully during schema design.

Column Qualifier

A column qualifier is added to a column family to provide the index for a given piece of data. Given a column family content, a column qualifier might be content:html, and another might be content:pdf. Though column families are fixed at table creation, column qualifiers are mutable and may differ greatly between rows.

Cell

A cell is a combination of row, column family, and column qualifier, and contains a value and a timestamp, which represents the value's version.

A cell's value is an uninterpreted array of bytes.
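Because the value is only bytes, the application decides how to encode and decode it. The following sketch uses the Python standard library, not an HBase client; the UTF-8 and big-endian encodings are assumptions chosen for illustration.

```python
import struct

# The store sees only bytes; these encodings are the application's choice.
title = "Apache HBase".encode("utf-8")
hits = struct.pack(">q", 42)  # 8-byte big-endian signed integer

# Reading a value back requires knowing which encoding was used to write it.
decoded_title = title.decode("utf-8")
decoded_hits = struct.unpack(">q", hits)[0]

# The same 8 bytes yield a different number under another interpretation.
as_little_endian = struct.unpack("<q", hits)[0]
```

Nothing in the stored bytes records which interpretation is correct, so readers and writers of a column have to agree on the encoding.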

Timestamp

A timestamp is written alongside each value, and is the identifier for a given version of a value. By default, the timestamp represents the time on the RegionServer when the data was written, but you can specify a different timestamp value when you put data into the cell.

Caution

Direct manipulation of timestamps is an advanced feature which is only exposed for special cases that are deeply integrated with HBase, and is discouraged in general. Encoding a timestamp at the application level is the preferred pattern.

You can specify the maximum number of versions of a value that HBase retains, per column family. When the maximum number of versions is reached, the oldest versions are eventually deleted. By default, only the newest version is kept.
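The retention rule can be sketched with an ordinary dictionary that maps timestamps to values. The function put and the parameter max_versions are hypothetical names, not the HBase API, and the sketch prunes immediately, whereas HBase removes surplus versions eventually.

```python
def put(versions, timestamp, value, max_versions=1):
    """Record a value under a timestamp, then keep only the newest entries.

    `versions` maps timestamp -> value for a single cell (hypothetical model).
    """
    versions[timestamp] = value
    # Newest timestamps first; anything beyond max_versions is dropped.
    for ts in sorted(versions, reverse=True)[max_versions:]:
        del versions[ts]
    return versions


cell = {}
for ts, value in [(3, b"v3"), (5, b"v5"), (6, b"v6")]:
    put(cell, ts, value, max_versions=2)

# With the default of one version, a newer write replaces the older one.
single = put(put({}, 1, b"a"), 2, b"b")
```

After the three writes, cell holds only the values at timestamps 5 and 6, the two newest.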

5.1. Conceptual View

You can read a very understandable explanation of the HBase data model in the blog post Understanding HBase and BigTable by Jim R. Wilson. Another good explanation is available in the PDF Introduction to Basic Schema Design by Amandeep Khurana. It may help to read different perspectives to get a solid understanding of HBase schema design. The linked articles cover the same ground as the information in this section.

The following example is a slightly modified form of the one on page 2 of the BigTable paper. There is a table called webtable that contains two rows (com.cnn.www and com.example.www) and three column families named contents, anchor, and people. In this example, for the first row (com.cnn.www), anchor contains two columns (anchor:cnnsi.com, anchor:my.look.ca) and contents contains one column (contents:html). This example contains 5 versions of the row with the row key com.cnn.www, and one version of the row with the row key com.example.www. The contents:html column qualifier contains the entire HTML of a given website. Qualifiers of the anchor column family each contain the external site which links to the site represented by the row, along with the text it used in the anchor of its link. The people column family represents people associated with the site.

Column Names

By convention, a column name is made of its column family prefix and a qualifier. For example, the column contents:html is made up of the column family contents and the html qualifier. The colon character (:) delimits the column family from the column qualifier.
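The naming convention can be sketched as a small parsing helper. The function split_column is hypothetical, and splitting at the first colon (so that a qualifier may itself contain colons) is an assumption of this sketch.

```python
def split_column(name: str):
    """Split 'family:qualifier' into its two parts (hypothetical helper).

    Assumption: the first colon is the delimiter, so the qualifier may
    contain further colons.
    """
    family, sep, qualifier = name.partition(":")
    if not sep:
        raise ValueError(f"no ':' delimiter in column name {name!r}")
    return family, qualifier


family, qualifier = split_column("contents:html")
print(family, qualifier)
```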

Table 5.1. Table webtable

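| Row Key           | Time Stamp | ColumnFamily contents       | ColumnFamily anchor           | ColumnFamily people        |
|-------------------|------------|-----------------------------|-------------------------------|----------------------------|
| "com.cnn.www"     | t9         |                             | anchor:cnnsi.com = "CNN"      |                            |
| "com.cnn.www"     | t8         |                             | anchor:my.look.ca = "CNN.com" |                            |
| "com.cnn.www"     | t6         | contents:html = "<html>..." |                               |                            |
| "com.cnn.www"     | t5         | contents:html = "<html>..." |                               |                            |
| "com.cnn.www"     | t3         | contents:html = "<html>..." |                               |                            |
| "com.example.www" | t5         | contents:html = "<html>..." |                               | people:author = "John Doe" |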

Cells in this table that appear to be empty do not take space, or in fact exist, in HBase. This is what makes HBase "sparse." A tabular view is not the only possible way to look at data in HBase, or even the most accurate. The following represents the same information as a multi-dimensional map. This is only a mock-up for illustrative purposes and may not be strictly accurate.

{
	"com.cnn.www": {
		contents: {
			t6: contents:html: "<html>..."
			t5: contents:html: "<html>..."
			t3: contents:html: "<html>..."
		}
		anchor: {
			t9: anchor:cnnsi.com: "CNN"
			t8: anchor:my.look.ca: "CNN.com"
		}
		people: {}
	}
	"com.example.www": {
		contents: {
			t5: contents:html: "<html>..."
		}
		anchor: {}
		people: {
			t5: people:author: "John Doe"
		}
	}
}        
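The same data can also be modeled as a nested dictionary and queried. This sketch nests row, column family, qualifier, and timestamp in that order, which is one possible arrangement and differs from the mock-up above; the integer timestamps stand in for t3 through t9, and get_latest is a hypothetical helper, not an HBase call.

```python
# row key -> column family -> qualifier -> timestamp -> value
webtable = {
    "com.cnn.www": {
        "contents": {"html": {6: "<html>...", 5: "<html>...", 3: "<html>..."}},
        "anchor": {"cnnsi.com": {9: "CNN"}, "my.look.ca": {8: "CNN.com"}},
    },
    "com.example.www": {
        "contents": {"html": {5: "<html>..."}},
        "people": {"author": {5: "John Doe"}},
    },
}


def get_latest(table, row, family, qualifier):
    """Return (timestamp, value) for the newest version, or None if absent."""
    versions = table.get(row, {}).get(family, {}).get(qualifier)
    if not versions:
        return None
    newest = max(versions)
    return newest, versions[newest]


print(get_latest(webtable, "com.cnn.www", "anchor", "cnnsi.com"))
print(get_latest(webtable, "com.cnn.www", "people", "author"))
```

Empty cells are simply missing keys, which mirrors the sparseness described above: the lookup for people:author on com.cnn.www finds nothing because nothing was stored there.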
        