The following will describe some typical data ingestion use-cases with HBase, and how the rowkey design and construction can be approached. Note: this is just an illustration of potential approaches, not an exhaustive list. Know your data, and know your processing requirements.
It is highly recommended that you read the rest of Chapter 6, HBase and Schema Design, first, before reading these case studies.
The following case studies are described:
Log Data / Timeseries Data
Log Data / Timeseries on Steroids
Customer/Order
Tall/Wide/Middle Schema Design
List Data
Assume that the following data elements are being collected.
Hostname
Timestamp
Log event
Value/message
We can store them in an HBase table called LOG_DATA, but what will the rowkey be? From these attributes the rowkey will be some combination of hostname, timestamp, and log-event - but what specifically?
The rowkey [timestamp][hostname][log-event] suffers from the monotonically increasing rowkey problem described in Section 6.3.2, "Monotonically Increasing Row Keys/Timeseries Data".
There is another pattern frequently mentioned in the dist-lists about “bucketing” timestamps, by performing a mod operation on the timestamp. If time-oriented scans are important, this could be a useful approach. Attention must be paid to the number of buckets, because this will require the same number of scans to return results.
long bucket = timestamp % numBuckets;
… to construct:
[bucket][timestamp][hostname][log-event]
As stated above, to select data for a particular timerange, a Scan will need to be performed for each bucket. 100 buckets, for example, will provide a wide distribution in the keyspace but it will require 100 Scans to obtain data for a single timestamp, so there are trade-offs.
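As a rough illustration of this bucketing approach (the class and method names are just for this sketch, not part of any HBase API, and the 100-bucket choice mirrors the example above), the rowkey construction and the per-bucket Scans might look like this in the Java client, assuming the hostname and log-event fields are already fixed-width:

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class BucketedLogKeys {
  static final int NUM_BUCKETS = 100;   // illustrative choice; tune for your cluster

  // [bucket][timestamp][hostname][log-event] -- hostname and log-event assumed fixed-width
  static byte[] rowkey(long timestamp, byte[] hostname, byte[] logEvent) {
    byte[] bucket = Bytes.toBytes((int) (timestamp % NUM_BUCKETS));
    return Bytes.add(Bytes.add(bucket, Bytes.toBytes(timestamp)), Bytes.add(hostname, logEvent));
  }

  // A time-range read must issue one Scan per bucket and merge the results client-side.
  static void scanTimeRange(Connection conn, long startTs, long stopTs) throws Exception {
    try (Table table = conn.getTable(TableName.valueOf("LOG_DATA"))) {
      for (int b = 0; b < NUM_BUCKETS; b++) {
        Scan scan = new Scan(
            Bytes.add(Bytes.toBytes(b), Bytes.toBytes(startTs)),
            Bytes.add(Bytes.toBytes(b), Bytes.toBytes(stopTs)));
        try (ResultScanner rs = table.getScanner(scan)) {
          for (Result r : rs) {
            // merge/process the results for this bucket
          }
        }
      }
    }
  }
}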
The rowkey [hostname][log-event][timestamp] is a candidate if there is a large-ish number of hosts to spread the writes and reads across the keyspace. This approach would be useful if scanning by hostname was a priority.
If the most important access path is to pull the most recent events, then storing the timestamps as reverse-timestamps (e.g., timestamp = Long.MAX_VALUE - timestamp) will create the property of being able to do a Scan on [hostname][log-event] to quickly obtain the most recently captured events.
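A minimal sketch of building such a key (the class and method are illustrative, not an HBase API; hostname and log-event are assumed fixed-width):

import org.apache.hadoop.hbase.util.Bytes;

public class ReverseTimestampKey {
  // [hostname][log-event][reverse-timestamp]: the most recent event sorts first within a given
  // hostname/log-event prefix, so a forward Scan on that prefix returns the newest data first.
  public static byte[] rowkey(byte[] hostname, byte[] logEvent, long timestamp) {
    long reverseTs = Long.MAX_VALUE - timestamp;
    return Bytes.add(hostname, logEvent, Bytes.toBytes(reverseTs));
  }
}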
Neither approach is wrong, it just depends on what is most appropriate for the situation.
HBASE-4811 implements an API to scan a table or a range within a table in reverse, reducing the need to optimize your schema for forward or reverse scanning. This feature is available in HBase 0.98 and later. See https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setReversed%28boolean for more information.
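For example, a reversed scan is requested on the client-side Scan object; the start row value below is a placeholder, and the sketch assumes the Connection/Table client API:

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ReversedScanExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection();
         Table table = conn.getTable(TableName.valueOf("LOG_DATA"))) {
      Scan scan = new Scan();
      scan.setReversed(true);                         // walk the keyspace from high to low (HBase 0.98+)
      scan.setStartRow(Bytes.toBytes("myserver1"));   // with a reversed scan, this is the highest-sorting row returned
      try (ResultScanner rs = table.getScanner(scan)) {
        for (Result r : rs) {
          System.out.println(Bytes.toStringBinary(r.getRow()));
        }
      }
    }
  }
}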
It is critical to remember that rowkeys are stamped on every column in HBase. If the hostname is “a” and the event type is “e1” then the resulting rowkey would be quite small. However, what if the ingested hostname is “myserver1.mycompany.com” and the event type is “com.package1.subpackage2.subsubpackage3.ImportantService”?
It might make sense to use some substitution in the rowkey. There are at least two approaches: hashed and numeric. In the Hostname In The Rowkey Lead Position example, it might look like this:
Composite Rowkey With Hashes:
[MD5 hash of hostname] = 16 bytes
[MD5 hash of event-type] = 16 bytes
[timestamp] = 8 bytes
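A sketch of assembling that hashed composite key with standard Java MD5 and the HBase Bytes utility (the class and method names are illustrative only):

import java.security.MessageDigest;
import org.apache.hadoop.hbase.util.Bytes;

public class HashedLogKey {
  // [MD5(hostname)][MD5(event-type)][timestamp] = 16 + 16 + 8 bytes, always fixed length
  public static byte[] rowkey(String hostname, String eventType, long timestamp) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    byte[] hostHash  = md5.digest(Bytes.toBytes(hostname));    // 16 bytes
    byte[] eventHash = md5.digest(Bytes.toBytes(eventType));   // 16 bytes
    return Bytes.add(hostHash, eventHash, Bytes.toBytes(timestamp));
  }
}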
Composite Rowkey With Numeric Substitution:
For this approach another lookup table would be needed in addition to LOG_DATA, called LOG_TYPES. The rowkey of LOG_TYPES would be:
[type] (e.g., byte indicating hostname vs. event-type)
[bytes] variable length bytes for raw hostname or event-type.
A column for this rowkey could be a long with an assigned number, which could be obtained by using an HBase counter (a sketch of such an assignment follows the key layout below).
So the resulting composite rowkey would be:
[substituted long for hostname] = 8 bytes
[substituted long for event type] = 8 bytes
[timestamp] = 8 bytes
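One hedged way to perform the counter-based assignment mentioned above: the LOG_TYPES column family and qualifier names here are assumptions for illustration, and this sketch ignores the race where two clients register the same raw value at once (a check-and-put or locking scheme would be needed in practice):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class LogTypeRegistry {
  private static final byte[] CF          = Bytes.toBytes("d");
  private static final byte[] ID          = Bytes.toBytes("id");
  private static final byte[] COUNTER_ROW = Bytes.toBytes("counter");

  // Returns the long assigned to this [type][raw value] key, assigning a new one if needed.
  public static long idFor(Connection conn, byte type, byte[] rawValue) throws Exception {
    try (Table types = conn.getTable(TableName.valueOf("LOG_TYPES"))) {
      byte[] row = Bytes.add(new byte[] { type }, rawValue);
      Result existing = types.get(new Get(row));
      if (!existing.isEmpty()) {
        return Bytes.toLong(existing.getValue(CF, ID));
      }
      long assigned = types.incrementColumnValue(COUNTER_ROW, CF, ID, 1);  // HBase counter
      Put put = new Put(row);
      put.addColumn(CF, ID, Bytes.toBytes(assigned));
      types.put(put);
      return assigned;
    }
  }
}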
In either the Hash or Numeric substitution approach, the raw values for hostname and event-type can be stored as columns.
This effectively is the OpenTSDB approach. What OpenTSDB does is re-write data and pack rows into columns for certain time-periods. For a detailed explanation, see: http://opentsdb.net/schema.html, and Lessons Learned from OpenTSDB from HBaseCon2012.
But this is how the general concept works: data is ingested, for example, in this manner…
[hostname][log-event][timestamp1]
[hostname][log-event][timestamp2]
[hostname][log-event][timestamp3]
… with separate rowkeys for each detailed event, but is re-written like this…
[hostname][log-event][timerange]
… and each of the above events is converted into columns stored with a time-offset relative to the beginning of the timerange (e.g., every 5 minutes). This is obviously a very advanced processing technique, but HBase makes this possible.
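A greatly simplified sketch of that re-writing idea (the packed table name and column family are assumptions, and real OpenTSDB encodes its rows far more compactly, so treat this purely as an illustration of the [hostname][log-event][timerange] row with offset-named columns):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PackedLogWriter {
  private static final long ROW_SPAN_MS = 5 * 60 * 1000L;        // each row covers a 5-minute timerange
  private static final byte[] CF = Bytes.toBytes("d");

  public static void writePacked(Connection conn, byte[] hostname, byte[] logEvent,
                                 long timestamp, byte[] message) throws Exception {
    long rowBase = timestamp - (timestamp % ROW_SPAN_MS);         // start of this row's timerange
    int offset   = (int) (timestamp - rowBase);                   // position within the timerange
    byte[] row   = Bytes.add(hostname, logEvent, Bytes.toBytes(rowBase));
    Put put = new Put(row);
    put.addColumn(CF, Bytes.toBytes(offset), message);            // column qualifier = time offset
    try (Table table = conn.getTable(TableName.valueOf("LOG_DATA_PACKED"))) {
      table.put(put);
    }
  }
}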
Assume that HBase is used to store customer and order information. There are two core record-types being ingested: a Customer record type, and Order record type.
The Customer record type would include all the things that you’d typically expect:
Customer number
Customer name
Address (e.g., city, state, zip)
Phone numbers, etc.
The Order record type would include things like:
Customer number
Order number
Sales date
A series of nested objects for shipping locations and line-items (see Section 6.11.3.2, “Order Object Design” for details)
Assuming that the combination of customer number and sales order uniquely identify an order, these two attributes will compose the rowkey, and specifically a composite key such as:
[customer number][order number]
… for an ORDER table. However, there are more design decisions to make: are the raw values the best choices for rowkeys?
The same design questions in the Log Data use-case confront us here. What is the keyspace of the customer number, and what is the format (e.g., numeric, alphanumeric)? As it is advantageous to use fixed-length keys in HBase, as well as keys that can support a reasonable spread in the keyspace, similar options appear:
Composite Rowkey With Hashes:
[MD5 of customer number] = 16 bytes
[MD5 of order number] = 16 bytes
Composite Numeric/Hash Combo Rowkey:
[substituted long for customer number] = 8 bytes
[MD5 of order number] = 16 bytes
A traditional design approach would have separate tables for CUSTOMER and SALES. Another option is to pack multiple record types into a single table (e.g., CUSTOMER++).
Customer Record Type Rowkey:
[customer-id]
[type] = type indicating ‘1’ for customer record type
Order Record Type Rowkey:
[customer-id]
[type] = type indicating ‘2’ for order record type
[order]
The advantage of this particular CUSTOMER++ approach is that it organizes many different record-types by customer-id (e.g., a single scan could get you everything about that customer). The disadvantage is that it's not as easy to scan for a particular record-type.
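A sketch of assembling those two rowkey layouts; the single-byte type codes, and the assumption that customer-id and order-id are already fixed-width byte arrays, are illustrative only:

import org.apache.hadoop.hbase.util.Bytes;

public class CustomerPlusPlusKeys {
  private static final byte CUSTOMER_TYPE = 1;
  private static final byte ORDER_TYPE    = 2;

  // [customer-id][type=1]
  public static byte[] customerRow(byte[] customerId) {
    return Bytes.add(customerId, new byte[] { CUSTOMER_TYPE });
  }

  // [customer-id][type=2][order]
  public static byte[] orderRow(byte[] customerId, byte[] orderId) {
    return Bytes.add(customerId, new byte[] { ORDER_TYPE }, orderId);
  }
}

A single Scan bounded by the customer-id prefix would then return both the customer record and all of that customer's orders in one pass.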
Now we need to address how to model the Order object. Assume that the class structure is as follows:
Order (an Order can have multiple ShippingLocations)
ShippingLocation (a ShippingLocation can have multiple LineItems)
There are multiple options on storing this data.
With this approach, there would be separate tables for ORDER, SHIPPING_LOCATION, and LINE_ITEM.
The ORDER table's rowkey was described above: Section 6.11.3, “Case Study - Customer/Order”
The SHIPPING_LOCATION's composite rowkey would be something like this:
[order-rowkey]
[shipping location number] (e.g., 1st location, 2nd, etc.)
The LINE_ITEM table's composite rowkey would be something like this:
[order-rowkey]
[shipping location number] (e.g., 1st location, 2nd, etc.)
[line item number] (e.g., 1st lineitem, 2nd, etc.)
Such a normalized model is likely to be the approach with an RDBMS, but that's not your only option with HBase. The cons of such an approach are that to retrieve information about any Order, you will need:
Get on the ORDER table for the Order
Scan on the SHIPPING_LOCATION table for that order to get the ShippingLocation instances
Scan on the LINE_ITEM for each ShippingLocation
... granted, this is what an RDBMS would do under the covers anyway, but since there are no joins in HBase you're just more aware of this fact.
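Sketched with the Java client, those three steps might look like the following; the prefix-scan call assumes a 1.0-or-later client, and the orderRowkey parameter is whatever composite key was chosen above:

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;

public class NormalizedOrderReader {
  public static void readOrder(Connection conn, byte[] orderRowkey) throws Exception {
    try (Table orders    = conn.getTable(TableName.valueOf("ORDER"));
         Table locations = conn.getTable(TableName.valueOf("SHIPPING_LOCATION"));
         Table lineItems = conn.getTable(TableName.valueOf("LINE_ITEM"))) {

      // 1. Get on the ORDER table for the Order itself.
      Result order = orders.get(new Get(orderRowkey));

      // 2. Scan SHIPPING_LOCATION for every row starting with the order rowkey.
      Scan locScan = new Scan();
      locScan.setRowPrefixFilter(orderRowkey);
      try (ResultScanner locs = locations.getScanner(locScan)) {
        for (Result loc : locs) {
          // 3. Scan LINE_ITEM with the [order-rowkey][shipping location number] prefix.
          Scan itemScan = new Scan();
          itemScan.setRowPrefixFilter(loc.getRow());
          try (ResultScanner items = lineItems.getScanner(itemScan)) {
            for (Result item : items) {
              // ... assemble the ShippingLocations and LineItems for this Order ...
            }
          }
        }
      }
    }
  }
}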
With this approach, there would exist a single table ORDER that would contain all of the record types described below.
The Order rowkey was described above in Section 6.11.3, “Case Study - Customer/Order”:
[order-rowkey]
[ORDER record type]
The ShippingLocation composite rowkey would be something like this:
[order-rowkey]
[SHIPPING record type]
[shipping location number] (e.g., 1st location, 2nd, etc.)
The LineItem composite rowkey would be something like this:
[order-rowkey]
[LINE record type]
[shipping location number] (e.g., 1st location, 2nd, etc.)
[line item number] (e.g., 1st lineitem, 2nd, etc.)
A variant of the Single Table With Record Types approach is to denormalize and flatten some of the object hierarchy, such as collapsing the ShippingLocation attributes onto each LineItem instance.
The LineItem composite rowkey would be something like this:
[order-rowkey]
[LINE record type]
[line item number] (e.g., 1st lineitem, 2nd, etc.; care must be taken that these are unique across the entire order)
... and the LineItem columns would be something like this:
itemNumber
quantity
price
shipToLine1 (denormalized from ShippingLocation)
shipToLine2 (denormalized from ShippingLocation)
shipToCity (denormalized from ShippingLocation)
shipToState (denormalized from ShippingLocation)
shipToZip (denormalized from ShippingLocation)
The pros of this approach include a less complex object hierarchy, but one of the cons is that updating gets more complicated in case any of this information changes.
With this approach, the entire Order object graph is treated, in one way or another, as a BLOB. For example, the ORDER table's rowkey was described above (Section 6.11.3, “Case Study - Customer/Order”), and a single column called "order" would contain an object that could be deserialized into a container holding the Order, ShippingLocations, and LineItems.
There are many options here: JSON, XML, Java Serialization, Avro, Hadoop Writables, etc. All of them are variants of the same approach: encode the object graph to a byte-array. Care should be taken with this approach to ensure backward compatibility in case the object model changes, such that older persisted structures can still be read back out of HBase.
Pros are being able to manage complex object graphs with minimal I/O (e.g., a single HBase Get per Order in this example), but the cons include the aforementioned warning about backward compatibility of serialization, language dependencies of serialization (e.g., Java Serialization only works with Java clients), the fact that you have to deserialize the entire object to get any piece of information inside the BLOB, and the difficulty in getting frameworks like Hive to work with custom objects like this.
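As one hedged example among the formats listed, a JSON-as-BLOB write and read might look like the following with the Jackson ObjectMapper; the Order placeholder class, column family, and column name are assumptions for this sketch, and any of the other serialization options would follow the same shape:

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class OrderBlobDao {
  private static final byte[] CF  = Bytes.toBytes("d");
  private static final byte[] COL = Bytes.toBytes("order");
  private static final ObjectMapper MAPPER = new ObjectMapper();

  // Placeholder domain class; a real Order would carry its ShippingLocations and LineItems.
  public static class Order {
    public String orderNumber;
    public java.util.List<Object> shippingLocations = new java.util.ArrayList<>();
  }

  public static void save(Connection conn, byte[] orderRowkey, Order order) throws Exception {
    try (Table table = conn.getTable(TableName.valueOf("ORDER"))) {
      Put put = new Put(orderRowkey);
      put.addColumn(CF, COL, MAPPER.writeValueAsBytes(order));   // whole object graph as one BLOB
      table.put(put);
    }
  }

  public static Order load(Connection conn, byte[] orderRowkey) throws Exception {
    try (Table table = conn.getTable(TableName.valueOf("ORDER"))) {
      Result result = table.get(new Get(orderRowkey));           // a single Get retrieves everything
      return MAPPER.readValue(result.getValue(CF, COL), Order.class);
    }
  }
}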
This section will describe additional schema design questions that appear on the dist-list, specifically about tall and wide tables. These are general guidelines and not laws - each application must consider its own needs.
A common question is whether one should prefer rows or HBase's built-in versioning. The context is typically where there are "a lot" of versions of a row to be retained (e.g., where it is significantly above the HBase default of 1 max version). The rows-approach would require storing a timestamp in some portion of the rowkey so that they would not overwrite with each successive update.
Preference: Rows (generally speaking).
Another common question is whether one should prefer rows or columns. The context is typically in extreme cases of wide tables, such as having 1 row with 1 million attributes, or 1 million rows with 1 column apiece.
Preference: Rows (generally speaking). To be clear, this guideline is in the context of extremely wide cases, not in the standard use-case where one needs to store a few dozen or hundred columns. But there is also a middle path between these two options, and that is "Rows as Columns."
The middle path between Rows and Columns is to pack data that would otherwise be a separate row into columns, for certain rows. OpenTSDB is the best example of this case, where a single row represents a defined time-range and discrete events are treated as columns. This approach is often more complex, and may require the additional complexity of re-writing your data, but has the advantage of being I/O efficient. For an overview of this approach, see Section 6.11.2, “Case Study - Log Data and Timeseries Data on Steroids”.
The following is an exchange from the user dist-list regarding a fairly common question: how to handle per-user list data in Apache HBase.
*** QUESTION ***
We're looking at how to store a large amount of (per-user) list data in HBase, and we were trying to figure out what kind of access pattern made the most sense. One option is to store the majority of the data in a key, so we could have something like:
<FixedWidthUserName><FixedWidthValueId1>:"" (no value)
<FixedWidthUserName><FixedWidthValueId2>:"" (no value)
<FixedWidthUserName><FixedWidthValueId3>:"" (no value)
The other option we had was to do this entirely using:
<FixedWidthUserName><FixedWidthPageNum0>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...
<FixedWidthUserName><FixedWidthPageNum1>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...
where each row would contain multiple values. So in one case reading the first thirty values would be:
scan { STARTROW => 'FixedWidthUsername' LIMIT => 30}
And in the second case it would be
get 'FixedWidthUserName\x00\x00\x00\x00'
The general usage pattern would be to read only the first 30 values of these lists, with infrequent access reading deeper into the lists. Some users would have <= 30 total values in these lists, and some users would have millions (i.e. power-law distribution)
The single-value format seems like it would take up more space on HBase, but would offer some improved retrieval / pagination flexibility. Would there be any significant performance advantages to be able to paginate via gets vs paginating with scans?
My initial understanding was that doing a scan should be faster if our paging size is unknown (and caching is set appropriately), but that gets should be faster if we'll always need the same page size. I've ended up hearing different people tell me opposite things about performance. I assume the page sizes would be relatively consistent, so for most use cases we could guarantee that we only wanted one page of data in the fixed-page-length case. I would also assume that we would have infrequent updates, but may have inserts into the middle of these lists (meaning we'd need to update all subsequent rows).
Thanks for help / suggestions / follow-up questions.
*** ANSWER ***
If I understand you correctly, you're ultimately trying to store triples in the form "user, valueid, value", right? E.g., something like:
"user123, firstname, Paul", "user234, lastname, Smith"
(But the usernames are fixed width, and the valueids are fixed width).
And, your access pattern is along the lines of: "for user X, list the next 30 values, starting with valueid Y". Is that right? And these values should be returned sorted by valueid?
The tl;dr version is that you should probably go with one row per user+value, and not build a complicated intra-row pagination scheme on your own unless you're really sure it is needed.
Your two options mirror a common question people have when designing HBase schemas: should I go "tall" or "wide"? Your first schema is "tall": each row represents one value for one user, and so there are many rows in the table for each user; the row key is user + valueid, and there would be (presumably) a single column qualifier that means "the value". This is great if you want to scan over rows in sorted order by row key (thus my question above, about whether these ids are sorted correctly). You can start a scan at any user+valueid, read the next 30, and be done. What you're giving up is the ability to have transactional guarantees around all the rows for one user, but it doesn't sound like you need that. Doing it this way is generally recommended (see here http://hbase.apache.org/book.html#schema.smackdown).
Your second option is "wide": you store a bunch of values in one row, using different qualifiers (where the qualifier is the valueid). The simple way to do that would be to just store ALL values for one user in a single row. I'm guessing you jumped to the "paginated" version because you're assuming that storing millions of columns in a single row would be bad for performance, which may or may not be true; as long as you're not trying to do too much in a single request, or do things like scanning over and returning all of the cells in the row, it shouldn't be fundamentally worse. The client has methods that allow you to get specific slices of columns.
Note that neither case fundamentally uses more disk space than the other; you're just "shifting" part of the identifying information for a value either to the left (into the row key, in option one) or to the right (into the column qualifiers in option 2). Under the covers, every key/value still stores the whole row key, and column family name. (If this is a bit confusing, take an hour and watch Lars George's excellent video about understanding HBase schema design: http://www.youtube.com/watch?v=_HLoH_PgrLk).
A manually paginated version has lots more complexities, as you note, like having to keep track of how many things are in each page, re-shuffling if new values are inserted, etc. That seems significantly more complex. It might have some slight speed advantages (or disadvantages!) at extremely high throughput, and the only way to really know that would be to try it out. If you don't have time to build it both ways and compare, my advice would be to start with the simplest option (one row per user+value). Start simple and iterate! :)
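To make the recommended "tall" option concrete, here is a hedged sketch of reading one page of values for a user; the table name, column family, qualifier, and fixed-width encodings are assumptions for illustration only:

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class UserListReader {
  private static final byte[] CF  = Bytes.toBytes("d");
  private static final byte[] VAL = Bytes.toBytes("v");

  // Reads up to pageSize values for user, starting at startValueId (both already fixed-width).
  public static void readPage(Connection conn, byte[] user, byte[] startValueId, int pageSize)
      throws Exception {
    try (Table table = conn.getTable(TableName.valueOf("USER_LIST"))) {
      Scan scan = new Scan(Bytes.add(user, startValueId));    // rowkey = [user][valueid]
      scan.setCaching(pageSize);                               // fetch roughly one page per RPC
      int read = 0;
      try (ResultScanner rs = table.getScanner(scan)) {
        for (Result r : rs) {
          if (!Bytes.startsWith(r.getRow(), user) || read++ >= pageSize) {
            break;                                             // past this user, or page is full
          }
          byte[] value = r.getValue(CF, VAL);
          // ... use the value ...
        }
      }
    }
  }
}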