9.6. RegionServer

HRegionServer is the RegionServer implementation. It is responsible for serving and managing regions. In a distributed cluster, a RegionServer runs on a Section 9.9.2, “DataNode”.

9.6.1. Interface

The methods exposed by HRegionInterface contain both data-oriented and region-maintenance methods:

  • Data (get, put, delete, next, etc.)

  • Region (splitRegion, compactRegion, etc.)

For example, when the HBaseAdmin method majorCompact is invoked on a table, the client actually iterates through all regions of the specified table and requests a major compaction for each region directly.
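
For instance, a client-side major compaction request might look like the following sketch (the table name "myTable" is hypothetical; exception handling is omitted):

    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // Iterates the table's regions under the covers and asks each
      // hosting RegionServer to major compact them.
      admin.majorCompact("myTable");
    } finally {
      admin.close();
    }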

9.6.2. Processes

The RegionServer runs a variety of background threads:

9.6.2.1. CompactSplitThread

Checks for splits and handles minor compactions.

9.6.2.2. MajorCompactionChecker

Checks for major compactions.

9.6.2.3. MemStoreFlusher

Periodically flushes in-memory writes in the MemStore to StoreFiles.

9.6.2.4. LogRoller

Periodically checks the RegionServer's HLog.

9.6.3. Coprocessors

Coprocessors were added in 0.92. There is a thorough Blog Overview of CoProcessors posted. Documentation will eventually move to this reference guide, but the blog is the most current information available at this time.

9.6.4. Block Cache

HBase provides two different BlockCache implementations: the default onheap LruBlockCache and BucketCache, which is (usually) offheap. This section discusses benefits and drawbacks of each implementation, how to choose the appropriate option, and configuration options for each.

Block Cache Reporting: UI

See the RegionServer UI for details on the caching deploy. Since HBase 0.98.4, the Block Cache detail has been significantly extended, showing configurations, sizings, current usage, time-in-the-cache, and even detail on block counts and types.

9.6.4.1. Cache Choices

LruBlockCache is the original implementation, and is entirely within the Java heap. BucketCache is mainly intended for keeping blockcache data offheap, although BucketCache can also keep data onheap and serve from a file-backed cache.

BucketCache is production ready as of hbase-0.98.6

To run with BucketCache, you need HBASE-11678. This was included in hbase-0.98.6.

Fetching from BucketCache will always be slower than fetching from the native onheap LruBlockCache. However, latencies tend to be less erratic over time, because there is less garbage collection when you use BucketCache: BlockCache allocations are managed by the BucketCache, not by the GC. If the BucketCache is deployed in offheap mode, this memory is not managed by the GC at all. This is why you would use BucketCache: your latencies are less erratic, and GCs and heap fragmentation are mitigated. See Nick Dimiduk's BlockCache 101 for comparisons of onheap versus offheap runs. Also see Comparing BlockCache Deploys, which concludes that if your dataset fits inside your LruBlockCache deploy, use LruBlockCache; otherwise, if you are experiencing cache churn (or you want your cache to exist beyond the vagaries of Java GC), use BucketCache.

When you enable BucketCache, you are enabling a two tier caching system, an L1 cache which is implemented by an instance of LruBlockCache and an offheap L2 cache which is implemented by BucketCache. Management of these two tiers and the policy that dictates how blocks move between them is done by CombinedBlockCache. It keeps all DATA blocks in the L2 BucketCache and meta blocks -- INDEX and BLOOM blocks -- onheap in the L1 LruBlockCache. See Section 9.6.4.5, “Offheap Block Cache” for more detail on going offheap.

9.6.4.2. General Cache Configurations

Apart from the cache implementation itself, you can set some general configuration options to control how the cache performs. See http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html. After setting any of these options, restart or rolling restart your cluster for the configuration to take effect. Check logs for errors or unexpected behavior.

See also Section 14.4.4, “Prefetch Option for Blockcache”, which discusses a new option introduced in HBASE-9857.

9.6.4.3. LruBlockCache Design

The LruBlockCache is an LRU cache that contains three levels of block priority to allow for scan-resistance and in-memory ColumnFamilies:

  • Single access priority: The first time a block is loaded from HDFS it normally has this priority and it will be part of the first group to be considered during evictions. The advantage is that scanned blocks are more likely to get evicted than blocks that are getting more usage.

  • Multi access priority: If a block in the previous priority group is accessed again, it upgrades to this priority. It is thus part of the second group considered during evictions.

  • In-memory access priority: If the block's family was configured to be "in-memory", it will be part of this priority disregarding the number of times it was accessed. Catalog tables are configured like this. This group is the last one considered during evictions.

    To mark a column family as in-memory, call

    HColumnDescriptor.setInMemory(true);

    if creating a table from java, or set IN_MEMORY => true when creating or altering a table in the shell: e.g.

    hbase(main):003:0> create  't', {NAME => 'f', IN_MEMORY => 'true'}

For more information, see the LruBlockCache source.
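
As an illustrative sketch (table and family names are hypothetical), the same setting applied when creating a table from the Java client:

    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("t"));
    HColumnDescriptor family = new HColumnDescriptor("f");
    family.setInMemory(true);   // blocks of this family get the in-memory priority
    desc.addFamily(family);
    admin.createTable(desc);
    admin.close();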

9.6.4.4. LruBlockCache Usage

Block caching is enabled by default for all the user tables which means that any read operation will load the LRU cache. This might be good for a large number of use cases, but further tunings are usually required in order to achieve better performance. An important concept is the working set size, or WSS, which is: "the amount of memory needed to compute the answer to a problem". For a website, this would be the data that's needed to answer the queries over a short amount of time.

The way to calculate how much memory is available in HBase for caching is:

            number of region servers * heap size * hfile.block.cache.size * 0.99
        

The default value for the block cache is 0.25, which represents 25% of the available heap. The last value (99%) is the default acceptable loading factor in the LRU cache, after which eviction is started. The reason it is included in this equation is that it would be unrealistic to claim that you can use 100% of the available memory, since eviction would then block the process at the point where it loads new blocks. Here are some examples; a worked sketch of the arithmetic follows the list:

  • One region server with the default heap size (1 GB) and the default block cache size will have 253 MB of block cache available.

  • 20 region servers with the heap size set to 8 GB and a default block cache size will have 39.6 GB of block cache.

  • 100 region servers with the heap size set to 24 GB and a block cache size of 0.5 will have about 1.16 TB of block cache.
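
As a quick sanity check of the second example, here is the arithmetic as a small sketch (plain Java, nothing HBase-specific):

    int regionServers = 20;
    double heapGB = 8.0;               // heap size per RegionServer
    double blockCacheFraction = 0.25;  // default hfile.block.cache.size
    double loadingFactor = 0.99;       // acceptable loading factor of the LRU
    double cacheGB = regionServers * heapGB * blockCacheFraction * loadingFactor;
    System.out.printf("%.1f GB of block cache%n", cacheGB);  // prints 39.6 GB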

Your data is not the only resident of the block cache. Here are others that you may have to take into account:

Catalog Tables

The -ROOT- (prior to HBase 0.96; see Section 9.2.1, “-ROOT-”) and hbase:meta tables are forced into the block cache and have the in-memory priority, which means that they are harder to evict. The former never uses more than a few hundred bytes, while the latter can occupy a few MB (depending on the number of regions).

HFiles Indexes

An hfile is the file format that HBase uses to store data in HDFS. It contains a multi-layered index which allows HBase to seek to the data without having to read the whole file. The size of those indexes is a factor of the block size (64KB by default), the size of your keys and the amount of data you are storing. For big data sets it's not unusual to see numbers around 1GB per region server, although not all of it will be in cache because the LRU will evict indexes that aren't used.

Keys

The values that are stored are only half the picture, since each value is stored along with its keys (row key, family, qualifier, and timestamp). See Section 6.3.3, “Try to minimize row and column sizes”.

Bloom Filters

Just like the HFile indexes, those data structures (when enabled) are stored in the LRU.

Currently the recommended way to measure HFile index and bloom filter sizes is to look at the RegionServer web UI and check the relevant metrics. For keys, sampling can be done by using the HFile command line tool and looking at the average key size metric. Since HBase 0.98.3, you can view details on BlockCache stats and metrics in a special Block Cache section of the UI.

It's generally bad to use block caching when the WSS doesn't fit in memory. This is the case when, for example, you have 40 GB available across all your RegionServers' block caches but you need to process 1 TB of data. One of the reasons is that the churn generated by the evictions will trigger more garbage collections unnecessarily. Here are two use cases:

  • Fully random reading pattern: This is a case where you almost never access the same row twice within a short amount of time, such that the chance of hitting a cached block is close to 0. Setting block caching on such a table is a waste of memory and CPU cycles, more so because it will generate more garbage for the JVM to collect. For more information on monitoring GC, see Section 15.2.3, “JVM Garbage Collection Logs”.

  • Mapping a table: In a typical MapReduce job that takes a table as input, every row will be read only once, so there's no need to put them into the block cache. The Scan object has the option of turning this off via the setCacheBlocks method (set it to false); see the sketch after this list. You can still keep block caching turned on for this table if you need fast random read access. An example would be counting the number of rows in a table that serves live traffic: caching every block of that table would create massive churn and would surely evict data that's currently in use.
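
A minimal sketch of such a scan (the setCaching value shown is arbitrary and controls rows per RPC, not block caching):

    Scan scan = new Scan();
    scan.setCaching(500);        // rows fetched per RPC; unrelated to the block cache
    scan.setCacheBlocks(false);  // do not pollute the block cache with one-time reads
    // e.g. pass this Scan to TableMapReduceUtil.initTableMapperJob(...)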

9.6.4.4.1. Caching META blocks only (DATA blocks in fscache)

An interesting setup is one where we cache META blocks only and read DATA blocks in on each access. If the DATA blocks fit inside fscache, this alternative may make sense when access is completely random across a very large dataset. To enable this setup, alter your table and, for each column family, set BLOCKCACHE => 'false'. You are 'disabling' the BlockCache for this column family only; you can never disable the caching of META blocks. Since HBASE-4683 Always cache index and bloom blocks, we cache META blocks even if the BlockCache is disabled.
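
The same change from the Java client might look like the following sketch (table and family names are hypothetical; depending on your version, you may need to disable the table around the schema change):

    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    HTableDescriptor desc = admin.getTableDescriptor(TableName.valueOf("t"));
    HColumnDescriptor family = desc.getFamily(Bytes.toBytes("f"));
    family.setBlockCacheEnabled(false);  // DATA blocks no longer cached; META blocks still are
    admin.modifyColumn("t", family);
    admin.close();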

9.6.4.5. Offheap Block Cache

9.6.4.5.1. How to Enable BucketCache

The usual deploy of BucketCache is via a managing class that sets up two caching tiers: an L1 onheap cache implemented by LruBlockCache and a second L2 cache implemented with BucketCache. The managing class is CombinedBlockCache by default. The just-previous link describes the caching 'policy' implemented by CombinedBlockCache. In short, it works by keeping meta blocks -- INDEX and BLOOM -- in the L1, onheap LruBlockCache tier, while DATA blocks are kept in the L2, BucketCache tier. Since HBase 1.0, it is possible to amend this behavior and ask that a column family have both its meta and DATA blocks hosted onheap in the L1 tier, by setting cacheDataInL1 via HColumnDescriptor.setCacheDataInL1(true), or in the shell by creating or altering the column family with CACHE_DATA_IN_L1 set to true: e.g.

hbase(main):003:0> create 't', {NAME => 't', CONFIGURATION => {CACHE_DATA_IN_L1 => 'true'}}
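
The equivalent from the Java client (HBase 1.0 or later), as a brief sketch with a hypothetical family name:

    HColumnDescriptor family = new HColumnDescriptor("f");
    family.setCacheDataInL1(true);  // keep this family's DATA blocks onheap in the L1 LruBlockCache
    // add the family to an HTableDescriptor and create or modify the table via HBaseAdmin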

The BucketCache Block Cache can be deployed onheap, offheap, or file based. You set which via the hbase.bucketcache.ioengine setting. Setting it to heap will have BucketCache deployed inside the allocated Java heap. Setting it to offheap will have BucketCache make its allocations offheap, and an ioengine setting of file:PATH_TO_FILE will direct BucketCache to use file-based caching (useful in particular if you have some fast I/O attached to the box, such as SSDs).

It is possible to deploy an L1+L2 setup where we bypass the CombinedBlockCache policy and have BucketCache working as a strict L2 cache to the L1 LruBlockCache. For such a setup, set CacheConfig.BUCKET_CACHE_COMBINED_KEY to false. In this mode, on eviction from L1, blocks go to L2. When a block is cached, it is cached first in L1. When we go to look for a cached block, we look first in L1 and, if none is found, then search L2. Let us call this deploy format Raw L1+L2.

Other BucketCache configs include: specifying a location to persist the cache to across restarts, how many threads to use writing the cache, etc. See the CacheConfig class for configuration options and descriptions.

Procedure 9.1. BucketCache Example Configuration

This sample provides a configuration for a 4 GB offheap BucketCache with a 1 GB onheap cache. Configuration is performed on the RegionServer. Setting hbase.bucketcache.ioengine and hbase.bucketcache.size > 0 enables CombinedBlockCache. Let us presume that the RegionServer has been set to run with a 5G heap: i.e. HBASE_HEAPSIZE=5g.

  1. First, edit the RegionServer's hbase-env.sh and set -XX:MaxDirectMemorySize to a value greater than the offheap size wanted, in this case 4 GB (expressed as 4G). Let's set it to 5G: that will be 4G for our offheap cache and 1G for any other uses of offheap memory (there are other users of offheap memory besides the BlockCache; e.g. the DFSClient in the RegionServer can make use of offheap memory). See Direct Memory Usage In HBase.

    -XX:MaxDirectMemorySize=5G
  2. Next, add the following configuration to the RegionServer's hbase-site.xml.

    <property>
      <name>hbase.bucketcache.ioengine</name>
      <value>offheap</value>
    </property>
    <property>
      <name>hfile.block.cache.size</name>
      <value>0.2</value>
    </property>
    <property>
      <name>hbase.bucketcache.size</name>
      <value>4096</value>
    </property>
              
  3. Restart or rolling restart your cluster, and check the logs for any issues.

In the above, we set the BucketCache to be 4G. We configured the onheap LruBlockCache to have 0.2 of the RegionServer's heap size (0.2 * 5G = 1G). In other words, you configure the L1 LruBlockCache as you would normally, as you would when there is no L2 BucketCache present.

HBASE-10641 introduced the ability to configure multiple sizes for the buckets of the BucketCache, in HBase 0.98 and newer. To configure multiple bucket sizes, set the new property hfile.block.cache.sizes (instead of hfile.block.cache.size) to a comma-separated list of block sizes, ordered from smallest to largest, with no spaces. The goal is to optimize the bucket sizes based on your data access patterns. The following example configures buckets of size 4096 and 8192.

<property>
  <name>hfile.block.cache.sizes</name>
  <value>4096,8192</value>
</property>
              

Direct Memory Usage In HBase

The default maximum direct memory varies by JVM. Traditionally it is 64M, some relation to the allocated heap size (-Xmx), or no limit at all (JDK7, apparently). HBase servers use direct memory; in particular, with short-circuit reading enabled, the hosted DFSClient will allocate direct memory buffers. If you do offheap block caching, you'll be making use of direct memory. When starting your JVM, make sure the -XX:MaxDirectMemorySize setting in conf/hbase-env.sh is set to some value that is higher than what you have allocated to your offheap blockcache (hbase.bucketcache.size). It should be larger than your offheap block cache and then some for DFSClient usage (how much the DFSClient uses is not easy to quantify; it is the number of open hfiles * hbase.dfs.client.read.shortcircuit.buffer.size, where hbase.dfs.client.read.shortcircuit.buffer.size is set to 128k in HBase -- see hbase-default.xml default configurations). Direct memory is part of the Java process's memory footprint but is separate from the object heap allocated by -Xmx. The value allocated by MaxDirectMemorySize must not exceed physical RAM, and is likely to be less than the total available RAM due to other memory requirements and system constraints.

You can see how much memory -- onheap and offheap/direct -- a RegionServer is configured to use, and how much it is using at any one time, by looking at the Server Metrics: Memory tab in the UI. It can also be retrieved via JMX; in particular, the direct memory currently used by the server can be found on the java.nio:type=BufferPool,name=direct bean. Terracotta has a good write-up on using offheap memory in Java. It is for their product BigMemory, but a lot of the issues noted apply in general to any attempt at going offheap. Check it out.
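
For example, the direct-memory figures can also be read programmatically through the standard java.lang.management API (plain Java, independent of HBase):

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;

    public class DirectMemoryReport {
      public static void main(String[] args) {
        // The "direct" pool tracks memory allocated via ByteBuffer.allocateDirect().
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
          if ("direct".equals(pool.getName())) {
            System.out.println("direct buffers: count=" + pool.getCount()
                + " used=" + pool.getMemoryUsed() + " capacity=" + pool.getTotalCapacity());
          }
        }
      }
    }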

hbase.bucketcache.percentage.in.combinedcache

This is a pre-HBase 1.0 configuration that was removed because it was confusing. It was a float which you would set to some value between 0.0 and 1.0. Its default was 0.9. If the deploy was using CombinedBlockCache, then the LruBlockCache L1 size was calculated to be (1 - hbase.bucketcache.percentage.in.combinedcache) * size-of-bucket-cache, and the BucketCache size was hbase.bucketcache.percentage.in.combinedcache * size-of-bucket-cache, where size-of-bucket-cache itself is EITHER the value of the configuration hbase.bucketcache.size IF it was specified in megabytes, OR hbase.bucketcache.size * -XX:MaxDirectMemorySize if hbase.bucketcache.size is between 0 and 1.0.

In 1.0, this is more straightforward. The L1 LruBlockCache size is set as a fraction of the Java heap using the hfile.block.cache.size setting (not the best name), and the L2 is set as above: either in absolute megabytes or as a fraction of allocated maximum direct memory.

9.6.5. Write Ahead Log (WAL)

9.6.5.1. Purpose

The Write Ahead Log (WAL) records all changes to data in HBase to file-based storage. Under normal operation, the WAL is not needed, because data changes move from the MemStore to StoreFiles. However, if a RegionServer crashes or becomes unavailable before the MemStore is flushed, the WAL ensures that the changes to the data can be replayed. If writing to the WAL fails, the entire operation to modify the data fails.

HBase uses an implementation of the HLog interface for the WAL. Usually, there is only one instance of a WAL per RegionServer. The RegionServer records Puts and Deletes to it, before recording them to the Section 9.7.7.1, “MemStore” for the affected Section 9.7.7, “Store”.

The WAL resides in HDFS in the /hbase/WALs/ directory (prior to HBase 0.94, WALs were stored in /hbase/.logs/), with a subdirectory per RegionServer.

For more general information about the concept of write ahead logs, see the Wikipedia Write-Ahead Log article.

9.6.5.2. WAL Flushing

TODO (describe).

9.6.5.3. WAL Splitting

A RegionServer serves many regions. All of the regions in a region server share the same active WAL file. Each edit in the WAL file includes information about which region it belongs to. When a region is opened, the edits in the WAL file which belong to that region need to be replayed. Therefore, edits in the WAL file must be grouped by region so that particular sets can be replayed to regenerate the data in a particular region. The process of grouping the WAL edits by region is called log splitting. It is a critical process for recovering data if a region server fails.

Log splitting is done by the HMaster during cluster start-up or by the ServerShutdownHandler as a region server shuts down. So that consistency is guaranteed, affected regions are unavailable until data is restored. All WAL edits need to be recovered and replayed before a given region can become available again. As a result, regions affected by log splitting are unavailable until the process completes.

Procedure 9.2. Log Splitting, Step by Step

  1. The /hbase/WALs/<host>,<port>,<startcode> directory is renamed.

    Renaming the directory is important because a RegionServer may still be up and accepting requests even if the HMaster thinks it is down. If the RegionServer does not respond immediately and does not heartbeat its ZooKeeper session, the HMaster may interpret this as a RegionServer failure. Renaming the logs directory ensures that existing, valid WAL files which are still in use by an active but busy RegionServer are not written to by accident.

    The new directory is named according to the following pattern:

    /hbase/WALs/<host>,<port>,<startcode>-splitting

    An example of such a renamed directory might look like the following:

    /hbase/WALs/srv.example.com,60020,1254173957298-splitting
  2. Each log file is split, one at a time.

    The log splitter reads the log file one edit entry at a time and puts each edit entry into the buffer corresponding to the edit’s region. At the same time, the splitter starts several writer threads. Writer threads pick up a corresponding buffer and write the edit entries in the buffer to a temporary recovered edit file. The temporary edit file is stored to disk with the following naming pattern:

    /hbase/<table_name>/<region_id>/recovered.edits/.temp

    This file is used to store all the edits in the WAL log for this region. After log splitting completes, the .temp file is renamed to the sequence ID of the first log written to the file.

    To determine whether all edits have been written, the sequence ID is compared to the sequence of the last edit that was written to the HFile. If the sequence of the last edit is greater than or equal to the sequence ID included in the file name, it is clear that all writes from the edit file have been completed.

  3. After log splitting is complete, each affected region is assigned to a RegionServer.

    When the region is opened, the recovered.edits folder is checked for recovered edits files. If any such files are present, they are replayed by reading the edits and saving them to the MemStore. After all edit files are replayed, the contents of the MemStore are written to disk (HFile) and the edit files are deleted.

9.6.5.3.1. Handling of Errors During Log Splitting

If you set the hbase.hlog.split.skip.errors option to true, errors are treated as follows:

  • Any error encountered during splitting will be logged.

  • The problematic WAL log will be moved into the .corrupt directory under the hbase rootdir.

  • Processing of the WAL will continue.

If the hbase.hlog.split.skip.errors option is set to false, the default, the exception will be propagated and the split will be logged as failed. See HBASE-2958 When hbase.hlog.split.skip.errors is set to false, we fail the split but that's it. We need to do more than just fail split if this flag is set.

9.6.5.3.1.1. How EOFExceptions are treated when splitting a crashed RegionServer's WALs

If an EOFException occurs while splitting logs, the split proceeds even when hbase.hlog.split.skip.errors is set to false. An EOFException while reading the last log in the set of files to split is likely, because the RegionServer is likely to be in the process of writing a record at the time of a crash. For background, see HBASE-2643 Figure how to deal with eof splitting logs

9.6.5.3.2. Performance Improvements during Log Splitting

WAL log splitting and recovery can be resource intensive and take a long time, depending on the number of RegionServers involved in the crash and the size of the regions. Section 9.6.5.3.2.1, “Distributed Log Splitting” and Section 9.6.5.3.2.2, “Distributed Log Replay” were developed to improve performance during log splitting.

9.6.5.3.2.1. Distributed Log Splitting

Distributed Log Splitting was added in HBase version 0.92 (HBASE-1364) by Prakash Khemani from Facebook. It reduces the time to complete log splitting dramatically, improving the availability of regions and tables. For example, recovering a crashed cluster took around 9 hours with single-threaded log splitting, but only about six minutes with distributed log splitting.

The information in this section is sourced from Jimmy Xiang's blog post at http://blog.cloudera.com/blog/2012/07/hbase-log-splitting/.

Enabling or Disabling Distributed Log Splitting. Distributed log processing is enabled by default since HBase 0.92. The setting is controlled by the hbase.master.distributed.log.splitting property, which can be set to true or false, but defaults to true.

Procedure 9.3. Distributed Log Splitting, Step by Step

After configuring distributed log splitting, the HMaster controls the process. The HMaster enrolls each RegionServer in the log splitting process, and the actual work of splitting the logs is done by the RegionServers. The general process for log splitting, as described in Procedure 9.2, “Log Splitting, Step by Step” still applies here.

  1. If distributed log processing is enabled, the HMaster creates a split log manager instance when the cluster is started. The split log manager manages all log files which need to be scanned and split. The split log manager places all the logs into the ZooKeeper splitlog node (/hbase/splitlog) as tasks. You can view the contents of the splitlog by issuing the following zkcli command. Example output is shown.

    ls /hbase/splitlog
    [hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost8.sample.com%2C57020%2C1340474893275-splitting%2Fhost8.sample.com%253A57020.1340474893900, 
    hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost3.sample.com%2C57020%2C1340474893299-splitting%2Fhost3.sample.com%253A57020.1340474893931, 
    hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost4.sample.com%2C57020%2C1340474893287-splitting%2Fhost4.sample.com%253A57020.1340474893946]                  
                      

    The output contains some non-ASCII characters. When decoded, it looks much simpler:

    [hdfs://host2.sample.com:56020/hbase/.logs
    /host8.sample.com,57020,1340474893275-splitting
    /host8.sample.com%3A57020.1340474893900, 
    hdfs://host2.sample.com:56020/hbase/.logs
    /host3.sample.com,57020,1340474893299-splitting
    /host3.sample.com%3A57020.1340474893931, 
    hdfs://host2.sample.com:56020/hbase/.logs
    /host4.sample.com,57020,1340474893287-splitting
    /host4.sample.com%3A57020.1340474893946]                    
                      

    The listing represents WAL file names to be scanned and split, which is a list of log splitting tasks.

  2. The split log manager monitors the log-splitting tasks and workers.

    The split log manager is responsible for the following ongoing tasks:

    • Once the split log manager publishes all the tasks to the splitlog znode, it monitors these task nodes and waits for them to be processed.

    • Checks to see if there are any dead split log workers queued up. If it finds tasks claimed by unresponsive workers, it will resubmit those tasks. If the resubmit fails due to some ZooKeeper exception, the dead worker is queued up again for retry.

    • Checks to see if there are any unassigned tasks. If it finds any, it creates an ephemeral rescan node so that each split log worker is notified to re-scan unassigned tasks via the nodeChildrenChanged ZooKeeper event.

    • Checks for tasks which are assigned but expired. If any are found, they are moved back to TASK_UNASSIGNED state again so that they can be retried. It is possible that these tasks are assigned to slow workers, or they may already be finished. This is not a problem, because log splitting tasks have the property of idempotence. In other words, the same log splitting task can be processed many times without causing any problem.

    • The split log manager watches the HBase split log znodes constantly. If any split log task node data is changed, the split log manager retrieves the node data. The node data contains the current state of the task. You can use the zkcli get command to retrieve the current state of a task. In the example output below, the first line of the output shows that the task is currently unassigned.

      get /hbase/splitlog/hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost6.sample.com%2C57020%2C1340474893287-splitting%2Fhost6.sample.com%253A57020.1340474893945
       
      unassigned host2.sample.com:57000
      cZxid = 0x7115
      ctime = Sat Jun 23 11:13:40 PDT 2012
      ...  
                            

      Based on the state of the task whose data is changed, the split log manager does one of the following:

      • Resubmit the task if it is unassigned

      • Heartbeat the task if it is assigned

      • Resubmit or fail the task if it is resigned (see Reasons a Task Will Fail)

      • Resubmit or fail the task if it is completed with errors (see Reasons a Task Will Fail)

      • Resubmit or fail the task if it could not complete due to errors (see Reasons a Task Will Fail)

      • Delete the task if it is successfully completed or failed

      Reasons a Task Will Fail

      • The task has been deleted.

      • The node no longer exists.

      • The log status manager failed to move the state of the task to TASK_UNASSIGNED.

      • The number of resubmits is over the resubmit threshold.

  3. Each RegionServer's split log worker performs the log-splitting tasks.

    Each RegionServer runs a daemon thread called the split log worker, which does the work to split the logs. The daemon thread starts when the RegionServer starts, and registers itself to watch HBase znodes. If any splitlog znode children change, it notifies a sleeping worker thread to wake up and grab more tasks. If a worker's current task's node data is changed, the worker checks to see if the task has been taken by another worker. If so, the worker thread stops work on the current task.

    The worker monitors the splitlog znode constantly. When a new task appears, the split log worker retrieves the task paths and checks each one until it finds an unclaimed task, which it attempts to claim. If the claim was successful, it attempts to perform the task and updates the task's state property based on the splitting outcome. At this point, the split log worker scans for another unclaimed task.

    How the Split Log Worker Approaches a Task

    • It queries the task state and only takes action if the task is in TASK_UNASSIGNED state.

    • If the task is in TASK_UNASSIGNED state, the worker attempts to set the state to TASK_OWNED by itself. If it fails to set the state, another worker will try to grab it. The split log manager will also ask all workers to rescan later if the task remains unassigned.

    • If the worker succeeds in taking ownership of the task, it tries to get the task state again to make sure it really got it, asynchronously. In the meantime, it starts a split task executor to do the actual work:

      • Get the HBase root folder, create a temp folder under the root, and split the log file to the temp folder.

      • If the split was successful, the task executor sets the task to state TASK_DONE.

      • If the worker catches an unexpected IOException, the task is set to state TASK_ERR.

      • If the worker is shutting down, set the task to state TASK_RESIGNED.

      • If the task is taken by another worker, just log it.

  4. The split log manager monitors for uncompleted tasks.

    The split log manager returns when all tasks are completed successfully. If all tasks are completed but some failed, the split log manager throws an exception so that the log splitting can be retried. Due to an asynchronous implementation, in very rare cases, the split log manager loses track of some completed tasks. For that reason, it periodically checks for remaining uncompleted tasks in its task map or ZooKeeper. If none are found, it throws an exception so that the log splitting can be retried right away instead of hanging there waiting for something that won't happen.

9.6.5.3.2.2. Distributed Log Replay

After a RegionServer fails, its regions are assigned to other RegionServers, where they are marked as "recovering" in ZooKeeper. A split log worker directly replays edits from the WAL of the failed RegionServer to the regions at their new locations. When a region is in "recovering" state, it can accept writes, but no reads (including Append and Increment), region splits, or merges.

Distributed Log Replay extends the Section 9.6.5.3.2.1, “Distributed Log Splitting” framework. It works by directly replaying WAL edits to another RegionServer instead of creating recovered.edits files. It provides the following advantages over distributed log splitting alone:

  • It eliminates the overhead of writing and reading a large number of recovered.edits files. It is not unusual for thousands of recovered.edits files to be created and written concurrently during a RegionServer recovery. Many small random writes can degrade overall system performance.

  • It allows writes even when a region is in recovering state. It only takes seconds for a recovering region to accept writes again.

Enabling Distributed Log Replay. To enable distributed log replay, set hbase.master.distributed.log.replay to true. This will be the default for HBase 0.99 (HBASE-10888).

You must also enable HFile version 3 (which is the default HFile format starting in HBase 0.99. See HBASE-10855). Distributed log replay is unsafe for rolling upgrades.

9.6.5.4. Disabling the WAL

It is possible to disable the WAL, to improve performance in certain specific situations. However, disabling the WAL puts your data at risk. The only situation where this is recommended is during a bulk load, because, in the event of a problem, the bulk load can be re-run with no risk of data loss.

The WAL is disabled on a per-Mutation basis using the durability setting. Use the Mutation.setDurability(Durability.SKIP_WAL) and Mutation.getDurability() methods to set and get this value. There is no way to disable the WAL for only a specific table.
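
For example, skipping the WAL for a single Put might look like this sketch (row, family, and qualifier names are hypothetical; 'table' is an already-opened HTableInterface):

    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("value"));
    put.setDurability(Durability.SKIP_WAL);  // this mutation will not be recorded in the WAL
    table.put(put);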

Warning

If you disable the WAL for anything other than bulk loads, your data is at risk.
