You can stop an individual RegionServer by running the following script in the HBase directory on the particular node:
$ ./bin/hbase-daemon.sh stop regionserver
The RegionServer will first close all regions and then shut itself down. On shutdown, the RegionServer's ephemeral node in ZooKeeper will expire. The master will notice the RegionServer gone and will treat it as a 'crashed' server; it will reassign the regions the RegionServer was carrying.
If the load balancer runs while a node is shutting down, then there could be contention between the Load Balancer and the Master's recovery of the just decommissioned RegionServer. Avoid any problems by disabling the balancer first. See Load Balancer below.
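For example, you can disable the balancer from the HBase shell before issuing the stop. The piped, non-interactive form below is just one way to do it; an interactive hbase shell session works equally well:
$ echo "balance_switch false" | ./bin/hbase shell
$ ./bin/hbase-daemon.sh stop regionserver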
In hbase-2.0, we added a script named considerAsDead.sh to the bin directory that can be used to kill a RegionServer. Hardware issues can be detected by specialized monitoring tools before the ZooKeeper timeout has expired. considerAsDead.sh is a simple function to mark a RegionServer as dead: it deletes all the znodes of the server, starting the recovery process. Plug the script into your monitoring/fault-detection tools to initiate faster failover. Be careful how you use this disruptive tool. Copy the script if you need to make use of it in a version of HBase previous to hbase-2.0.
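As an illustration, a monitoring or fault-detection hook might invoke it roughly as follows; the hostname is a placeholder, and the --hostname option reflects the script as shipped in hbase-2.0, so check the usage your copy of the script prints before relying on it:
$ ./bin/considerAsDead.sh --hostname rs1.example.com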
A downside to the above stop of a RegionServer is that regions could be offline for a good period of time. Regions are closed in order. If there are many regions on the server, the first region to close may not be back online until all regions close and the master notices the RegionServer's znode gone. In Apache HBase 0.90.2, we added a facility for having a node gradually shed its load and then shut itself down. Apache HBase 0.90.2 added the graceful_stop.sh script. Here is its usage:
$ ./bin/graceful_stop.sh
Usage: graceful_stop.sh [--config <conf-dir>] [--restart] [--reload] [--thrift] [--rest] <hostname>
 thrift      If we should stop/start thrift before/after the hbase stop/start
 rest        If we should stop/start rest before/after the hbase stop/start
 restart     If we should restart after graceful stop
 reload      Move offloaded regions back on to the stopped server
 debug       Move offloaded regions back on to the stopped server
 hostname    Hostname of server we are to stop
To decommission a loaded RegionServer, run the following:
$ ./bin/graceful_stop.sh HOSTNAME
where HOSTNAME is the host carrying the RegionServer you would like to decommission.
The HOSTNAME passed to graceful_stop.sh must match the hostname that HBase is using to identify RegionServers. Check the list of RegionServers in the Master UI for how HBase is referring to servers. It's usually a hostname but can also be an FQDN. Whatever HBase is using, this is what you should pass to the graceful_stop.sh decommission script. If you pass IPs, the script is not yet smart enough to make a hostname (or FQDN) of it, so it will fail when it checks whether the server is currently running; the graceful unloading of regions will not run.
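If you prefer the command line to the Master UI, the shell's status command shows the exact server names HBase is using. A minimal, non-interactive way to check (one of several) is:
$ echo "status 'simple'" | ./bin/hbase shell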
The graceful_stop.sh script will move the regions off the decommissioned RegionServer one at a time to minimize region churn. It will verify the region deployed in the new location before it moves the next region, and so on, until the decommissioned server is carrying zero regions. At this point, graceful_stop.sh tells the RegionServer to stop. The master will notice the RegionServer gone, but all regions will have already been redeployed, and because the RegionServer went down cleanly, there will be no WAL logs to split.
It is assumed that the Region Load Balancer is disabled while the graceful_stop script runs (otherwise the balancer and the decommission script will end up fighting over region deployments). Use the shell to disable the balancer:
hbase(main):001:0> balance_switch false
true
0 row(s) in 0.3590 seconds
This turns the balancer OFF. To reenable, do:
hbase(main):001:0> balance_switch true
false
0 row(s) in 0.3590 seconds
The graceful_stop script will check the balancer and, if it is enabled, will turn it off before it goes to work. If it exits prematurely because of an error, it will not have reset the balancer. Hence, it is better to manage the balancer yourself, apart from graceful_stop, re-enabling it after you are done with graceful_stop.
If you have a large cluster, you may want to decommission more than one machine at a
time by gracefully stopping multiple RegionServers concurrently. To gracefully drain
multiple regionservers at the same time, RegionServers can be put into a "draining" state.
This is done by marking a RegionServer as a draining node by creating an entry in
ZooKeeper under the hbase_root/draining
znode. This znode has format
name,port,startcode
just like the regionserver entries under
hbase_root/rs
znode.
Without this facility, decommissioning multiple nodes may be non-optimal because regions that are being drained from one RegionServer may be moved to other RegionServers that are also draining. Marking RegionServers to be in the draining state prevents this from happening. See this blog post for more details.
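For illustration, one way to create such an entry is with the ZooKeeper CLI that ships with HBase. The znode root (/hbase by default) and the server name below are placeholders; use the exact name,port,startcode string listed for the server under the rs znode or in the Master UI:
$ ./bin/hbase zkcli
ls /hbase/rs
create /hbase/draining/rs1.example.com,16020,1476410886620 ""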
It is good having Section 2.6.2.2.1, “dfs.datanode.failed.volumes.tolerated” set if you have a decent number of disks per machine for the case where a disk plain dies. But usually disks do the "John Wayne" -- i.e. take a while to go down, spewing errors in dmesg -- or, for some reason, run much slower than their companions. In this case you want to decommission the disk. You have two options. You can decommission the DataNode or, less disruptive in that only the bad disk's data will be re-replicated, you can stop the DataNode, unmount the bad volume (you can't umount a volume while the DataNode is using it), and then restart the DataNode (presuming you have set dfs.datanode.failed.volumes.tolerated > 0). The RegionServer will throw some errors in its logs as it recalibrates where to get its data from -- it will likely roll its WAL log too -- but in general, apart from some latency spikes, it should keep on chugging.
If you are doing short-circuit reads, you will have to move the regions off the RegionServer before you stop the DataNode: with short-circuit reads, even though the block files are chmod'd so the RegionServer cannot open them anew, because it already has the files open it will be able to keep reading the file blocks from the bad disk even though the DataNode is down. Move the regions back after you restart the DataNode.
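A rough sketch of the second option follows. The daemon commands assume a Hadoop 3 layout (on Hadoop 2 the equivalent is hadoop-daemon.sh stop/start datanode), and /data/d3 is a hypothetical mount point for the bad volume:
$ $HADOOP_HOME/bin/hdfs --daemon stop datanode
$ sudo umount /data/d3
$ $HADOOP_HOME/bin/hdfs --daemon start datanode
If you use short-circuit reads, move regions off before the stop and back after the restart, as described above.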
Some cluster configuration changes require either the entire cluster, or the RegionServers, to be restarted in order to pick up the changes. In addition, rolling restarts are supported for upgrading to a minor or maintenance release, and to a major release if at all possible. See the release notes for the release you want to upgrade to, to find out about limitations to the ability to perform a rolling upgrade.
There are multiple ways to restart your cluster nodes, depending on your situation. These methods are detailed below.
HBase ships with a script, bin/rolling-restart.sh
, that allows
you to perform rolling restarts on the entire cluster, the master only, or the
RegionServers only. The script is provided as a template for your own script, and is not
explicitly tested. It requires password-less SSH login to be configured and assumes that
you have deployed using a tarball. The script requires you to set some environment
variables before running it. Examine the script and modify it to suit your needs.
Example 17.1. rolling-restart.sh
General Usage
$ ./bin/rolling-restart.sh --help
Usage: rolling-restart.sh [--config <hbase-confdir>] [--rs-only] [--master-only] [--graceful] [--maxthreads xx]
To perform a rolling restart on the RegionServers only, use the
--rs-only
option. This might be necessary if you need to reboot the
individual RegionServer or if you make a configuration change that only affects
RegionServers and not the other HBase processes.
If you need to restart only a single RegionServer, or if you need to do extra
actions during the restart, use the bin/graceful_stop.sh
command instead. See Section 17.3.2.2, “Manual Rolling Restart”.
To perform a rolling restart on the active and backup Masters, use the
--master-only
option. You might use this if you know that your
configuration change only affects the Master and not the RegionServers, or if you
need to restart the server where the active Master is running.
If you are not running backup Masters, the Master is simply restarted. If you are running backup Masters, they are all stopped before any are restarted, to avoid a race condition in ZooKeeper to determine which is the new Master. First the main Master is restarted, then the backup Masters are restarted. Directly after restart, it checks for and cleans out any regions in transition before taking on its normal workload.
If you specify the --graceful
option, RegionServers are restarted
using the bin/graceful_stop.sh
script, which moves regions off
a RegionServer before restarting it. This is safer, but can delay the
restart.
To limit the rolling restart to using only a specific number of threads, use the
--maxthreads
option.
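For example, to restart only the RegionServers, draining each one gracefully and running at most two restarts in parallel, you might combine the options like this (the thread count is only an illustration; check --help for the combinations your version accepts):
$ ./bin/rolling-restart.sh --rs-only --graceful --maxthreads 2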
To retain more control over the process, you may wish to manually do a rolling restart across your cluster. This uses the graceful_stop.sh command described in Section 17.3.1, “Node Decommission”. In this method, you can restart each RegionServer individually and then move its old regions back into place, retaining locality. If you also need to restart the Master, you need to do it separately, and restart the Master before restarting the RegionServers using this method. The following is an example of such a command. You may need to tailor it to your environment. This script does a rolling restart of RegionServers only, and it disables the load balancer before moving the regions.
$ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &
Monitor the output of the /tmp/log.txt
file to follow the
progress of the script.
Use the following guidelines if you want to create your own rolling restart script.
Extract the new release, verify its configuration, and synchronize it to all nodes of your cluster using rsync, scp, or another secure synchronization mechanism.
Use the hbck utility to ensure that the cluster is consistent.
$ ./bin/hbase hbck
Perform repairs if required. See Section 17.1.4, “HBase hbck” for details.
Restart the master first. You may need to modify these commands if your new HBase directory is different from the old one, such as for an upgrade.
$ ./bin/hbase-daemon.sh stop master; ./bin/hbase-daemon.sh start master
Gracefully restart each RegionServer, using a script such as the following, from the Master.
$ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &> /tmp/log.txt &
If you are running Thrift or REST servers, pass the --thrift or --rest options. For other available options, run the bin/graceful_stop.sh --help command.
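For instance, if each node also runs Thrift and REST gateways, the per-host invocation inside the loop above might look like this:
$ ./bin/graceful_stop.sh --restart --reload --thrift --rest --debug $i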
It is important to drain HBase regions slowly when restarting multiple RegionServers. Otherwise, multiple regions go offline simultaneously and must be reassigned to other nodes, which may also go offline soon. This can negatively affect performance. You can inject delays into the script above, for instance, by adding a Shell command such as sleep. To wait for 5 minutes between each RegionServer restart, modify the above script to the following:
$ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; sleep 5m; done &> /tmp/log.txt &
Restart the Master again, to clear out the dead servers list and re-enable the load balancer.
Run the hbck utility again, to be sure the cluster is consistent.
Adding a new RegionServer in HBase is essentially free; you simply start it like this:
$ ./bin/hbase-daemon.sh start regionserver
and it will register itself with the Master. Ideally you also started a DataNode on the same machine so that the RegionServer can eventually start to have local files. If you rely on ssh to start your daemons, don't forget to add the new hostname in conf/regionservers on the master.
At this point the region server isn't serving data because no regions have moved to it yet. If the balancer is enabled, it will start moving regions to the new RS. On a small/medium cluster this can have a very adverse effect on latency as a lot of regions will be offline at the same time. It is thus recommended to disable the balancer the same way it's done when decommissioning a node and move the regions manually (or even better, using a script that moves them one by one).
The moved regions will all have 0% locality and won't have any blocks in cache, so the RegionServer will have to use the network to serve requests. Apart from resulting in higher latency, it may also use all of your network card's capacity. For practical purposes, consider that a standard 1GigE NIC won't be able to read much more than 100MB/s. In this case, or if you are in an OLAP environment and require having locality, then it is recommended to major compact the moved regions.
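As an illustration, individual regions can be moved to the new server and then major compacted from the HBase shell. The encoded region name, server name, and table name below are placeholders; a region's encoded name is shown in the Master UI and in the output of the shell's status 'detailed' command:
hbase(main):001:0> move 'b713d4d0794b0a9f5b7e1c3e33f4ac5c', 'rs-new.example.com,16020,1476410886620'
hbase(main):002:0> major_compact 'my_table'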