HBase has two run modes: Section 2.2.1, “Standalone HBase” and Section 2.2.2, “Distributed”. Out of the box, HBase runs in standalone mode. Whatever your mode,
you will need to configure HBase by editing files in the HBase conf
directory. At a minimum, you must edit conf/hbase-env.sh to tell HBase which Java to use. In this file you set HBase environment variables such as the heap size and other options for the JVM, the preferred location for log files, and so on. Set JAVA_HOME to point at the root of your Java installation.
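For example, a minimal conf/hbase-env.sh might contain lines like the following; the JAVA_HOME path and heap size are illustrative and should be adjusted for your system:

# The java implementation to use (path is illustrative; point it at your own install).
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk
# The maximum amount of heap to use, in MB (optional; shown here only as an example).
export HBASE_HEAPSIZE=4096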
This is the default mode. Standalone mode is what is described in the Section 1.2, “Quick Start - Standalone HBase” section. In standalone mode, HBase does not use HDFS -- it uses the local filesystem instead -- and it runs all HBase daemons and a local ZooKeeper in the same JVM. ZooKeeper binds to a well-known port so clients may talk to HBase.
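As a sketch, a standalone setup usually needs nothing more than hbase.rootdir (and, optionally, the ZooKeeper data directory) pointed at the local filesystem in conf/hbase-site.xml; the paths below are assumptions for illustration:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/testuser/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/testuser/zookeeper</value>
  </property>
</configuration>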
Distributed mode can be subdivided into distributed but all daemons run on a single node -- a.k.a. pseudo-distributed -- and fully-distributed where the daemons are spread across all nodes in the cluster. The pseudo-distributed vs. fully-distributed nomenclature comes from Hadoop.
Pseudo-distributed mode can run against the local filesystem or it can run against an instance of the Hadoop Distributed File System (HDFS). Fully-distributed mode can ONLY run on HDFS. See the Hadoop requirements and instructions for how to set up HDFS for Hadoop 1.x. A good walk-through for setting up HDFS on Hadoop 2 is at http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide.
Below we describe the different distributed setups. Starting, verifying, and exploring your install, whether a pseudo-distributed or fully-distributed configuration, is described in a section that follows, Section 2.3, “Running and Confirming Your Installation”. The same verification script applies to both deploy types.
A quickstart has been added to the Section 1.2, “Quick Start - Standalone HBase” chapter. See Section 1.2.3, “Intermediate - Pseudo-Distributed Local Install”. Some of the information that was originally in this section has been moved there.
Pseudo-distributed mode is simply a fully-distributed mode run on a single host. Use this configuration for testing and prototyping on HBase. Do not use this configuration for production or for evaluating HBase performance.
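As a sketch, a pseudo-distributed setup running against a local HDFS instance might use an hbase-site.xml along these lines; the NameNode host and port are assumptions for illustration:

<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:8020/hbase</value>
  </property>
</configuration>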
By default, HBase runs in standalone mode. Both standalone mode and pseudo-distributed mode are provided for the purposes of small-scale testing. For a production environment, distributed mode is appropriate. In distributed mode, multiple instances of HBase daemons run on multiple servers in the cluster.
Just as in pseudo-distributed mode, a fully distributed configuration requires that you set the hbase.cluster.distributed property to true. Typically, the hbase.rootdir is configured to point to a highly-available HDFS filesystem.
In addition, the cluster is configured so that multiple cluster nodes enlist as RegionServers, ZooKeeper QuorumPeers, and backup HMaster servers. These configuration basics are all demonstrated in Section 1.2.4, “Advanced - Fully Distributed”.
Distributed RegionServers. Typically, your cluster will contain multiple RegionServers all running on different
servers, as well as primary and backup Master and ZooKeeper daemons. The
conf/regionservers
file on the master server contains a list of
hosts whose RegionServers are associated with this cluster. Each host is on a separate
line. All hosts listed in this file will have their RegionServer processes started and
stopped when the master server starts or stops.
ZooKeeper and HBase. See Chapter 20, ZooKeeper for instructions on setting up ZooKeeper for HBase.
Example 2.2. Example Distributed HBase Cluster
This is a bare-bones conf/hbase-site.xml
for a distributed HBase
cluster. A cluster that is used for real-world work would contain more custom
configuration parameters. Most HBase configuration directives have default values, which
are used unless the value is overridden in the hbase-site.xml. See Section 2.4, “Configuration Files” for more information.
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode.example.org:8020/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>node-a.example.com,node-b.example.com,node-c.example.com</value>
  </property>
</configuration>
This is an example conf/regionservers
file, which contains a list
of each node that should run a RegionServer in the cluster. These nodes need HBase
installed and they need to use the same contents of the conf/
directory as the Master server.
node-a.example.com
node-b.example.com
node-c.example.com
This is an example conf/backup-masters
file, which contains a
list of each node that should run a backup Master instance. The backup Master instances
will sit idle unless the main Master becomes unavailable.
node-b.example.com
node-c.example.com
Distributed HBase Quickstart. See Section 1.2.4, “Advanced - Fully Distributed” for a walk-through of a simple three-node cluster configuration with multiple ZooKeeper, backup HMaster, and RegionServer instances.
Procedure 2.1. HDFS Client Configuration
Of note, if you have made HDFS client configuration changes on your Hadoop cluster, such as configuration directives for HDFS clients as opposed to server-side configurations, you must use one of the following methods to enable HBase to see and use these configuration changes:
Add a pointer to your HADOOP_CONF_DIR to the HBASE_CLASSPATH environment variable in hbase-env.sh.
Add a copy of hdfs-site.xml (or hadoop-site.xml) or, better, symlinks, under ${HBASE_HOME}/conf, or if only a small set of HDFS client configurations, add them to hbase-site.xml.
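As a sketch of the first method, the entry in hbase-env.sh might look like the following; the Hadoop configuration path is an assumption for illustration:

# Make the Hadoop client configuration visible to HBase (path is illustrative).
export HBASE_CLASSPATH=${HBASE_CLASSPATH}:/etc/hadoop/conf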
An example of such an HDFS client configuration is dfs.replication. If, for example, you want to run with a replication factor of 5, HBase will create files with the default replication factor of 3 unless you do the above to make the configuration available to HBase.
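For instance, following the last method above, the dfs.replication setting could be added to hbase-site.xml as a property block like the one below (the value 5 is simply the figure from the example above):

<property>
  <name>dfs.replication</name>
  <value>5</value>
</property>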