Hive Kudu Integration
Overview
Apache Kudu is an open source data storage engine that makes fast analytics on fast and changing data easy.
Implementation
The initial implementation was added to Hive 4.0 in HIVE-12971 and is designed to work with Kudu 1.2+.
There are two main components which make up the implementation: the KuduStorageHandler and the KuduPredicateHandler. The KuduStorageHandler is a Hive StorageHandler implementation. The primary roles of this class are to manage the mapping of a Hive table to a Kudu table and to configure Hive queries. The KuduPredicateHandler is used to push down filter operations to Kudu for more efficient IO.
NOTE: The initial implementation is considered experimental, as there are remaining sub-jiras open to make the implementation more configurable and performant. Currently only external tables pointing at existing Kudu tables are supported. Support for creating and altering underlying Kudu tables is tracked via HIVE-22021. Additionally, full support for UPDATE, UPSERT, and DELETE statements is tracked by HIVE-22027.
Hive Configuration
To issue queries against Kudu using Hive, one optional parameter can be provided via the Hive configuration:
Hive Configuration | Description
---|---
hive.kudu.master.addresses.default | Comma-separated list of all of the Kudu master addresses. This value is only used for a given table if the kudu.master_addresses table property is not set.
For those familiar with Kudu, the master addresses configuration is the normal configuration value necessary to connect to Kudu. The easiest way to provide this value is by using the -hiveconf option to the hive command.
hive -hiveconf hive.kudu.master.addresses.default=localhost:7051
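Alternatively, the property can be set from within a running Hive session with the SET command. This is a sketch under the assumption that the property is not on the restricted configuration list in your deployment and that the session-level value is picked up when the table is accessed.

-- Hypothetical session-level alternative to -hiveconf.
SET hive.kudu.master.addresses.default=localhost:7051;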
Table Creation
To access Kudu tables, a Hive table must be created using the CREATE command with the STORED BY clause. Until HIVE-22021 is completed, the EXTERNAL keyword is required and will create a Hive table that references an existing Kudu table. Dropping the external Hive table will not remove the underlying Kudu table.
CREATE EXTERNAL TABLE kudu_table (foo INT, bar STRING, baz DOUBLE)
STORED BY 'org.apache.hadoop.hive.kudu.KuduStorageHandler'
TBLPROPERTIES (
  "kudu.table_name"="default.kudu_table",
  "kudu.master_addresses"="localhost:7051"
);
In the above statement, normal Hive column name and type pairs are provided, as is the case with normal create table statements. The full KuduStorageHandler class name is provided to inform Hive that Kudu will back this Hive table. A number of TBLPROPERTIES can be provided to configure the KuduStorageHandler. The most important property is kudu.table_name, which tells Hive which Kudu table it should reference. The other common property is kudu.master_addresses, which configures the Kudu master addresses for this table. If the kudu.master_addresses property is not provided, the hive.kudu.master.addresses.default configuration will be used.
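For example, a table definition can omit kudu.master_addresses entirely and rely on the default. This is a minimal sketch, assuming hive.kudu.master.addresses.default has been set as described above; the table name kudu_table_default is illustrative.

-- kudu.master_addresses is omitted here, so the value of
-- hive.kudu.master.addresses.default is used to locate the Kudu cluster.
CREATE EXTERNAL TABLE kudu_table_default (foo INT, bar STRING, baz DOUBLE)
STORED BY 'org.apache.hadoop.hive.kudu.KuduStorageHandler'
TBLPROPERTIES ("kudu.table_name"="default.kudu_table");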
Impala Tables
Because Impala creates tables with the same storage handler metadata in the HiveMetastore, tables created or altered via Impala DDL can be accessed from Hive. This is especially useful until HIVE-22021 is complete and full DDL support is available through Hive. See the Kudu documentation and the Impala documentation for more details.
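As a rough sketch of what such Impala-side DDL can look like (issued through Impala, not Hive; the table name, columns, and partitioning below are illustrative assumptions):

-- Impala DDL that creates a new Kudu table; the resulting storage handler
-- metadata in the HiveMetastore makes the table accessible from Hive as well.
CREATE TABLE impala_kudu_example (
  id BIGINT,
  name STRING,
  PRIMARY KEY (id)
)
PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU;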
Data Ingest
Though it is a common practice to ingest data into Kudu tables via tools like Apache NiFi or Apache Spark and query the data via Hive, data can also be inserted into Kudu tables via Hive INSERT statements. It is important to note that when data is inserted, a Kudu UPSERT operation is actually used to avoid primary key constraint issues. Making this more flexible is tracked via HIVE-22024. Additionally, UPDATE and DELETE operations are not supported. Enabling that functionality is tracked via HIVE-22027.
Examples
INSERT INTO kudu_table SELECT * FROM other_table;

INSERT INTO TABLE kudu_table VALUES (1, 'test 1', 1.1), (2, 'test 2', 2.2);
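Because inserts are translated into Kudu UPSERT operations, re-inserting a row whose primary key already exists replaces that row instead of failing. A small sketch, assuming foo is the primary key column of the underlying Kudu table:

-- Overwrites the existing row with primary key 1 rather than raising a
-- primary key violation, because the insert is applied as a Kudu UPSERT.
INSERT INTO TABLE kudu_table VALUES (1, 'test 1 updated', 1.1);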