Learn how to use the HDFS Handler, which is designed to stream change capture data into the Hadoop Distributed File System (HDFS).
Topics:
HDFS is the primary file system for Big Data. Hadoop is typically installed on multiple machines that work together as a Hadoop cluster. Hadoop allows you to store very large amounts of data that is horizontally scaled across the machines in the cluster. You can then perform analytics on that data using a variety of Big Data applications.
Parent topic: Using the HDFS Handler
The HDFS SequenceFile is a flat file consisting of binary key and value pairs. You can enable writing data in SequenceFile format by setting the gg.handler.name.format property to sequencefile. The key part of the record is set to null, and the actual data is set in the value part. For information about Hadoop SequenceFile, see https://wiki.apache.org/hadoop/SequenceFile.
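For example, a minimal sketch of the handler properties that enable SequenceFile output with optional block compression (the handler name hdfs is illustrative; the compression properties are described in the configuration table later in this section):
gg.handler.hdfs.format=sequencefile
gg.handler.hdfs.compressionType=block
gg.handler.hdfs.compressionCodec=org.apache.hadoop.io.compress.DefaultCodec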
Topics:
The Oracle GoldenGate for Big Data release does not include a Hive storage handler because the HDFS Handler provides all of the necessary Hive functionality.
You can create a Hive integration to create tables and update table definitions in case of DDL events. This is limited to data formatted in Avro Object Container File format. For more information, see Writing in HDFS in Avro Object Container File Format and HDFS Handler Configuration.
For Hive to consume sequence files, the DDL creates Hive tables including STORED as sequencefile. The following is a sample create table script:
CREATE EXTERNAL TABLE table_name (
col1 string,
...
...
col2 string)
ROW FORMAT DELIMITED
STORED as sequencefile
LOCATION '/path/to/hdfs/file';
Note:
If files are intended to be consumed by Hive, then the gg.handler.name.partitionByTable property should be set to true.
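For example, a sketch of the handler settings that pair with the Hive DDL above (the handler name hdfs is illustrative):
gg.handler.hdfs.format=sequencefile
gg.handler.hdfs.partitionByTable=true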
Parent topic: Writing into HDFS in SequenceFile Format
The data is written in the value part of each record in delimited text format. All of the options described in the Using the Delimited Text Formatter section are applicable to the HDFS SequenceFile when writing data to it.
For example:
gg.handler.name.format=sequencefile
gg.handler.name.format.includeColumnNames=true
gg.handler.name.format.includeOpType=true
gg.handler.name.format.includeCurrentTimestamp=true
gg.handler.name.format.updateOpKey=U
Parent topic: Writing into HDFS in SequenceFile Format
To run the HDFS Handler, a Hadoop single instance or Hadoop cluster must be installed, running, and network-accessible from the machine running the HDFS Handler. Apache Hadoop is open source, and you can download it from the Apache Hadoop website.
Follow the Getting Started links for information on how to install a single-node cluster (for pseudo-distributed operation mode) or a clustered setup (for fully-distributed operation mode).
Instructions for configuring the HDFS Handler components and running the handler are described in the following sections.
Parent topic: Using the HDFS Handler
For the HDFS Handler to connect to HDFS and run, the HDFS core-site.xml file and the HDFS client jars must be configured in the gg.classpath variable. The HDFS client jars must match the version of HDFS that the HDFS Handler is connecting to. For a list of the required client jar files by release, see HDFS Handler Client Dependencies.
The default location of the core-site.xml file is Hadoop_Home/etc/hadoop.
The default locations of the HDFS client jars are the following directories:
Hadoop_Home/share/hadoop/common/lib/*
Hadoop_Home/share/hadoop/common/*
Hadoop_Home/share/hadoop/hdfs/lib/*
Hadoop_Home/share/hadoop/hdfs/*
The gg.classpath must be configured exactly as shown. The path to the core-site.xml file must contain the path to the directory containing the core-site.xml file with no wildcard appended. If you include a (*) wildcard in the path to the core-site.xml file, the file is not picked up. Conversely, the path to the dependency jars must include the (*) wildcard character in order to include all of the jar files in that directory in the associated classpath. Do not use *.jar.
The following is an example of a correctly configured gg.classpath variable:
gg.classpath=/ggwork/hadoop/hadoop-2.6.0/etc/hadoop:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/common/lib/*:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/common/*:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/hdfs/*:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/hdfs/lib/*
The HDFS configuration file hdfs-site.xml must also be in the classpath if Kerberos security is enabled. By default, the hdfs-site.xml file is located in the Hadoop_Home/etc/hadoop directory. If the HDFS Handler is not collocated with Hadoop, either or both files can be copied to another machine.
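For example, if the core-site.xml file and the client jars are copied from the Hadoop machine to the machine running the HDFS Handler, the gg.classpath might resemble the following sketch (the local paths are illustrative):
gg.classpath=/home/oggadm/hadoop-conf:/home/oggadm/hadoop-client-jars/*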
Parent topic: Setting Up and Running the HDFS Handler
The following are the configurable values for the HDFS Handler. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
To enable the selection of the HDFS Handler, you must first configure the handler type by specifying gg.handler.name.type=hdfs and the other HDFS properties as follows:
Table 8-1 HDFS Handler Configuration Properties
Property | Optional / Required | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.handlerlist | Required | Any string | None | Provides a name for the HDFS Handler. The HDFS Handler name then becomes part of the property names listed in this table. |
gg.handler.name.type | Required | hdfs | None | Selects the HDFS Handler for streaming change data capture into HDFS. |
gg.handler.name.mode | Optional | op or tx | op | Selects operation (op) mode or transaction (tx) mode for the handler. |
gg.handler.name.maxFileSize | Optional | The default unit of measure is bytes. You can use k, m, or g to specify kilobytes, megabytes, or gigabytes. | 1g | Selects the maximum file size of the created HDFS files. |
gg.handler.name.pathMappingTemplate | Optional | Any legal templated string to resolve the target write directory in HDFS. Templates can contain a mix of constants and keywords which are dynamically resolved at runtime to generate the HDFS write directory. | /ogg/${fullyQualifiedTableName} | You can use keywords interlaced with constants to dynamically generate the HDFS write directory at runtime, see Generating HDFS File Names Using Template Strings. |
gg.handler.name.fileRollInterval | Optional | The default unit of measure is milliseconds. You can stipulate ms, s, m, or h to signify milliseconds, seconds, minutes, or hours. | File rolling on time is off. | The timer starts when an HDFS file is created. If the file is still open when the interval elapses, then the file is closed. A new file is not immediately opened. New HDFS files are created on a just-in-time basis. |
gg.handler.name.inactivityRollInterval | Optional | The default unit of measure is milliseconds. You can use ms, s, m, or h to signify milliseconds, seconds, minutes, or hours. | File inactivity rolling on time is off. | The timer starts from the latest write to an HDFS file. New writes to an HDFS file restart the counter. If the file is still open when the counter elapses, the HDFS file is closed. A new file is not immediately opened. New HDFS files are created on a just-in-time basis. |
gg.handler.name.fileNameMappingTemplate | Optional | A string with resolvable keywords and constants used to dynamically generate HDFS file names at runtime. | ${fullyQualifiedTableName}_${groupName}_${currentTimestamp}.txt | You can use keywords interlaced with constants to dynamically generate unique HDFS file names at runtime, see Generating HDFS File Names Using Template Strings. File names typically follow the format ${fullyQualifiedTableName}_${groupName}_${currentTimestamp}.txt. |
gg.handler.name.partitionByTable | Optional | true or false | true (data is partitioned by table) | Determines whether data written into HDFS is partitioned by table. If set to true, data for different tables is written to different HDFS files. Must be set to true to use the Avro Object Container File Formatter and if the files are to be consumed by Hive. |
gg.handler.name.rollOnMetadataChange | Optional | true or false | true (HDFS files are rolled on a metadata change event) | Determines whether HDFS files are rolled in the case of a metadata change: true means the HDFS file is rolled, false means the HDFS file is not rolled. Must be set to true to use the Avro Object Container File Formatter. |
gg.handler.name.format | Optional | delimitedtext, json, json_row, xml, avro_row, avro_op, avro_row_ocf, avro_op_ocf, or sequencefile | delimitedtext | Selects the formatter for the HDFS Handler that determines how output data is formatted. |
gg.handler.name.includeTokens | Optional | true or false | false | Set to true to include the tokens field and tokens key/values in the output; set to false to suppress tokens output. |
gg.handler.name.partitioner.fully_qualified_table_name (equals one or more column names separated by commas) | Optional | Fully qualified table name and column names must exist. | None | Partitions the data into subdirectories in HDFS in the following format: par_{column name}={column value}. |
gg.handler.name.authType | Optional | kerberos | none | Setting this property to kerberos enables Kerberos authentication. |
gg.handler.name.kerberosKeytabFile | Optional (Required if authType=kerberos) | Relative or absolute path to a Kerberos keytab file. | None | The keytab file allows the HDFS Handler to access a password to perform a kinit operation for Kerberos security. |
gg.handler.name.kerberosPrincipalName | Optional (Required if authType=kerberos) | A legal Kerberos principal name (for example, user/FQDN@MY.REALM). | None | The Kerberos principal name for Kerberos authentication. |
gg.handler.name.schemaFilePath | Optional | A legal path in HDFS. | null | Set to a legal path in HDFS so that schemas (if available) are written in that HDFS directory. Schemas are currently only available for Avro and JSON formatters. In the case of a metadata change event, the schema is overwritten to reflect the schema change. |
gg.handler.name.compressionType (applicable to Sequence File format only) | Optional | block, none, or record | none | Hadoop Sequence File compression type. Applicable only if gg.handler.name.format is set to sequencefile. |
gg.handler.name.compressionCodec (applicable to Sequence File and Avro OCF formats only) | Optional | A fully qualified Hadoop compression codec class name, such as org.apache.hadoop.io.compress.DefaultCodec. | org.apache.hadoop.io.compress.DefaultCodec | Hadoop Sequence File compression codec. Applicable only if gg.handler.name.format is set to sequencefile. |
gg.handler.name.compressionCodec (applicable to Avro OCF format only) | Optional | null or snappy | null | Avro OCF Formatter compression code. This configuration controls the selection of the compression library to be used for Avro OCF files. Snappy includes native binaries in the Snappy JAR file and performs a Java-native traversal when compressing or decompressing. Use of Snappy may introduce runtime issues and platform porting issues that you may not experience when working with Java. You may need to perform additional testing to ensure that Snappy works on all of your required platforms. Snappy is an open source library, so Oracle cannot guarantee its ability to operate on all of your required platforms. |
gg.handler.name.hiveJdbcUrl | Optional | A legal URL for connecting to Hive using the Hive JDBC interface. | null (Hive integration disabled) | Only applicable to the Avro OCF Formatter. This configuration value provides a JDBC URL for connectivity to Hive through the Hive JDBC interface. Use of this property requires that you include the Hive JDBC library in the gg.classpath. Hive JDBC connectivity can be secured through basic credentials, SSL/TLS, or Kerberos. Configuration properties are provided for the user name and password for basic credentials. See the Hive documentation for how to generate a Hive JDBC URL for SSL/TLS or for Kerberos. (If Kerberos is used for Hive JDBC security, it must be enabled for HDFS connectivity. Then the Hive JDBC connection can piggyback on the HDFS Kerberos functionality by using the same Kerberos principal.) |
gg.handler.name.hiveJdbcUsername | Optional | A legal user name if the Hive JDBC connection is secured through credentials. | Java call result from System.getProperty("user.name") | Only applicable to the Avro OCF Formatter. This property is only relevant if the hiveJdbcUrl property is set. |
gg.handler.name.hiveJdbcPassword | Optional | A legal password if the Hive JDBC connection requires a password. | None | Only applicable to the Avro OCF Formatter. This property is only relevant if the hiveJdbcUrl property is set. |
gg.handler.name.hiveJdbcDriver | Optional | The fully qualified Hive JDBC driver class name. | org.apache.hive.jdbc.HiveDriver | Only applicable to the Avro OCF Formatter. This property is only relevant if the hiveJdbcUrl property is set. |
gg.handler.name.openNextFileAtRoll | Optional | true or false | false | Applicable only when the HDFS Handler is not writing an Avro OCF or sequence file, to support extract, load, transform (ELT) situations. When set to true, a new file is immediately opened when the previous file rolls. File rolls can be triggered by any one of the following: a metadata change, the file roll interval elapsing, or the inactivity roll interval elapsing. This property is useful when data files are being loaded into HDFS and a monitor program is monitoring the write directories waiting to consume the data. The monitoring programs use the appearance of a new file as a trigger so that the previous file can be consumed by the consuming application. |
gg.handler.name.hsync | Optional | true or false | false | Set to true to use an hsync call instead of an hflush call when flushing data to HDFS. Setting this property to true provides a stronger guarantee that flushed data has been written to the disks of the HDFS data nodes, at a cost in performance. For most applications, setting this property to false is appropriate. |
Parent topic: Setting Up and Running the HDFS Handler
The following is a sample configuration for the HDFS Handler from the Java Adapter properties file:
gg.handlerlist=hdfs
gg.handler.hdfs.type=hdfs
gg.handler.hdfs.mode=tx
gg.handler.hdfs.includeTokens=false
gg.handler.hdfs.maxFileSize=1g
gg.handler.hdfs.pathMappingTemplate=/ogg/${fullyQualifiedTableName}
gg.handler.hdfs.fileRollInterval=0
gg.handler.hdfs.inactivityRollInterval=0
gg.handler.hdfs.partitionByTable=true
gg.handler.hdfs.rollOnMetadataChange=true
gg.handler.hdfs.authType=none
gg.handler.hdfs.format=delimitedtext
Parent topic: Setting Up and Running the HDFS Handler
The HDFS Handler calls the HDFS flush method on the HDFS write stream to flush data to the HDFS data nodes at the end of each transaction in order to maintain write durability. This is an expensive call, and performance can be adversely affected, especially in the case of transactions of one or few operations, which result in numerous HDFS flush calls.
Performance of the HDFS Handler can be greatly improved by batching multiple small transactions into a single larger transaction. If you require high performance, configure batching functionality for the Replicat process. For more information, see Replicat Grouping.
The HDFS client libraries spawn threads for every HDFS file stream opened by the HDFS Handler. Therefore, the number of threads executing in the JVM grows proportionally to the number of HDFS file streams that are open. Performance of the HDFS Handler may degrade as more HDFS file streams are opened. Configuring the HDFS Handler to write to many HDFS files (due to many source replication tables or extensive use of partitioning) may result in degraded performance. If your use case requires writing to many tables, then Oracle recommends that you enable the roll on time or roll on inactivity features to close HDFS file streams. Closing an HDFS file stream causes the HDFS client threads to terminate, and the associated resources can be reclaimed by the JVM.
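For example, a sketch that uses the roll on time and roll on inactivity features to bound the number of open HDFS file streams (the handler name hdfs and the interval values are illustrative):
gg.handler.hdfs.fileRollInterval=1h
gg.handler.hdfs.inactivityRollInterval=10m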
Parent topic: Setting Up and Running the HDFS Handler
The HDFS cluster can be secured using Kerberos authentication, and the HDFS Handler can connect to a Kerberos-secured cluster. The HDFS core-site.xml file should be in the handler's classpath with the hadoop.security.authentication property set to kerberos and the hadoop.security.authorization property set to true. Additionally, you must set the following properties in the HDFS Handler Java configuration file:
gg.handler.name.authType=kerberos
gg.handler.name.kerberosPrincipalName=legal Kerberos principal name
gg.handler.name.kerberosKeytabFile=path to a keytab file that contains the password for the Kerberos principal so that the HDFS Handler can programmatically perform the Kerberos kinit operations to obtain a Kerberos ticket
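For example, a concrete sketch (the handler name hdfs, the principal name, and the keytab path are illustrative):
gg.handler.hdfs.authType=kerberos
gg.handler.hdfs.kerberosPrincipalName=oggadm@EXAMPLE.COM
gg.handler.hdfs.kerberosKeytabFile=/etc/security/keytabs/oggadm.keytab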
You may find that the HDFS Handler is unable to decrypt the Kerberos password from the keytab file. This causes Kerberos authentication to fall back to interactive mode, which cannot work because the authentication is invoked programmatically. The cause of this problem is that the Java Cryptography Extension (JCE) is not installed in the Java Runtime Environment (JRE). Ensure that the JCE is loaded in the JRE, see http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html.
Parent topic: Setting Up and Running the HDFS Handler
The HDFS Handler includes specialized functionality to write to HDFS in Avro Object Container File (OCF) format. Avro OCF is part of the Avro specification and is detailed in the Avro documentation at:
https://avro.apache.org/docs/current/spec.html#Object+Container+Files
Avro OCF format may be a good choice because it:
integrates with Apache Hive (raw Avro written to HDFS is not supported by Hive), and
provides good support for schema evolution.
Configure the following to enable writing to HDFS in Avro OCF format:
To write row data to HDFS in Avro OCF format, set the gg.handler.name.format=avro_row_ocf property.
To write operation data to HDFS in Avro OCF format, set the gg.handler.name.format=avro_op_ocf property.
The HDFS and Avro OCF integration includes functionality to create the corresponding tables in Hive and update the schema for metadata change events. The configuration section provides information on the properties to enable integration with Hive. The Oracle GoldenGate Hive integration accesses Hive using the JDBC interface, so the Hive JDBC server must be running to enable this integration.
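For example, a sketch combining these settings with the Hive integration (the handler name hdfs and the Hive JDBC URL are illustrative; partitionByTable and rollOnMetadataChange must be true for Avro OCF):
gg.handler.hdfs.format=avro_row_ocf
gg.handler.hdfs.partitionByTable=true
gg.handler.hdfs.rollOnMetadataChange=true
gg.handler.hdfs.hiveJdbcUrl=jdbc:hive2://hivehost:10000/default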
Parent topic: Using the HDFS Handler
The HDFS Handler can dynamically generate HDFS file names using a template string. The template string allows you to mix constants with keywords that are dynamically resolved at runtime, giving you control over generated HDFS file names. You can control the file name template using the gg.handler.name.fileNameMappingTemplate configuration property. The default value for this parameter is:
${fullyQualifiedTableName}_${groupName}_${currentTimestamp}.txt
Supported keywords, which are dynamically replaced at runtime, include the following:
${fullyQualifiedTableName}
The fully qualified table name with a period (.) delimiting the names. For example, oracle.test.table1.
${catalogName}
The catalog name of the source table. For example, oracle.
${schemaName}
The schema name of the source table. For example, test.
${tableName}
The short table name of the source table. For example, table1.
${groupName}
The Replicat process name concatenated with the thread ID if using coordinated apply. For example, HDFS001.
${currentTimestamp}
The current timestamp. The default output format for the date time is yyyy-MM-dd_HH-mm-ss.SSS. For example, 2017-07-05_04-31-23.123. Alternatively, you can configure your own format mask for the date using the syntax ${currentTimestamp[yyyy-MM-dd_HH-mm-ss.SSS]}. Date time format masks follow the convention in the java.text.SimpleDateFormat Java class.
${toLowerCase[]}
Converts the argument inside of the square brackets to lowercase. Keywords can be nested inside of the square brackets as follows:
${toLowerCase[${fullyQualifiedTableName}]}
This is important because source table names are normalized in Oracle GoldenGate to uppercase.
${toUpperCase[]}
Converts the argument inside of the square brackets to uppercase. Keywords can be nested inside of the square brackets.
The following are examples of legal templates and the resolved strings:
Example Template | Replacement |
---|---|
${schemaName}.${tableName}__${groupName}_${currentTimestamp}.txt | test.table1__HDFS001_2017-07-05_04-31-23.123.txt |
${fullyQualifiedTableName}--${currentTimestamp}.avro | oracle.test.table1--2017-07-05_04-31-23.123.avro |
${fullyQualifiedTableName}_${currentTimestamp[yyyy-MM-ddTHH-mm-ss.SSS]}.json | oracle.test.table1_2017-07-05T04-31-23.123.json |
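For example, a sketch setting such a template in the handler configuration (the handler name hdfs and the template itself are illustrative):
gg.handler.hdfs.fileNameMappingTemplate=${toLowerCase[${schemaName}]}_${tableName}_${groupName}_${currentTimestamp}.txt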
Be aware of these restrictions when generating HDFS file names using templates:
Generated HDFS file names must be legal HDFS file names.
Oracle strongly recommends that you use ${groupName} as part of the HDFS file naming template when using coordinated apply and breaking down source table data to different Replicat threads. The group name provides uniqueness of generated HDFS names that ${currentTimestamp} alone does not guarantee. HDFS file name collisions result in an abend of the Replicat process.
Parent topic: Using the HDFS Handler
Metadata change events are handled in the HDFS Handler. The default behavior of the HDFS Handler is to roll the current relevant file in the event of a metadata change. This behavior allows the results of metadata changes to at least be separated into different files. File rolling on metadata change is configurable and can be turned off.
To support metadata change events, the process capturing changes in the source database must support both DDL changes and metadata in trail. Oracle GoldenGate does not support DDL replication for all database implementations. See the Oracle GoldenGate installation and configuration guide for the appropriate database to determine whether DDL replication is supported.
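As noted above, file rolling on metadata change can be turned off. A sketch (the handler name hdfs is illustrative):
gg.handler.hdfs.rollOnMetadataChange=false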
Parent topic: Using the HDFS Handler
The HDFS Handler supports partitioning of table data by one or more column values. The configuration syntax to enable partitioning is the following:
gg.handler.name.partitioner.fully_qualified_table_name=one or more column names separated by commas
Consider the following example:
gg.handler.hdfs.partitioner.dbo.orders=sales_region
This example can result in the following breakdown of files in HDFS:
/ogg/dbo.orders/par_sales_region=west/data files
/ogg/dbo.orders/par_sales_region=east/data files
/ogg/dbo.orders/par_sales_region=north/data files
/ogg/dbo.orders/par_sales_region=south/data files
You should exercise care when choosing columns for partitioning. The key is to choose columns that contain only a few (10 or fewer) possible values, and to make sure that those values are also helpful for grouping and analyzing the data. For example, a column of sales regions would be good for partitioning, while a column that contains the customers' dates of birth would not. Configuring partitioning on a column that has many possible values can cause problems. A poor choice can result in hundreds of HDFS file streams being opened, and performance may degrade for the reasons discussed in Performance Considerations. Additionally, poor partitioning can result in problems during data analysis. Apache Hive requires that all where clauses specify partition criteria if the Hive data is partitioned.
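For example, partitioning the same table by two columns is a matter of listing both columns (a sketch; the order_type column is illustrative):
gg.handler.hdfs.partitioner.dbo.orders=sales_region,order_type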
Parent topic: Using the HDFS Handler
The Oracle HDFS Handler requires certain HDFS client libraries to be resolved in its classpath as a prerequisite for streaming data to HDFS.
For a list of required client JAR files by version, see HDFS Handler Client Dependencies. The HDFS client jars do not ship with the Oracle GoldenGate for Big Data product. The HDFS Handler supports multiple versions of HDFS, and the HDFS client jars must be the same version as the HDFS version to which the HDFS Handler is connecting. The HDFS client jars are open source and are freely available to download from sites such as the Apache Hadoop site or the Maven Central repository.
In order to establish connectivity to HDFS, the HDFS core-site.xml file must be in the classpath of the HDFS Handler. If the core-site.xml file is not in the classpath, the HDFS client code defaults to a mode that attempts to write to the local file system. Writing to the local file system instead of HDFS can be advantageous for troubleshooting, building a proof of concept (POC), or as a step in the process of building an HDFS integration.
Another common issue is that data streamed to HDFS using the HDFS Handler may not be immediately available to Big Data analytic tools such as Hive. This behavior commonly occurs when the HDFS Handler is in possession of an open write stream to an HDFS file. HDFS writes in blocks of 128 MB by default. HDFS blocks under construction are not always visible to analytic tools. Additionally, inconsistencies between file sizes when using the -ls, -cat, and -get commands in the HDFS shell may occur. This is an anomaly of HDFS streaming and is discussed in the HDFS specification. This anomaly leads to a potential 128 MB per file blind spot in analytic data. This may not be an issue if you have a steady stream of replication data and do not require low levels of latency for analytic data from HDFS. However, this may be a problem in some use cases, because closing the HDFS write stream finalizes the block writing: data is immediately visible to analytic tools, and file sizing metrics become consistent again. Therefore, the file rolling feature in the HDFS Handler can be used to close HDFS write streams, making all data visible.
Important:
The file rolling solution may present its own problems. Extensive use of file rolling can result in many small files in HDFS. Many small files in HDFS may result in performance issues in analytic tools.
You may also notice the HDFS inconsistency problem in the following scenarios:
The HDFS Handler process crashes.
A forced shutdown is called on the HDFS Handler process.
A network outage or other issue causes the HDFS Handler process to abend.
In each of these scenarios, it is possible for the HDFS Handler to end without explicitly closing the HDFS write stream and finalizing the write block. HDFS ultimately recognizes that the write stream has been broken and finalizes the write block internally. In this scenario, you may experience a short-term delay before the HDFS process finalizes the write block.
Parent topic: Using the HDFS Handler
It is considered a Big Data best practice for the HDFS cluster to operate on dedicated servers called cluster nodes. Edge nodes are server machines that host the applications that stream data to and retrieve data from the HDFS cluster nodes. Because the HDFS cluster nodes and the edge nodes are different servers, the following benefits are seen:
The HDFS cluster nodes do not compete for resources with the applications interfacing with the cluster.
The requirements for the HDFS cluster nodes and edge nodes probably differ. This physical topology allows the appropriate hardware to be tailored to specific needs.
It is a best practice for the HDFS Handler to be installed and running on an edge node and to stream data to the HDFS cluster using a network connection. The HDFS Handler can run on any machine that has network visibility to the HDFS cluster. The installation of the HDFS Handler on an edge node requires that the core-site.xml file and the dependency jars be copied to the edge node so that the HDFS Handler can access them. The HDFS Handler can also run collocated on an HDFS cluster node if required.
Parent topic: Using the HDFS Handler
Troubleshooting of the HDFS Handler begins with the contents of the Java log4j log file. Follow the directions in the Java Logging Configuration to configure the runtime to correctly generate the Java log4j log file.
Topics:
Parent topic: Using the HDFS Handler
Problems with the Java classpath are common. The usual indication of a Java classpath problem is a ClassNotFoundException in the Java log4j log file. Setting the log level to DEBUG causes each of the jars referenced in the gg.classpath object to be logged to the log file. In this way, you can ensure that all of the required dependency jars are resolved by enabling DEBUG level logging and searching the log file for messages such as the following:
2015-09-21 10:05:10 DEBUG ConfigClassPath:74 - ...adding to classpath: url="file:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/common/lib/guava-11.0.2.jar
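For example, a sketch of the logging settings in the Java Adapter properties file that produce this output (assuming your release supports the gg.log and gg.log.level properties):
gg.log=log4j
gg.log.level=debug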
Parent topic: Troubleshooting the HDFS Handler
The contents of the HDFS core-site.xml file (including default settings) are output to the Java log4j log file when the logging level is set to DEBUG or TRACE. This output shows the connection properties to HDFS. Search for the following in the Java log4j log file:
2015-09-21 10:05:11 DEBUG HDFSConfiguration:58 - Begin - HDFS configuration object contents for connection troubleshooting.
If the fs.defaultFS property points to the local file system, then the core-site.xml file is not properly set in the gg.classpath property.
Key: [fs.defaultFS] Value: [file:///].
The following shows the fs.defaultFS property properly pointing at an HDFS host and port.
Key: [fs.defaultFS] Value: [hdfs://hdfshost:9000].
Parent topic: Troubleshooting the HDFS Handler
The Java log4j log file contains information on the configuration state of the HDFS Handler and the selected formatter. This information is output at the INFO log level. The output resembles the following:
2015-09-21 10:05:11 INFO AvroRowFormatter:156 - **** Begin Avro Row Formatter - Configuration Summary ****
Operation types are always included in the Avro formatter output.
The key for insert operations is [I].
The key for update operations is [U].
The key for delete operations is [D].
The key for truncate operations is [T].
Column type mapping has been configured to map source column types to an appropriate corresponding Avro type.
Created Avro schemas will be output to the directory [./dirdef].
Created Avro schemas will be encoded using the [UTF-8] character set.
In the event of a primary key update, the Avro Formatter will ABEND.
Avro row messages will not be wrapped inside a generic Avro message.
No delimiter will be inserted after each generated Avro message.
**** End Avro Row Formatter - Configuration Summary ****
2015-09-21 10:05:11 INFO HDFSHandler:207 - **** Begin HDFS Handler - Configuration Summary ****
Mode of operation is set to tx.
Data streamed to HDFS will be partitioned by table.
Tokens will be included in the output.
The HDFS root directory for writing is set to [/ogg].
The maximum HDFS file size has been set to 1073741824 bytes.
Rolling of HDFS files based on time is configured as off.
Rolling of HDFS files based on write inactivity is configured as off.
Rolling of HDFS files in the case of a metadata change event is enabled.
HDFS partitioning information:
The HDFS partitioning object contains no partitioning information.
HDFS Handler Authentication type has been configured to use [none]
**** End HDFS Handler - Configuration Summary ****
Parent topic: Troubleshooting the HDFS Handler