Learn how to use the HDFS Handler, which is designed to stream change capture data into the Hadoop Distributed File System (HDFS).
Topics:
HDFS is the primary file system for Big Data. Hadoop is typically installed on multiple machines that work together as a Hadoop cluster. Hadoop allows you to store very large amounts of data that is horizontally scaled across the machines in the cluster. You can then perform analytics on that data using a variety of Big Data applications.
Parent topic: Using the HDFS Handler
The HDFS SequenceFile is a flat file consisting of binary key and value pairs. You can enable writing data in SequenceFile format by setting the gg.handler.name.format property to sequencefile. The key part of the record is set to null, and the actual data is set in the value part. For information about Hadoop SequenceFile, see https://wiki.apache.org/hadoop/SequenceFile.
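For example, a minimal sketch of the handler properties that enable SequenceFile output with optional block compression (the handler name hdfs is illustrative; the compression properties are described in the configuration table later in this section):
gg.handler.hdfs.format=sequencefile
gg.handler.hdfs.compressionType=block
gg.handler.hdfs.compressionCodec=org.apache.hadoop.io.compress.DefaultCodec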
Topics:
The Oracle GoldenGate for Big Data release does not include a Hive storage handler because the HDFS Handler provides all of the necessary Hive functionality.
You can create a Hive integration to create tables and update table definitions in case of DDL events. This is limited to data formatted in Avro Object Container File format. For more information, see Writing in HDFS in Avro Object Container File Format and HDFS Handler Configuration.
For Hive to consume sequence files, the DDL creates Hive tables including STORED as sequencefile. The following is a sample create table script:
CREATE EXTERNAL TABLE table_name (
col1 string,
...
...
col2 string)
ROW FORMAT DELIMITED
STORED as sequencefile
LOCATION '/path/to/hdfs/file';
Note:
If files are intended to be consumed by Hive, then the gg.handler.name.partitionByTable property should be set to true.
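For example, a sketch of the handler settings that pair with the Hive DDL above (the handler name hdfs is illustrative):
gg.handler.hdfs.format=sequencefile
gg.handler.hdfs.partitionByTable=true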
Parent topic: Writing into HDFS in SequenceFile Format
The data is written in the value part of each record in delimited text format. All of the options described in the Using the Delimited Text Formatter section are applicable to the HDFS SequenceFile when writing data to it.
For example:
gg.handler.name.format=sequencefile
gg.handler.name.format.includeColumnNames=true
gg.handler.name.format.includeOpType=true
gg.handler.name.format.includeCurrentTimestamp=true
gg.handler.name.format.updateOpKey=U
Parent topic: Writing into HDFS in SequenceFile Format
To run the HDFS Handler, a Hadoop single instance or Hadoop cluster must be installed, running, and network-accessible from the machine running the HDFS Handler. Apache Hadoop is open source, and you can download it from the Apache Hadoop website.
Follow the Getting Started links for information on how to install a single-node cluster (for pseudo-distributed operation mode) or a clustered setup (for fully-distributed operation mode).
Instructions for configuring the HDFS Handler components and running the handler are described in the following sections.
Parent topic: Using the HDFS Handler
For the HDFS Handler to connect to HDFS and run, the HDFS core-site.xml file and the HDFS client jars must be configured in the gg.classpath variable. The HDFS client jars must match the version of HDFS that the HDFS Handler is connecting to. For a list of the required client jar files by release, see HDFS Handler Client Dependencies.
The default location of the core-site.xml file is Hadoop_Home/etc/hadoop.
The default locations of the HDFS client jars are the following directories:
Hadoop_Home/share/hadoop/common/lib/*
Hadoop_Home/share/hadoop/common/*
Hadoop_Home/share/hadoop/hdfs/lib/*
Hadoop_Home/share/hadoop/hdfs/*
The gg.classpath must be configured exactly as shown. The path to the core-site.xml file must contain the path to the directory containing the core-site.xml file with no wildcard appended. If you include a (*) wildcard in the path to the core-site.xml file, the file is not picked up. Conversely, the path to the dependency jars must include the (*) wildcard character in order to include all of the jar files in that directory in the associated classpath. Do not use *.jar.
The following is an example of a correctly configured gg.classpath variable:
gg.classpath=/ggwork/hadoop/hadoop-2.6.0/etc/hadoop:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/common/lib/*:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/common/*:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/hdfs/*:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/hdfs/lib/*
The HDFS configuration file hdfs-site.xml must also be in the classpath if Kerberos security is enabled. By default, the hdfs-site.xml file is located in the Hadoop_Home/etc/hadoop directory. If the HDFS Handler is not collocated with Hadoop, either or both files can be copied to another machine.
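For example, if the core-site.xml file and the client jars are copied from the Hadoop machine to the machine running the HDFS Handler, the gg.classpath might resemble the following sketch (the local paths are illustrative):
gg.classpath=/home/oggadm/hadoop-conf:/home/oggadm/hadoop-client-jars/*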
Parent topic: Setting Up and Running the HDFS Handler
The following are the configurable values for the HDFS Handler. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
To enable the selection of the HDFS Handler, you must first configure the handler type by specifying gg.handler.name.type=hdfs and the other HDFS properties as follows:
Table 8-1 HDFS Handler Configuration Properties
Property | Optional / Required | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.handlerlist | Required | Any string | None | Provides a name for the HDFS Handler. The HDFS Handler name then becomes part of the property names listed in this table. |
gg.handler.name.type | Required | hdfs | None | Selects the HDFS Handler for streaming change data capture into HDFS. |
gg.handler.name.mode | Optional | op or tx | op | Selects operation (op) mode or transaction (tx) mode for the handler. |
gg.handler.name.maxFileSize | Optional | The default unit of measure is bytes. You can use k, m, or g to specify kilobytes, megabytes, or gigabytes. | 1g | Selects the maximum file size of the created HDFS files. |
gg.handler.name.pathMappingTemplate | Optional | Any legal templated string to resolve the target write directory in HDFS. Templates can contain a mix of constants and keywords which are dynamically resolved at runtime to generate the HDFS write directory. | /ogg/${fullyQualifiedTableName} | You can use keywords interlaced with constants to dynamically generate the HDFS write directory at runtime, see Generating HDFS File Names Using Template Strings. |
gg.handler.name.fileRollInterval | Optional | The default unit of measure is milliseconds. You can stipulate ms, s, m, or h to signify milliseconds, seconds, minutes, or hours. | File rolling on time is off. | The timer starts when an HDFS file is created. If the file is still open when the interval elapses, then the file is closed. A new file is not immediately opened. New HDFS files are created on a just-in-time basis. |
gg.handler.name.inactivityRollInterval | Optional | The default unit of measure is milliseconds. You can use ms, s, m, or h to signify milliseconds, seconds, minutes, or hours. | File inactivity rolling on time is off. | The timer starts from the latest write to an HDFS file. New writes to an HDFS file restart the counter. If the file is still open when the counter elapses, the HDFS file is closed. A new file is not immediately opened. New HDFS files are created on a just-in-time basis. |
gg.handler.name.fileNameMappingTemplate | Optional | A string with resolvable keywords and constants used to dynamically generate HDFS file names at runtime. | ${fullyQualifiedTableName}_${groupName}_${currentTimestamp}.txt | You can use keywords interlaced with constants to dynamically generate unique HDFS file names at runtime, see Generating HDFS File Names Using Template Strings. File names typically follow the format ${fullyQualifiedTableName}_${groupName}_${currentTimestamp}.txt. |
gg.handler.name.partitionByTable | Optional | true or false | true (data is partitioned by table) | Determines whether data written into HDFS is partitioned by table. If set to true, data for different tables is written to different HDFS files. Must be set to true to use the Avro Object Container File Formatter and if the files are to be consumed by Hive. |
gg.handler.name.rollOnMetadataChange | Optional | true or false | true (HDFS files are rolled on a metadata change event) | Determines whether HDFS files are rolled in the case of a metadata change: true means the HDFS file is rolled, false means the HDFS file is not rolled. Must be set to true to use the Avro Object Container File Formatter. |
gg.handler.name.format | Optional | delimitedtext, json, json_row, xml, avro_row, avro_op, avro_row_ocf, avro_op_ocf, or sequencefile | delimitedtext | Selects the formatter for the HDFS Handler that determines how output data is formatted. |
gg.handler.name.includeTokens | Optional | true or false | false | Set to true to include the tokens field and tokens key/values in the output; set to false to suppress tokens output. |
gg.handler.name.partitioner.fully_qualified_table_name (equals one or more column names separated by commas) | Optional | Fully qualified table name and column names must exist. | None | Partitions the data into subdirectories in HDFS in the following format: par_{column name}={column value}. |
gg.handler.name.authType | Optional | kerberos | none | Setting this property to kerberos enables Kerberos authentication. |
gg.handler.name.kerberosKeytabFile | Optional (Required if authType=kerberos) | Relative or absolute path to a Kerberos keytab file. | None | The keytab file allows the HDFS Handler to access a password to perform a kinit operation for Kerberos security. |
gg.handler.name.kerberosPrincipalName | Optional (Required if authType=kerberos) | A legal Kerberos principal name (for example, user/FQDN@MY.REALM). | None | The Kerberos principal name for Kerberos authentication. |
gg.handler.name.schemaFilePath | Optional | A legal path in HDFS. | null | Set to a legal path in HDFS so that schemas (if available) are written in that HDFS directory. Schemas are currently only available for Avro and JSON formatters. In the case of a metadata change event, the schema is overwritten to reflect the schema change. |
gg.handler.name.compressionType (applicable to Sequence File format only) | Optional | block, none, or record | none | Hadoop Sequence File compression type. Applicable only if gg.handler.name.format is set to sequencefile. |
gg.handler.name.compressionCodec (applicable to Sequence File and Avro OCF formats only) | Optional | A fully qualified Hadoop compression codec class name, such as org.apache.hadoop.io.compress.DefaultCodec. | org.apache.hadoop.io.compress.DefaultCodec | Hadoop Sequence File compression codec. Applicable only if gg.handler.name.format is set to sequencefile. |
gg.handler.name.compressionCodec (applicable to Avro OCF format only) | Optional | null or snappy | null | Avro OCF Formatter compression code. This configuration controls the selection of the compression library to be used for Avro OCF files. Snappy includes native binaries in the Snappy JAR file and performs a Java-native traversal when compressing or decompressing. Use of Snappy may introduce runtime issues and platform porting issues that you may not experience when working with Java. You may need to perform additional testing to ensure that Snappy works on all of your required platforms. Snappy is an open source library, so Oracle cannot guarantee its ability to operate on all of your required platforms. |
gg.handler.name.hiveJdbcUrl | Optional | A legal URL for connecting to Hive using the Hive JDBC interface. | null (Hive integration disabled) | Only applicable to the Avro OCF Formatter. This configuration value provides a JDBC URL for connectivity to Hive through the Hive JDBC interface. Use of this property requires that you include the Hive JDBC library in the gg.classpath. Hive JDBC connectivity can be secured through basic credentials, SSL/TLS, or Kerberos. Configuration properties are provided for the user name and password for basic credentials. See the Hive documentation for how to generate a Hive JDBC URL for SSL/TLS or for Kerberos. (If Kerberos is used for Hive JDBC security, it must be enabled for HDFS connectivity. Then the Hive JDBC connection can piggyback on the HDFS Kerberos functionality by using the same Kerberos principal.) |
gg.handler.name.hiveJdbcUsername | Optional | A legal user name if the Hive JDBC connection is secured through credentials. | Java call result from System.getProperty("user.name") | Only applicable to the Avro OCF Formatter. This property is only relevant if the hiveJdbcUrl property is set. |
gg.handler.name.hiveJdbcPassword | Optional | A legal password if the Hive JDBC connection requires a password. | None | Only applicable to the Avro OCF Formatter. This property is only relevant if the hiveJdbcUrl property is set. |
gg.handler.name.hiveJdbcDriver | Optional | The fully qualified Hive JDBC driver class name. | org.apache.hive.jdbc.HiveDriver | Only applicable to the Avro OCF Formatter. This property is only relevant if the hiveJdbcUrl property is set. |
gg.handler.name.openNextFileAtRoll | Optional | true or false | false | Applicable only when the HDFS Handler is not writing an Avro OCF or sequence file, to support extract, load, transform (ELT) situations. When set to true, a new file is immediately opened when the previous file rolls. File rolls can be triggered by any one of the following: a metadata change, the file roll interval elapsing, or the inactivity roll interval elapsing. This property is useful when data files are being loaded into HDFS and a monitor program is monitoring the write directories waiting to consume the data. The monitoring programs use the appearance of a new file as a trigger so that the previous file can be consumed by the consuming application. |
gg.handler.name.hsync | Optional | true or false | false | Set to true to use an hsync call instead of an hflush call when flushing data to HDFS. Setting this property to true provides a stronger guarantee that flushed data has been written to the disks of the HDFS data nodes, at a cost in performance. For most applications, setting this property to false is appropriate. |
Parent topic: Setting Up and Running the HDFS Handler
The following is a sample configuration for the HDFS Handler from the Java Adapter properties file:
gg.handlerlist=hdfs
gg.handler.hdfs.type=hdfs
gg.handler.hdfs.mode=tx
gg.handler.hdfs.includeTokens=false
gg.handler.hdfs.maxFileSize=1g
gg.handler.hdfs.pathMappingTemplate=/ogg/${fullyQualifiedTableName}
gg.handler.hdfs.fileRollInterval=0
gg.handler.hdfs.inactivityRollInterval=0
gg.handler.hdfs.partitionByTable=true
gg.handler.hdfs.rollOnMetadataChange=true
gg.handler.hdfs.authType=none
gg.handler.hdfs.format=delimitedtext
Parent topic: Setting Up and Running the HDFS Handler
The HDFS Handler calls the HDFS flush method on the HDFS write stream to flush data to the HDFS data nodes at the end of each transaction in order to maintain write durability. This is an expensive call, and performance can be adversely affected, especially in the case of transactions of one or few operations, which result in numerous HDFS flush calls.
Performance of the HDFS Handler can be greatly improved by batching multiple small transactions into a single larger transaction. If you require high performance, configure batching functionality for the Replicat process. For more information, see Replicat Grouping.
The HDFS client libraries spawn threads for every HDFS file stream opened by the HDFS Handler. Therefore, the number of threads executing in the JVM grows proportionally to the number of HDFS file streams that are open. Performance of the HDFS Handler may degrade as more HDFS file streams are opened. Configuring the HDFS Handler to write to many HDFS files (due to many source replication tables or extensive use of partitioning) may result in degraded performance. If your use case requires writing to many tables, then Oracle recommends that you enable the roll on time or roll on inactivity features to close HDFS file streams. Closing an HDFS file stream causes the HDFS client threads to terminate, and the associated resources can be reclaimed by the JVM.
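For example, a sketch that uses the roll on time and roll on inactivity features to bound the number of open HDFS file streams (the handler name hdfs and the interval values are illustrative):
gg.handler.hdfs.fileRollInterval=1h
gg.handler.hdfs.inactivityRollInterval=10m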
Parent topic: Setting Up and Running the HDFS Handler
The HDFS cluster can be secured using Kerberos authentication, and the HDFS Handler can connect to a Kerberos-secured cluster. The HDFS core-site.xml file should be in the handler's classpath with the hadoop.security.authentication property set to kerberos and the hadoop.security.authorization property set to true. Additionally, you must set the following properties in the HDFS Handler Java configuration file:
gg.handler.name.authType=kerberos
gg.handler.name.kerberosPrincipalName=legal Kerberos principal name
gg.handler.name.kerberosKeytabFile=path to a keytab file that contains the password for the Kerberos principal so that the HDFS Handler can programmatically perform the Kerberos kinit operations to obtain a Kerberos ticket
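For example, a concrete sketch (the handler name hdfs, the principal name, and the keytab path are illustrative):
gg.handler.hdfs.authType=kerberos
gg.handler.hdfs.kerberosPrincipalName=oggadm@EXAMPLE.COM
gg.handler.hdfs.kerberosKeytabFile=/etc/security/keytabs/oggadm.keytab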
You may find that the HDFS Handler is unable to decrypt the Kerberos password from the keytab file. This causes Kerberos authentication to fall back to interactive mode, which cannot work because the authentication is invoked programmatically. The cause of this problem is that the Java Cryptography Extension (JCE) is not installed in the Java Runtime Environment (JRE). Ensure that the JCE is loaded in the JRE, see http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html.
Parent topic: Setting Up and Running the HDFS Handler
The HDFS Handler includes specialized functionality to write to HDFS in Avro Object Container File (OCF) format. Avro OCF is part of the Avro specification and is detailed in the Avro documentation at:
https://avro.apache.org/docs/current/spec.html#Object+Container+Files
Avro OCF format may be a good choice because it:
integrates with Apache Hive (raw Avro written to HDFS is not supported by Hive), and
provides good support for schema evolution.
Configure the following to enable writing to HDFS in Avro OCF format:
To write row data to HDFS in Avro OCF format, set the gg.handler.name.format=avro_row_ocf property.
To write operation data to HDFS in Avro OCF format, set the gg.handler.name.format=avro_op_ocf property.
The HDFS and Avro OCF integration includes functionality to create the corresponding tables in Hive and update the schema for metadata change events. The configuration section provides information on the properties to enable integration with Hive. The Oracle GoldenGate Hive integration accesses Hive using the JDBC interface, so the Hive JDBC server must be running to enable this integration.
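For example, a sketch combining these settings with the Hive integration (the handler name hdfs and the Hive JDBC URL are illustrative; partitionByTable and rollOnMetadataChange must be true for Avro OCF):
gg.handler.hdfs.format=avro_row_ocf
gg.handler.hdfs.partitionByTable=true
gg.handler.hdfs.rollOnMetadataChange=true
gg.handler.hdfs.hiveJdbcUrl=jdbc:hive2://hivehost:10000/default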
Parent topic: Using the HDFS Handler
The HDFS Handler can dynamically generate HDFS file names using a template string. The template string allows you to mix constants with keywords that are dynamically resolved at runtime, giving you control over generated HDFS file names. You can control the file name template using the gg.handler.name.fileNameMappingTemplate configuration property. The default value for this parameter is:
${fullyQualifiedTableName}_${groupName}_${currentTimestamp}.txt
Supported keywords, which are dynamically replaced at runtime, include the following:
${fullyQualifiedTableName}
The fully qualified table name with a period (.) delimiting the names. For example, oracle.test.table1.
${catalogName}
The catalog name of the source table. For example, oracle.
${schemaName}
The schema name of the source table. For example, test.
${tableName}
The short table name of the source table. For example, table1.
${groupName}
The Replicat process name concatenated with the thread ID if using coordinated apply. For example, HDFS001.
${currentTimestamp}
The current timestamp. The default output format for the date time is yyyy-MM-dd_HH-mm-ss.SSS. For example, 2017-07-05_04-31-23.123. Alternatively, you can configure your own format mask for the date using the syntax ${currentTimestamp[yyyy-MM-dd_HH-mm-ss.SSS]}. Date time format masks follow the convention in the java.text.SimpleDateFormat Java class.
${toLowerCase[]}
Converts the argument inside of the square brackets to lowercase. Keywords can be nested inside of the square brackets as follows:
${toLowerCase[${fullyQualifiedTableName}]}
This is important because source table names are normalized in Oracle GoldenGate to uppercase.
${toUpperCase[]}
Converts the argument inside of the square brackets to uppercase. Keywords can be nested inside of the square brackets.
The following are examples of legal templates and the resolved strings:
Example Template | Replacement |
---|---|
${schemaName}.${tableName}__${groupName}_${currentTimestamp}.txt | test.table1__HDFS001_2017-07-05_04-31-23.123.txt |
${fullyQualifiedTableName}--${currentTimestamp}.avro | oracle.test.table1--2017-07-05_04-31-23.123.avro |
${fullyQualifiedTableName}_${currentTimestamp[yyyy-MM-ddTHH-mm-ss.SSS]}.json | oracle.test.table1_2017-07-05T04-31-23.123.json |
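For example, a sketch setting such a template in the handler configuration (the handler name hdfs and the template itself are illustrative):
gg.handler.hdfs.fileNameMappingTemplate=${toLowerCase[${schemaName}]}_${tableName}_${groupName}_${currentTimestamp}.txt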
Be aware of these restrictions when generating HDFS file names using templates:
Generated HDFS file names must be legal HDFS file names.
Oracle strongly recommends that you use ${groupName} as part of the HDFS file naming template when using coordinated apply and breaking down source table data to different Replicat threads. The group name provides uniqueness of generated HDFS names that ${currentTimestamp} alone does not guarantee. HDFS file name collisions result in an abend of the Replicat process.
Parent topic: Using the HDFS Handler
Metadata change events are handled in the HDFS Handler. The default behavior of the HDFS Handler is to roll the current relevant file in the event of a metadata change. This behavior allows the results of metadata changes to at least be separated into different files. File rolling on metadata change is configurable and can be turned off.
To support metadata change events, the process capturing changes in the source database must support both DDL changes and metadata in trail. Oracle GoldenGate does not support DDL replication for all database implementations. See the Oracle GoldenGate installation and configuration guide for the appropriate database to determine whether DDL replication is supported.
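As noted above, file rolling on metadata change can be turned off. A sketch (the handler name hdfs is illustrative):
gg.handler.hdfs.rollOnMetadataChange=false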
Parent topic: Using the HDFS Handler
The HDFS Handler supports partitioning of table data by one or more column values. The configuration syntax to enable partitioning is the following:
gg.handler.name.partitioner.fully_qualified_table_name=one or more column names separated by commas
Consider the following example:
gg.handler.hdfs.partitioner.dbo.orders=sales_region
This example can result in the following breakdown of files in HDFS:
/ogg/dbo.orders/par_sales_region=west/data files
/ogg/dbo.orders/par_sales_region=east/data files
/ogg/dbo.orders/par_sales_region=north/data files
/ogg/dbo.orders/par_sales_region=south/data files
You should exercise care when choosing columns for partitioning. The key is to choose columns that contain only a few (10 or fewer) possible values, and to make sure that those values are also helpful for grouping and analyzing the data. For example, a column of sales regions would be good for partitioning, while a column that contains the customers' dates of birth would not. Configuring partitioning on a column that has many possible values can cause problems. A poor choice can result in hundreds of HDFS file streams being opened, and performance may degrade for the reasons discussed in Performance Considerations. Additionally, poor partitioning can result in problems during data analysis. Apache Hive requires that all where clauses specify partition criteria if the Hive data is partitioned.
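For example, partitioning the same table by two columns is a matter of listing both columns (a sketch; the order_type column is illustrative):
gg.handler.hdfs.partitioner.dbo.orders=sales_region,order_type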
Parent topic: Using the HDFS Handler
The Oracle HDFS Handler requires certain HDFS client libraries to be resolved in its classpath as a prerequisite for streaming data to HDFS.
For a list of required client JAR files by version, see HDFS Handler Client Dependencies. The HDFS client jars do not ship with the Oracle GoldenGate for Big Data product. The HDFS Handler supports multiple versions of HDFS, and the HDFS client jars must be the same version as the HDFS version to which the HDFS Handler is connecting. The HDFS client jars are open source and are freely available to download from sites such as the Apache Hadoop site or the Maven Central repository.
In order to establish connectivity to HDFS, the HDFS core-site.xml file must be in the classpath of the HDFS Handler. If the core-site.xml file is not in the classpath, the HDFS client code defaults to a mode that attempts to write to the local file system. Writing to the local file system instead of HDFS can be advantageous for troubleshooting, building a proof of concept (POC), or as a step in the process of building an HDFS integration.
Another common issue is that data streamed to HDFS using the HDFS Handler may not be immediately available to Big Data analytic tools such as Hive. This behavior commonly occurs when the HDFS Handler is in possession of an open write stream to an HDFS file. HDFS writes in blocks of 128 MB by default. HDFS blocks under construction are not always visible to analytic tools. Additionally, inconsistencies between file sizes when using the -ls, -cat, and -get commands in the HDFS shell may occur. This is an anomaly of HDFS streaming and is discussed in the HDFS specification. This anomaly leads to a potential 128 MB per file blind spot in analytic data. This may not be an issue if you have a steady stream of replication data and do not require low levels of latency for analytic data from HDFS. However, this may be a problem in some use cases, because closing the HDFS write stream finalizes the block writing: data is immediately visible to analytic tools, and file sizing metrics become consistent again. Therefore, the file rolling feature in the HDFS Handler can be used to close HDFS write streams, making all data visible.
Important:
The file rolling solution may present its own problems. Extensive use of file rolling can result in many small files in HDFS. Many small files in HDFS may result in performance issues in analytic tools.
You may also notice the HDFS inconsistency problem in the following scenarios:
The HDFS Handler process crashes.
A forced shutdown is called on the HDFS Handler process.
A network outage or other issue causes the HDFS Handler process to abend.
In each of these scenarios, it is possible for the HDFS Handler to end without explicitly closing the HDFS write stream and finalizing the write block. HDFS ultimately recognizes that the write stream has been broken and finalizes the write block internally. In this scenario, you may experience a short-term delay before the HDFS process finalizes the write block.
Parent topic: Using the HDFS Handler
It is considered a Big Data best practice for the HDFS cluster to operate on dedicated servers called cluster nodes. Edge nodes are server machines that host the applications that stream data to and retrieve data from the HDFS cluster nodes. Because the HDFS cluster nodes and the edge nodes are different servers, the following benefits are seen:
The HDFS cluster nodes do not compete for resources with the applications interfacing with the cluster.
The requirements for the HDFS cluster nodes and edge nodes probably differ. This physical topology allows the appropriate hardware to be tailored to specific needs.
It is a best practice for the HDFS Handler to be installed and running on an edge node and to stream data to the HDFS cluster using a network connection. The HDFS Handler can run on any machine that has network visibility to the HDFS cluster. The installation of the HDFS Handler on an edge node requires that the core-site.xml file and the dependency jars be copied to the edge node so that the HDFS Handler can access them. The HDFS Handler can also run collocated on an HDFS cluster node if required.
Parent topic: Using the HDFS Handler
Troubleshooting of the HDFS Handler begins with the contents of the Java log4j log file. Follow the directions in the Java Logging Configuration to configure the runtime to correctly generate the Java log4j log file.
Topics:
Parent topic: Using the HDFS Handler
Problems with the Java classpath are common. The usual indication of a Java classpath problem is a ClassNotFoundException in the Java log4j log file. Setting the log level to DEBUG causes each of the jars referenced in the gg.classpath object to be logged to the log file. In this way, you can ensure that all of the required dependency jars are resolved by enabling DEBUG level logging and searching the log file for messages such as the following:
2015-09-21 10:05:10 DEBUG ConfigClassPath:74 - ...adding to classpath: url="file:/ggwork/hadoop/hadoop-2.6.0/share/hadoop/common/lib/guava-11.0.2.jar
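For example, a sketch of the logging settings in the Java Adapter properties file that produce this output (assuming your release supports the gg.log and gg.log.level properties):
gg.log=log4j
gg.log.level=debug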
Parent topic: Troubleshooting the HDFS Handler
The contents of the HDFS core-site.xml file (including default settings) are output to the Java log4j log file when the logging level is set to DEBUG or TRACE. This output shows the connection properties to HDFS. Search for the following in the Java log4j log file:
2015-09-21 10:05:11 DEBUG HDFSConfiguration:58 - Begin - HDFS configuration object contents for connection troubleshooting.
If the fs.defaultFS property points to the local file system, then the core-site.xml file is not properly set in the gg.classpath property.
Key: [fs.defaultFS] Value: [file:///].
The following shows the fs.defaultFS property properly pointing at an HDFS host and port.
Key: [fs.defaultFS] Value: [hdfs://hdfshost:9000].
Parent topic: Troubleshooting the HDFS Handler
The Java log4j log file contains information on the configuration state of the HDFS Handler and the selected formatter. This information is output at the INFO log level. The output resembles the following:
2015-09-21 10:05:11 INFO AvroRowFormatter:156 - **** Begin Avro Row Formatter - Configuration Summary ****
Operation types are always included in the Avro formatter output.
The key for insert operations is [I].
The key for update operations is [U].
The key for delete operations is [D].
The key for truncate operations is [T].
Column type mapping has been configured to map source column types to an appropriate corresponding Avro type.
Created Avro schemas will be output to the directory [./dirdef].
Created Avro schemas will be encoded using the [UTF-8] character set.
In the event of a primary key update, the Avro Formatter will ABEND.
Avro row messages will not be wrapped inside a generic Avro message.
No delimiter will be inserted after each generated Avro message.
**** End Avro Row Formatter - Configuration Summary ****
2015-09-21 10:05:11 INFO HDFSHandler:207 - **** Begin HDFS Handler - Configuration Summary ****
Mode of operation is set to tx.
Data streamed to HDFS will be partitioned by table.
Tokens will be included in the output.
The HDFS root directory for writing is set to [/ogg].
The maximum HDFS file size has been set to 1073741824 bytes.
Rolling of HDFS files based on time is configured as off.
Rolling of HDFS files based on write inactivity is configured as off.
Rolling of HDFS files in the case of a metadata change event is enabled.
HDFS partitioning information:
The HDFS partitioning object contains no partitioning information.
HDFS Handler Authentication type has been configured to use [none]
**** End HDFS Handler - Configuration Summary ****
Parent topic: Troubleshooting the HDFS Handler