4 Using the Elasticsearch Handler

Learn how to use the Elasticsearch Handler, which allows you to store, search, and analyze large volumes of data quickly and in near real time.

Topics:

4.1 Overview

Elasticsearch is a highly scalable open-source full-text search and analytics engine. Elasticsearch allows you to store, search, and analyze large volumes of data quickly and in near real time. It is generally used as the underlying engine or technology that drives applications with complex search features.

The Elasticsearch Handler uses the Elasticsearch Java client to connect and receive data into Elasticsearch node, see https://www.elastic.co.

4.2 Detailing the Functionality

Topics:

4.2.1 About the Elasticsearch Version Property

The Elasticsearch Handler property gg.handler.name.version should be set according to the version of the Elasticsearch cluster. The Elasticsearch Handler uses a Java Transport client, which must have the same major version (such as, 2.x, or 5.x) as the nodes in the cluster. The Elasticsearch Handler can connect to clusters that have a different minor version (such as, 2.3.x) though new functionality may not be supported.

4.2.2 About the Index and Type

An Elasticsearch index is a collection of documents with similar characteristics. An index can only be created in lowercase. An Elasticsearch type is a logical group within an index. All the documents within an index or type should have same number and type of fields.

The Elasticsearch Handler maps the source trail schema concatenated with source trail table name to construct the index. For three-part table names in source trail, the index is constructed by concatenating source catalog, schema, and table name.

The Elasticsearch Handler maps the source table name to the Elasticsearch type. The type name is case-sensitive.

Table 4-1 Elasticsearch Mapping

Source Trail Elasticsearch Index Elasticsearch Type

schema.tablename

schema_tablename

tablename

catalog.schema.tablename

catalog_schema_tablename

tablename

If an index does not already exist in the Elasticsearch cluster, a new index is created when Elasticsearch Handler receives (INSERT or UPDATE operation in source trail) data.

4.2.3 About the Document

An Elasticsearch document is a basic unit of information that can be indexed. Within an index or type, you can store as many documents as you want. Each document has an unique identifier based on the _id field.

The Elasticsearch Handler maps the source trail primary key column value as the document identifier.

4.2.4 About the Primary Key Update

The Elasticsearch document identifier is created based on the source table's primary key column value. The document identifier cannot be modified. The Elasticsearch handler processes a source primary key's update operation by performing a DELETE followed by an INSERT. While performing the INSERT, there is a possibility that the new document may contain fewer fields than required. For the INSERT operation to contain all the fields in the source table, enable trail Extract to capture the full data before images for update operations or use GETBEFORECOLS to write the required column’s before images.

4.2.5 About the Data Types

Elasticsearch supports the following data types:

  • 32-bit integer

  • 64-bit integer

  • Double

  • Date

  • String

  • Binary

4.2.6 Operation Mode

The Elasticsearch Handler uses the operation mode for better performance. The gg.handler.name.mode property is not used by the handler.

4.2.7 Operation Processing Support

The Elasticsearch Handler maps the source table name to the Elasticsearch type. The type name is case-sensitive.

For three-part table names in source trail, the index is constructed by concatenating source catalog, schema, and table name.

INSERT

The Elasticsearch Handler creates a new index if the index does not exist, and then inserts a new document.

UPDATE

If an Elasticsearch index or document exists, the document is updated. If an Elasticsearch index or document does not exist, a new index is created and the column values in the UPDATE operation are inserted as a new document.

DELETE

If an Elasticsearch index or document exists, the document is deleted. If Elasticsearch index or document does not exist, a new index is created with zero fields.

The TRUNCATE operation is not supported.

4.2.8 About the Connection

A cluster is a collection of one or more nodes (servers) that holds the entire data. It provides federated indexing and search capabilities across all nodes.

A node is a single server that is part of the cluster, stores the data, and participates in the cluster’s indexing and searching.

The Elasticsearch Handler property gg.handler.name.ServerAddressList can be set to point to the nodes available in the cluster.

4.3 Setting Up and Running the Elasticsearch Handler

You must ensure that the Elasticsearch cluster is setup correctly and the cluster is up and running, see https://www.elastic.co/guide/en/elasticsearch/reference/current/_installation.html. Alternatively, you can use Kibana to verify the setup.

Set the Classpath

The property gg.classpath must include all the jars required by the Java transport client. For a listing of the required client JAR files by version, see Elasticsearch Handler Client Dependencies.

Default location of 2.X JARs:

        Elasticsearch_Home/lib/*
        Elasticsearch_Home/plugins/shield/*
Default location of 5.X JARs:

  Elasticsearch_Home/lib/*
        Elasticsearch_Home/plugins/x-pack/*
        Elasticsearch_Home/modules/transport-netty3/*       
        Elasticsearch_Home/modules/transport-netty4/*
        Elasticsearch_Home/modules/reindex/*                

The inclusion of the * wildcard in the path can include the * wildcard character in order to include all of the JAR files in that directory in the associated classpath. Do not use *.jar.

The following is an example of the correctly configured classpath:

     gg.classpath=Elasticsearch_Home/lib/*

Topics:

4.3.1 Configuring the Elasticsearch Handler

The following are the configurable values for the Elasticsearch handler. These properties are located in the Java Adapter properties file (not in the Replicat properties file).

To enable the selection of the Elasticsearch Handler, you must first configure the handler type by specifying gg.handler.jdbc.type=elasticsearch and the other Elasticsearch properties as follows:

Table 4-2 Elasticsearch Handler Configuration Properties

Properties Required/ Optional Legal Values Default Explanation
gg.handlerlist

Required

Name (any name of your choice)

None

The list of handlers to be used.

gg.handler.name.type

Required

elasticsearch

None

Type of handler to use. For example, Elasticsearch, Kafka, Flume, or HDFS.

gg.handler.name.ServerAddressList

Optional

Server:Port[, Server:Port]

localhost:9300

Comma separated list of contact points of the nodes to connect to the Elasticsearch cluster.

gg.handler.name.clientSettingsFile

Required

Transport client properties file.

None

The filename in classpath that holds Elasticsearch transport client properties used by the Elasticsearch Handler.

gg.handler.name.version

Optional

2.x | 5.x

2.x

The version of the transport client used by the Elasticsearch Handler, this should be compatible with the Elasticsearch cluster.

gg.handler.name.bulkWrite

Optional

true | false

false

When this property is true, the Elasticsearch Handler uses the bulk write API to ingest data into Elasticsearch cluster. The batch size of bulk write can be controlled using the MAXTRANSOPS Replicat parameter.

gg.handler.name.numberAsString

Optional

true | false

false

When this property is true, the Elasticsearch Handler would receive all the number column values (Long, Integer, or Double) in the source trail as strings  into the Elasticsearch cluster.

gg.handler.elasticsearch.upsert

Optional

true | false

true

When this property is true, a new document is inserted if the document does not already exist when performing an UPDATE operation.

Example 4-1 Sample Handler Properties file:

For 2.x Elasticsearch cluster:

gg.handlerlist=elasticsearch
gg.handler.elasticsearch.type=elasticsearch
gg.handler.elasticsearch.ServerAddressList=localhost:9300
gg.handler.elasticsearch.clientSettingsFile=client.properties
gg.handler.elasticsearch.version=2.x
gg.classpath=/path/to/elastic/lib/*

For 2.x Elasticsearch cluster with Shield:

gg.handlerlist=elasticsearch
gg.handler.elasticsearch.type=elasticsearch
gg.handler.elasticsearch.ServerAddressList=localhost:9300
gg.handler.elasticsearch.clientSettingsFile=client.properties
gg.handler.elasticsearch.version=2.x
gg.classpath=/path/to/elastic/lib/*:/path/to/elastic/plugins/shield/*

For 5.x Elasticsearch cluster:

gg.handlerlist=elasticsearch
gg.handler.elasticsearch.type=elasticsearch
gg.handler.elasticsearch.ServerAddressList=localhost:9300
gg.handler.elasticsearch.clientSettingsFile=client.properties
gg.handler.elasticsearch.version=5.x
gg.classpath=/path/to/elastic/lib/*:/path/to/elastic/modules/transport-netty4/*:/path/to/elastic/modules/reindex/*

For 5.x Elasticsearch cluster with x-pack:

gg.handlerlist=elasticsearch
gg.handler.elasticsearch.type=elasticsearch
gg.handler.elasticsearch.ServerAddressList=localhost:9300
gg.handler.elasticsearch.clientSettingsFile=client.properties
gg.handler.elasticsearch.version=5.x
gg.classpath=/path/to/elastic/lib/*:/path/to/elastic/plugins/x-pack/*:/path/to/elastic/modules/transport-netty4/*:/path/to/elastic/modules/reindex/*

Sample Replicat configuration and a Java Adapter Properties files can be found at the following directory:

GoldenGate_install_directory/AdapterExamples/big-data/elasticsearch

4.3.2 About the Transport Client Settings Properties File

The Elasticsearch Handler uses a Java Transport client to interact with Elasticsearch cluster. The Elasticsearch cluster may have addional plug-ins like shield or x-pack, which may require additional configuration.

The gg.handler.name.clientSettingsFile property should point to a file that has additional client settings based on the version of Elasticsearch cluster. The Elasticsearch Handler attempts to locate and load the client settings file using the Java classpath. Te Java classpath must include the directory containing the properties file.

The client properties file for Elasticsearch 2.x(without any plug-in) is:

cluster.name=Elasticsearch_cluster_name

The client properties file for Elasticsearch 2.X with the Shield plug-in:

cluster.name=Elasticsearch_cluster_name
shield.user=shield_username:shield_password

The Shield plug-in also supports additional capabilities like SSL and IP filtering. The properties can be set in the client.properties file, see https://www.elastic.co/guide/en/elasticsearch/client/java-api/2.4/transport-client.html and https://www.elastic.co/guide/en/shield/current/_using_elasticsearch_java_clients_with_shield.html.

The client.properties file for Elasticsearch 5.x with the X-Pack plug-in is:

cluster.name=Elasticsearch_cluster_name
xpack.security.user=x-pack_username:x-pack-password

The X-Pack plug-in also supports additional capabilities. The properties can be set in the client.properties file, see https://www.elastic.co/guide/en/elasticsearch/client/java-api/5.1/transport-client.html and https://www.elastic.co/guide/en/x-pack/current/java-clients.html.

4.4 Performance Consideration

The Elasticsearch Handler gg.handler.name.bulkWrite property is used to determine whether the source trail records should be pushed to the Elasticsearch cluster one at a time or in bulk using the bulk write API. When this property is true, the source trail operations are pushed to the Elasticsearch cluster in batches whose size can be controlled by the MAXTRANSOPS parameter in the generic Replicat parameter file. Using the bulk write API provides better performance.

Elasticsearch uses different thread pools to improve how memory consumption of threads are managed within a node. Many of these pools also have queues associated with them, which allow pending requests to be held instead of discarded.

For bulk operations, the default queue size is 50 (in version 5.2) and 200 (in version 5.3).

To avoid bulk API errors, you must set the Replicat MAXTRANSOPS size to match the bulk thread pool queue size at a minimum. The configuration thread_pool.bulk.queue_size property can be modified in the elasticsearch.yaml file.

4.5 About the Shield Plug-In Support

Elasticsearch versions 2.x supports a Shield plug-in which provides basic authentication, SSL and IP filtering. Similar capabilities exists in the X-Pack plug-in for Elasticsearch 5.x. The additional transport client settings can be configured in the Elasticsearch Handler using the gg.handler.name.clientSettingsFile property.

4.6 About DDL Handling

The Elasticsearch Handler does not react to any DDL records in the source trail. Any data manipulation records for a new source table results in auto-creation of index or type in the Elasticsearch cluster.

4.7 Troubleshooting

This section contains information to help you troubleshoot various issues.

Topics:

4.7.1 Incorrect Java Classpath

The most common initial error is an incorrect classpath to include all the required client libraries and creates a ClassNotFound exception in the log4j log file.

Also, it may be due to an error resolving the classpath if there is a typographic error in the gg.classpath variable.

The Elasticsearch transport client libraries do not ship with the Oracle GoldenGate for Big Data product. You should properly configure the gg.classpath property in the Java Adapter Properties file to correctly resolve the client libraries, see Setting Up and Running the Elasticsearch Handler.

4.7.2 Elasticsearch Version Mismatch

The Elasticsearch Handler gg.handler.name.version property must be set to 2.x or 5.x to match the major version number of the Elasticsearch cluster.

The following errors may occur when there is a wrong version configuration:

Error: NoNodeAvailableException[None of the configured nodes are available:]

ERROR 2017-01-30 22:35:07,240 [main] Unable to establish connection. Check handler properties and client settings configuration.

java.lang.IllegalArgumentException: unknown setting [shield.user] 

Ensure that all required plug-ins are installed and review documentation changes for any removed settings.

4.7.3 Transport Client Properties File Not Found

To resolve this exception:

ERROR 2017-01-30 22:33:10,058 [main] Unable to establish connection. Check handler properties and client settings configuration.

Verify that the gg.handler.name.clientSettingsFile configuration property is correctly setting the Elasticsearch transport client settings file name. Verify that the gg.classpath variable includes the path to the correct file name and that the path to the properties file does not contain an asterisk (*) wildcard at the end.

4.7.4 Cluster Connection Problem

This error occurs when the Elasticsearch Handler is unable to connect to the Elasticsearch cluster:

Error: NoNodeAvailableException[None of the configured nodes are available:]

Use the following steps to debug the issue:

  1. Ensure that the Elasticsearch server process is running.

  2. Validate the cluster.name property in the client properties configuration file.

  3. Validate the authentication credentials for the x-Pack or Shield plug-in in the client properties file.

  4. Validate the gg.handler.name.ServerAddressList handler property.

4.7.5 Unsupported Truncate Operation

The following error occurs when the Elasticsearch Handler finds a TRUNCATE operation in the source trail:

oracle.goldengate.util.GGException: Elasticsearch Handler does not support the operation: TRUNCATE

This exception error message is written to the handler log file before the RAeplicat process abends. Removing the GETTRUNCATES parameter from the Replicat parameter file resolves this error.

4.7.6 Bulk Execute Errors

""
DEBUG [main] (ElasticSearch5DOTX.java:130) - Bulk execute status: failures:[true] buildFailureMessage:[failure in bulk execution: [0]: index [cs2cat_s1sch_n1tab], type [N1TAB], id [83], message [RemoteTransportException[[UOvac8l][127.0.0.1:9300][indices:data/write/bulk[s][p]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$7@43eddfb2 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@5ef5f412[Running, pool size = 4, active threads = 4, queued tasks = 50, completed tasks = 84]]];]

It may be due to the Elasticsearch running out of resources to process the operation. You can limit the Replicat batch size using MAXTRANSOPS to match the value of the thread_pool.bulk.queue_size Elasticsearch configuration parameter.

Note:

Changes to the Elasticsearch parameter, thread_pool.bulk.queue_size, are effective only after the Elasticsearch node is restarted.

4.8 Logging

The following log messages appear in the handler log file on successful connection:

Connection to 2.x Elasticsearch cluster:

INFO 2017-01-31 01:43:38,814 [main] **BEGIN Elasticsearch client settings**
INFO 2017-01-31 01:43:38,860 [main]     key[cluster.name] value[elasticsearch-user1-myhost]
INFO 2017-01-31 01:43:38,860 [main]      key[request.headers.X-Found-Cluster] value[elasticsearch-user1-myhost]
INFO 2017-01-31 01:43:38,860 [main]     key[shield.user] value[es_admin:user1]
INFO 2017-01-31 01:43:39,715 [main] Connecting to Server[myhost.us.example.com] Port[9300]
INFO 2017-01-31 01:43:39,715 [main] Client node name:  Smith
INFO 2017-01-31 01:43:39,715 [main] Connected nodes: [{node-myhost}{vqtHikpGQP-IXieHsgqXjw}{10.196.38.196}{198.51.100.1:9300}]
INFO 2017-01-31 01:43:39,715 [main] Filtered nodes: []
INFO 2017-01-31 01:43:39,715 [main] **END Elasticsearch client settings**

Connection to a 5.x Elasticsearch cluster:

INFO [main] (Elasticsearch5DOTX.java:38) - **BEGIN Elasticsearch client settings**
INFO [main] (Elasticsearch5DOTX.java:39) - {xpack.security.user=user1:user1_kibana, cluster.name=elasticsearch-user1-myhost, request.headers.X-Found-Cluster=elasticsearch-user1-myhost}
INFO [main] (Elasticsearch5DOTX.java:52) - Connecting to Server[myhost.us.example.com] Port[9300]
INFO [main] (Elasticsearch5DOTX.java:64) - Client node name:  _client_
INFO [main] (Elasticsearch5DOTX.java:65) - Connected nodes: [{node-myhost}{w9N25BrOSZeGsnUsogFn1A}{bIiIultVRjm0Ze57I3KChg}{myhost}{198.51.100.1:9300}]
INFO [main] (Elasticsearch5DOTX.java:66) - Filtered nodes: []
INFO [main] (Elasticsearch5DOTX.java:68) - **END Elasticsearch client settings**

4.9 Known Issues in the Elasticsearch Handler

Elasticsearch: Trying to input very large number

Very large numbers result in inaccurate values with Elasticsearch document. For example, 9223372036854775807, -9223372036854775808. This is an issue with the Elasticsearch server and not a limitation of the Elasticsearch Handler.

The workaround for this issue is to ingest all the number values as strings using the gg.handler.name.numberAsString=true property.

Elasticsearch: Issue with index

The Elasticsearch Handler is not able to input data into the same index if there are more than one table with similar column names and different column data types.

Index names are always lowercase though the catalog/schema/tablename in the trail may be case-sensitive.