Learn how to use the Cassandra Handler, which provides the interface to Apache Cassandra databases.
Topics:
Apache Cassandra is a NoSQL Database Management System designed to store large amounts of data. A Cassandra cluster configuration provides horizontal scaling and replication of data across multiple machines. It can provide high availability and eliminate a single point of failure by replicating data to multiple nodes within a Cassandra cluster. Apache Cassandra is open source and designed to run on low-cost commodity hardware.
Cassandra relaxes the axioms of a traditional relational database management system (RDBMS) regarding atomicity, consistency, isolation, and durability. When considering implementing Cassandra, it is important to understand its differences from a traditional RDBMS and how those differences affect your specific use case.
Cassandra provides eventual consistency. Under the eventual consistency model, accessing the state of data for a specific row eventually returns the latest state of the data for that row as defined by the most recent change. However, there may be a latency period between the creation or modification of the state of a row and what is returned when the state of that row is queried. The benefit of eventual consistency is that the latency period can be predicted based on your Cassandra configuration and the workload that your Cassandra cluster is currently under. For more information, see http://cassandra.apache.org/.
The Cassandra Handler provides some control over consistency with the configuration of the gg.handler.name.consistencyLevel
property in the Java Adapter properties file.
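For example, assuming a handler named cassandra in the handler list (as in the sample configuration later in this chapter), the consistency level can be raised from the driver default in the Java Adapter properties file. The QUORUM level here is illustrative:

```properties
# Require a quorum of replica acknowledgements before a write is
# considered complete (safer, but slower, than ONE).
gg.handler.cassandra.consistencyLevel=QUORUM
```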
Parent topic: Using the Cassandra Handler
Topics:
Parent topic: Using the Cassandra Handler
Cassandra provides a number of column data types, and most of these data types are supported by the Cassandra Handler.

The following data types are supported:

ASCII, BIGINT, BLOB, BOOLEAN, DATE, DECIMAL, DOUBLE, DURATION, FLOAT, INET, INT, SMALLINT, TEXT, TIME, TIMESTAMP, TIMEUUID, TINYINT, UUID, VARCHAR, VARINT

The following data types are not supported:

COUNTER, MAP, SET, LIST, UDT (user defined type), TUPLE, CUSTOM_TYPE
The following operations are supported:

- INSERT
- UPDATE (captured as INSERT)
- DELETE

The Cassandra commit log files do not record any before images for the UPDATE or DELETE operations, so the captured operations never have a before image section. Oracle GoldenGate features that rely on before image records, such as Conflict Detection and Resolution, are not available.

The following operations are not supported:

- TRUNCATE
- DDL (CREATE, ALTER, DROP)
The data type of the column value in the source trail file must be converted to the corresponding Java type representing the Cassandra column type in the Cassandra Handler. This data conversion introduces the risk of a runtime conversion error. A poorly mapped field (such as a source varchar column containing alphanumeric data mapped to a Cassandra int column) may cause a runtime error and cause the Cassandra Handler to abend. You can view the Cassandra Java type mappings at:
https://github.com/datastax/java-driver/tree/3.x/manual#cql-to-java-type-mapping
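The risk can be illustrated with a standalone sketch of the kind of conversion the handler performs internally. This is not handler code; the method name and values are hypothetical:

```java
// Sketch: a source varchar column mapped to a Cassandra int column.
// Numeric strings convert cleanly; alphanumeric data cannot be parsed,
// producing the runtime conversion error described above.
public class ConversionSketch {
    // Returns the converted value, or null when the conversion fails.
    // The Cassandra Handler itself would abend on such an error.
    static Integer toCassandraInt(String sourceValue) {
        try {
            return Integer.parseInt(sourceValue);
        } catch (NumberFormatException e) {
            return null; // alphanumeric data such as "AB123" lands here
        }
    }

    public static void main(String[] args) {
        System.out.println(toCassandraInt("123"));
        System.out.println(toCassandraInt("AB123"));
    }
}
```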
It is possible that the data may require specialized processing to get converted to the corresponding Java type for intake into Cassandra. If this is the case, you have two options:
- Use the general regular expression search and replace functionality to format the source column value data in a way that can be converted into the Java data type for use in Cassandra.
- Implement or extend the default data type conversion logic to override it with custom logic for your use case. Contact Oracle Support for guidance.
Parent topic: Detailing the Functionality
Traditional RDBMSs separate structured data into tables. Related tables are included in higher-level collections called databases. Cassandra contains both of these concepts. Tables in an RDBMS are also tables in Cassandra, while database schemas in an RDBMS are keyspaces in Cassandra.
It is important to understand how the metadata definitions in the source trail file map to the corresponding keyspace and table in Cassandra. Source tables are generally either two-part names defined as schema.table, or three-part names defined as catalog.schema.table.
The following table explains how catalog, schema, and table names map into Cassandra. Unless you use special syntax, Cassandra converts all keyspace, table names, and column names to lower case.
Table Name in Source Trail File | Cassandra Keyspace Name | Cassandra Table Name |
---|---|---|
schema.table | schema | table |
catalog.schema.table | catalog_schema | table |
Parent topic: Detailing the Functionality
Topics:
Parent topic: Detailing the Functionality
The Cassandra Handler does not automatically create keyspaces in Cassandra. Keyspaces in Cassandra define a replication factor, the replication strategy, and topology. The Cassandra Handler does not have enough information to create the keyspaces, so you must manually create them.
You can create keyspaces in Cassandra by using the CREATE KEYSPACE
command from the Cassandra shell.
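For example, the following statement creates a keyspace from the Cassandra shell. The keyspace name sales and the replication settings are illustrative; choose the strategy and replication factor appropriate to your cluster:

```sql
-- Simple single-datacenter strategy with three replicas per row.
CREATE KEYSPACE IF NOT EXISTS sales
  WITH REPLICATION = { 'class': 'SimpleStrategy', 'replication_factor': 3 };
```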
Parent topic: About DDL Functionality
The Cassandra Handler can automatically create tables in Cassandra if you configure it to do so. The source table definition may be a poor source of information to create tables in Cassandra. Primary keys in Cassandra are divided into:
- Partitioning keys that define how data for a table is separated into partitions in Cassandra.
- Clustering keys that define the order of items within a partition.
In the default mapping for automated table creation, the first primary key is the partition key, and any additional primary keys are mapped as clustering keys.
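As a hypothetical example of this default mapping, a source table whose primary key columns are order_id followed by line_no would be created with order_id as the partition key and line_no as a clustering key. The keyspace, table, and column names below are illustrative:

```sql
-- First source primary key column becomes the partition key;
-- any additional key columns become clustering keys.
CREATE TABLE sales.orders (
    order_id bigint,
    line_no int,
    item text,
    qty int,
    PRIMARY KEY ((order_id), line_no)
);
```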
Automated table creation by the Cassandra Handler may be fine for proof of concept, but it may result in data definitions that do not scale well. When the Cassandra Handler creates tables with poorly constructed primary keys, the performance of ingest and retrieval may decrease as the volume of data stored in Cassandra increases. Oracle recommends that you analyze the metadata of your replicated tables, then manually create corresponding tables in Cassandra that are properly partitioned and clustered for higher scalability.
Primary key definitions for tables in Cassandra are immutable after they are created. Changing a Cassandra table primary key definition requires the following manual steps:
1. Create a staging table.
2. Populate the data in the staging table from the original table.
3. Drop the original table.
4. Re-create the original table with the modified primary key definitions.
5. Populate the data in the original table from the staging table.
6. Drop the staging table.
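The steps above can be sketched in cqlsh as follows. The keyspace, table, and column names are hypothetical, and the cqlsh COPY command is only one of several ways to move the data:

```sql
-- 1. Create a staging table with the same columns as the original.
CREATE TABLE sales.orders_stage (order_id bigint, line_no int, item text,
    PRIMARY KEY ((order_id), line_no));
-- 2. Populate the staging table from the original table.
COPY sales.orders TO '/tmp/orders.csv';
COPY sales.orders_stage FROM '/tmp/orders.csv';
-- 3. Drop the original table.
DROP TABLE sales.orders;
-- 4. Re-create the original table with the modified primary key definition.
CREATE TABLE sales.orders (order_id bigint, line_no int, item text,
    PRIMARY KEY ((order_id, line_no)));
-- 5. Populate the original table from the staging table.
COPY sales.orders FROM '/tmp/orders.csv';
-- 6. Drop the staging table.
DROP TABLE sales.orders_stage;
```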
Parent topic: About DDL Functionality
You can configure the Cassandra Handler to add columns that exist in the source trail file table definition but are missing in the Cassandra table definition. The Cassandra Handler can accommodate metadata change events of this kind. A reconciliation process reconciles the source table definition to the Cassandra table definition. When the Cassandra Handler is configured to add columns, any columns found in the source table definition that do not exist in the Cassandra table definition are added. The reconciliation process for a table occurs after application startup the first time an operation for the table is encountered. The reconciliation process reoccurs after a metadata change event on a source table, when the first operation for the source table is encountered after the change event.
Parent topic: About DDL Functionality
You can configure the Cassandra Handler to drop columns that do not exist in the source trail file definition but exist in the Cassandra table definition. The Cassandra Handler can accommodate metadata change events of this kind. A reconciliation process reconciles the source table definition to the Cassandra table definition. When the Cassandra Handler is configured to drop columns, any columns found in the Cassandra table definition that are not in the source table definition are dropped.
Caution:
Dropping a column permanently removes data from a Cassandra table. Carefully consider your use case before you configure this mode.
Note:
Primary key columns cannot be dropped. Attempting to do so results in an abend.
Note:
Column name changes are not well-handled because no DDL is processed. When a column name changes in the source database, the Cassandra Handler interprets it as dropping an existing column and adding a new column.
Parent topic: About DDL Functionality
The Cassandra Handler pushes operations to Cassandra using either the asynchronous or synchronous API. In asynchronous mode, operations are flushed at transaction commit (grouped transaction commit using GROUPTRANSOPS
) to ensure write durability. The Cassandra Handler does not interface with Cassandra in a transactional way.
Insert, update, and delete operations are processed differently in Cassandra than a traditional RDBMS. The following explains how insert, update, and delete operations are interpreted by Cassandra:
- Inserts: If the row does not exist in Cassandra, then an insert operation is processed as an insert. If the row already exists in Cassandra, then an insert operation is processed as an update.
- Updates: If the row does not exist in Cassandra, then an update operation is processed as an insert. If the row already exists in Cassandra, then an update operation is processed as an update.
- Deletes: If the row does not exist in Cassandra, then a delete operation has no effect. If the row exists in Cassandra, then a delete operation is processed as a delete.
Operation processing in Cassandra is idempotent. You can replay the source trail files or replay sections of the trail files. The state of the Cassandra database is the same regardless of the number of times that the trail data is written into Cassandra.
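The idempotency described above follows from the upsert semantics of the operations. This standalone sketch (hypothetical rows, not handler code) shows that applying the same operation sequence once or many times converges to an identical state:

```java
import java.util.*;

// Sketch: inserts and updates behave as upserts keyed by primary key,
// and deletes remove the key, so replaying the same trail operations
// any number of times yields the same final state.
public class IdempotencySketch {
    static Map<String, String> apply(List<String[]> ops, int replays) {
        Map<String, String> rows = new HashMap<>();
        for (int pass = 0; pass < replays; pass++) {
            for (String[] op : ops) {
                if (op[0].equals("DELETE")) {
                    rows.remove(op[1]);     // no-op when the row is absent
                } else {
                    rows.put(op[1], op[2]); // INSERT and UPDATE both upsert
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String[]> ops = Arrays.asList(
                new String[]{"INSERT", "1", "alice"},
                new String[]{"UPDATE", "1", "alice-v2"},
                new String[]{"DELETE", "2", null});
        // Replaying three times produces the same state as a single pass.
        System.out.println(apply(ops, 1).equals(apply(ops, 3)));
    }
}
```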
Parent topic: Detailing the Functionality
Oracle GoldenGate allows you to control the data that is propagated to the source trail file in the event of an update. The data for an update in the source trail file is either a compressed or a full image of the update, and the column information is provided as follows:

- Compressed: the primary keys and the columns for which the value changed. Data for columns that have not changed is not provided in the trail file.
- Full image: all columns, including primary keys, columns for which the value has changed, and columns for which the value has not changed.
The amount of information about an update is important to the Cassandra Handler. If the source trail file contains full images of the change data, then the Cassandra Handler can use prepared statements to perform row updates in Cassandra. Full images also allow the Cassandra Handler to perform primary key updates for a row in Cassandra. In Cassandra, primary keys are immutable, so an update that changes a primary key must be treated as a delete and an insert. Conversely, when compressed updates are used, prepared statements cannot be used for Cassandra row updates. Simple statements identifying the changing values and primary keys must be dynamically created and then executed. With compressed updates, primary key updates are not possible and as a result, the Cassandra Handler will abend.
You must set the control property gg.handler.name.compressedUpdates so that the handler expects either compressed or full image updates.
The default value, true, sets the Cassandra Handler to expect compressed updates. Prepared statements are not used for updates, and primary key updates cause the handler to abend.
When the value is false, prepared statements are used for updates, and primary key updates can be processed. However, if the source trail file does not actually contain full image data, the missing column values are treated as null, and the null value is pushed to Cassandra. If you are not sure whether the source trail files contain compressed or full image data, set gg.handler.name.compressedUpdates to true.
CLOB and BLOB data types do not propagate LOB data in updates unless the LOB column value changed. Therefore, if the source tables contain LOB data, set gg.handler.name.compressedUpdates
to true
.
Parent topic: Detailing the Functionality
Primary key values for a row in Cassandra are immutable. An update operation that changes any primary key value for a Cassandra row must be treated as a delete and insert. The Cassandra Handler can process update operations that result in the change of a primary key in Cassandra only as a delete and insert. To successfully process this operation, the source trail file must contain the complete before and after change data images for all columns. The gg.handler.name.compressedUpdates configuration property of the Cassandra Handler must be set to false for primary key updates to be successfully processed.
Parent topic: Detailing the Functionality
Instructions for configuring the Cassandra Handler components and running the handler are described in the following sections.
Before you run the Cassandra Handler, you must install the Datastax Driver for Cassandra and set the gg.classpath
configuration property.
Get the Driver Libraries
The Datastax Java Driver for Cassandra does not ship with Oracle GoldenGate for Big Data. You can download the recommended version (3.3.1) of the Datastax Java Driver for Cassandra at:
https://github.com/datastax/java-driver
Set the Classpath
You must configure the gg.classpath
configuration property in the Java Adapter properties file to specify the JARs for the Datastax Java Driver for Cassandra. Ensure that this JAR is first in the list.
gg.classpath=/path_to_repository/com/datastax/cassandra/cassandra-driver-core/3.3.1/cassandra-driver-core-3.3.1.jar:/path_to_apache_cassandra/cassandra-3.11.0/lib/*
Topics:
Parent topic: Using the Cassandra Handler
The following are the configurable values for the Cassandra Handler. These properties are located in the Java Adapter properties file (not in the Replicat properties file).
To enable the selection of the Cassandra Handler, you must first configure the handler type by specifying gg.handler.name.type=cassandra
and the other Cassandra properties as follows:
Table 3-1 Cassandra Handler Configuration Properties
Properties | Required/ Optional | Legal Values | Default | Explanation |
---|---|---|---|---|
gg.handlerlist | Required | Any string | None | Provides a name for the Cassandra Handler. The Cassandra Handler name then becomes part of the property names listed in this table. |
gg.handler.name.type | Required | cassandra | None | Selects the Cassandra Handler for streaming change data capture into Cassandra. |
gg.handler.name.mode | Optional | op or tx | op | The default is recommended. In op mode, operations are processed as they are received. In tx mode, operations are cached and processed at transaction commit. |
gg.handler.name.contactPoints | Optional | A comma-separated list of host names that the Cassandra Handler connects to. | localhost | A comma-separated list of the Cassandra host machines for the driver to establish an initial connection to the Cassandra cluster. This configuration property does not need to include all the machines enlisted in the Cassandra cluster. By connecting to a single machine, the driver can learn about other machines in the Cassandra cluster and establish connections to those machines as required. |
gg.handler.name.username | Optional | A legal username string. | None | A user name for the connection to Cassandra. Required if Cassandra is configured to require credentials. |
gg.handler.name.password | Optional | A legal password string. | None | A password for the connection to Cassandra. Required if Cassandra is configured to require credentials. |
gg.handler.name.compressedUpdates | Optional | true or false | true | Sets whether the Cassandra Handler expects compressed or full image updates from the source trail file. A value of true means the handler expects compressed updates (only primary key columns and changed columns). A value of false means the handler expects full image updates (all column values). |
gg.handler.name.ddlHandling | Optional | CREATE, ADD, DROP in any combination, delimited by commas | None | Configures the Cassandra Handler for the DDL functionality to provide. When CREATE is enabled, the handler creates the target table in Cassandra if it does not exist. When ADD is enabled, the handler adds columns that exist in the source table definition but not in the target Cassandra table definition. When DROP is enabled, the handler drops columns that exist in the target Cassandra table definition but not in the source table definition. |
gg.handler.name.cassandraMode | Optional | async or sync | async | Sets the interaction between the Cassandra Handler and Cassandra. Set to async for asynchronous interaction, where operations are sent to Cassandra asynchronously and flushed at transaction commit to ensure durability; asynchronous mode provides better performance. Set to sync for synchronous interaction, where operations are sent to Cassandra synchronously. |
gg.handler.name.consistencyLevel | Optional | ONE, TWO, THREE, ANY, ALL, QUORUM, LOCAL_ONE, LOCAL_QUORUM, EACH_QUORUM | The Cassandra default. | Sets the consistency level for operations with Cassandra. It configures the criteria that must be met for storage on the Cassandra cluster when an operation is executed. Lower levels of consistency may provide better performance, while higher levels of consistency are safer. |
gg.handler.name.withSSL | Optional | true or false | false | Set to true to enable secured connections to the Cassandra cluster using SSL. This requires additional Java boot options configuration, see http://docs.datastax.com/en/developer/java-driver/3.3/manual/ssl/. |
gg.handler.name.port | Optional | Integer | 9042 | Sets the port number on which the Cassandra Handler attempts to connect to Cassandra server instances. You can override the default in the Cassandra YAML files. |
Parent topic: Setting Up and Running the Cassandra Handler
The following is a sample configuration for the Cassandra Handler from the Java Adapter properties file:
gg.handlerlist=cassandra
#The handler properties
gg.handler.cassandra.type=cassandra
gg.handler.cassandra.mode=op
gg.handler.cassandra.contactPoints=localhost
gg.handler.cassandra.ddlHandling=CREATE,ADD,DROP
gg.handler.cassandra.compressedUpdates=true
gg.handler.cassandra.cassandraMode=async
gg.handler.cassandra.consistencyLevel=ONE
Parent topic: Setting Up and Running the Cassandra Handler
The Cassandra Handler connection to the Cassandra Cluster can be secured using user name and password credentials. These are set using the following configuration properties:
gg.handler.name.username
gg.handler.name.password
Optionally, the connection to the Cassandra cluster can be secured using SSL. To enable SSL security, set the following parameter:
gg.handler.name.withSSL=true
Additionally, the Java bootoptions
must be configured to include the location and password of the keystore
and the location and password of the truststore
. For example:
javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=.:ggjava/ggjava.jar:./dirprm -Djavax.net.ssl.trustStore=/path/to/client.truststore -Djavax.net.ssl.trustStorePassword=password123 -Djavax.net.ssl.keyStore=/path/to/client.keystore -Djavax.net.ssl.keyStorePassword=password123
Parent topic: Setting Up and Running the Cassandra Handler
The Cassandra Handler performs the table check and reconciliation process the first time an operation for a source table is encountered. Additionally, a DDL event or a metadata change event causes the table definition in the Cassandra Handler to be marked as dirty. Therefore, the next time an operation for the table is encountered, the handler repeats the table check and reconciliation process as described in the following section.
Topics:
Parent topic: Using the Cassandra Handler
The Cassandra Handler first interrogates the target Cassandra database to determine whether the target Cassandra keyspace exists. If the target Cassandra keyspace does not exist, then the Cassandra Handler abends. Keyspaces must be created by the user. The log file contains an error message identifying the exact keyspace name that the Cassandra Handler expects.
Next, the Cassandra Handler interrogates the target Cassandra database for the table definition. If the table does not exist, the Cassandra Handler either creates a table if gg.handler.name.ddlHandling
includes the CREATE
option or abends the process. A message is logged that shows you the table that does not exist in Cassandra.
If the table exists in Cassandra, then the Cassandra Handler reconciles the table definition from the source trail file and the table definition in Cassandra. This reconciliation process searches for columns that exist in the source table definition and not in the corresponding Cassandra table definition. If it locates columns fitting this criteria and the gg.handler.name.ddlHandling
property includes ADD
, then the Cassandra Handler adds the columns to the target table in Cassandra. Otherwise, it ignores these columns.
Next, the Cassandra Handler searches for columns that exist in the target Cassandra table but do not exist in the source table definition. If it locates columns that fit this criteria and the gg.handler.name.ddlHandling
property includes DROP
, then the Cassandra Handler removes these columns from the target table in Cassandra. Otherwise those columns are ignored.
Finally, the prepared statements are built.
Parent topic: About Automated DDL Handling
You can capture all of the new change data into your Cassandra database, including the DDL changes in the trail, for the target apply. Following are the acceptance criteria:

- AC1: Support Cassandra as a bulk extract
- AC2: Support Cassandra as a CDC source
- AC4: All Cassandra supported data types are supported
- AC5: Should be able to write into different tables based on any filter conditions, like Updates to Update tables or based on primary keys
- AC7: Support parallel processing with multiple threads
- AC8: Support filtering based on keywords
- AC9: Support for metadata provider
- AC10: Support for DDL handling on sources and target
- AC11: Support for target creation and updating of metadata
- AC12: Support for error handling and extensive logging
- AC13: Support for Conflict Detection and Resolution
- AC14: Performance should be on par or better than HBase
Parent topic: About Automated DDL Handling
Configuring the Cassandra Handler for async mode provides better performance than sync mode. The Replicat property GROUPTRANSOPS should be set to the default value of 1000.
Setting the consistency level directly affects performance. The higher the consistency level, the more work must occur on the Cassandra cluster before the transmission of a given operation can be considered complete. Select the minimum consistency level that still satisfies the requirements of your use case.
The Cassandra Handler can work in either operation (op) or transaction (tx) mode. For the best performance, operation mode is recommended:
gg.handler.name.mode=op
Parent topic: Using the Cassandra Handler
The Cassandra database requires at least one primary key, and the value of any primary key cannot be null. Automated table creation fails for source tables that do not have a primary key.
When gg.handler.name.compressedUpdates=false is set, the Cassandra Handler expects full before and after images of the data in update operations.
Note:
Using this property setting with a source trail file that contains partial image updates results in null values being updated for columns for which the data is missing. This configuration is incorrect, and update operations pollute the target data with null values in columns that did not change.

The Cassandra Handler does not process DDL from the source database, even if the source database provides DDL. Instead, it reconciles between the source table definition and the target Cassandra table definition. A DDL statement executed at the source database that changes a column name appears to the Cassandra Handler as if a column is dropped from the source table and a new column is added. This behavior depends on how the gg.handler.name.ddlHandling property is configured.
gg.handler.name.ddlHandling Configuration | Behavior |
---|---|
Not configured for ADD or DROP | Old column name and data maintained in Cassandra. New column is not created in Cassandra, so no data is replicated for the new column name from the DDL change forward. |
Configured for ADD but not DROP | Old column name and data maintained in Cassandra. New column is created in Cassandra, and data is replicated for the new column name from the DDL change forward. The data in the old and new columns is mismatched before and after the DDL change. |
Configured for DROP but not ADD | Old column name and data dropped in Cassandra. New column is not created in Cassandra, so no data is replicated for the new column name. |
Configured for both ADD and DROP | Old column name and data dropped in Cassandra. New column is created in Cassandra, and data is replicated for the new column name from the DDL change forward. |
Parent topic: Using the Cassandra Handler
This section contains information to help you troubleshoot various issues. Review the following topics for additional help:
Parent topic: Using the Cassandra Handler
When the classpath does not include the required client libraries, a ClassNotFoundException appears in the log file. To troubleshoot, set the Java Adapter logging to DEBUG, and then run the process again. At the debug level, the log contains data about the JARs that were added to the classpath from the gg.classpath configuration variable. The gg.classpath variable supports the asterisk (*) wildcard character to select all JARs in a configured directory. For example, /usr/cassandra/cassandra-java-driver-3.3.1/*:/usr/cassandra/cassandra-java-driver-3.3.1/lib/*.
For more information about setting the classpath, see Setting Up and Running the Cassandra Handler and Cassandra Handler Client Dependencies.
Parent topic: Troubleshooting
**** Begin Cassandra Handler - Configuration Summary ****
Mode of operation is set to op.
The Cassandra cluster contact point(s) is [localhost].
The handler has been configured for GoldenGate compressed updates (partial image updates).
Sending data to Cassandra in [ASYNC] mode.
The Cassandra consistency level has been set to [ONE].
Cassandra Handler DDL handling:
The handler will create tables in Cassandra if they do not exist.
The handler will add columns to Cassandra tables for columns in the source metadata that do not exist in Cassandra.
The handler will drop columns in Cassandra tables for columns that do not exist in the source metadata.
**** End Cassandra Handler - Configuration Summary ****
Parent topic: Troubleshooting
When running the Cassandra handler, you may experience a com.datastax.driver.core.exceptions.WriteTimeoutException
exception that causes the Replicat process to abend. It is likely to occur under some or all of the following conditions:
The Cassandra Handler processes large numbers of operations, putting the Cassandra cluster under a significant processing load.
GROUPTRANSOPS is configured higher than the default value of 1000.
The Cassandra Handler is configured in asynchronous mode.
The Cassandra Handler is configured with a consistency level higher than ONE
.
When this problem occurs, the Cassandra Handler is streaming data faster than the Cassandra cluster can process it. The write latency in the Cassandra cluster eventually exceeds the write request timeout period, which in turn results in the exception.
The following are potential solutions:
Increase the write request timeout period. This is controlled with the write_request_timeout_in_ms
property in Cassandra and is located in the cassandra.yaml
file in the cassandra_install/conf
directory. The default is 2000 (2 seconds). You can increase this value to move past the error, and then restart the Cassandra node or nodes for the change to take effect.
Decrease the GROUPTRANSOPS
configuration value of the Replicat process. Typically, decreasing the GROUPTRANSOPS
configuration decreases the size of transactions processed and reduces the likelihood that the Cassandra Handler can overtax the Cassandra cluster.
Reduce the consistency level of the Cassandra Handler. This in turn reduces the amount of work the Cassandra cluster has to complete for an operation to be considered as written.
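For the first solution, the timeout is raised in cassandra.yaml on each node. The value of 5000 here is illustrative:

```yaml
# cassandra.yaml (per node): raise the write request timeout from the
# default 2000 ms to 5000 ms, then restart the node to apply the change.
write_request_timeout_in_ms: 5000
```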
Parent topic: Troubleshooting
The java.lang.NoClassDefFoundError: io/netty/util/Timer error can occur with both the 3.3 and 3.2 versions of the downloaded Datastax Java Driver because the netty-common JAR file is inadvertently missing from the Datastax driver tar file. You must manually obtain the netty-common JAR file of the same Netty version, and then add it to the classpath.
Parent topic: Troubleshooting
If you did not add the cassandra-driver-core-3.3.1.jar file to the gg.classpath property, then this exception can occur:

com.datastax.driver.core.exceptions.UnresolvedUserTypeException: Cannot resolve user type keyspace.duration

This exception occurs when there are tables with a duration data type column. Including the Cassandra driver cassandra-driver-core-3.3.1.jar in the gg.classpath property resolves the error. See Setting Up and Running the Cassandra Handler.
Parent topic: Troubleshooting