2 Using the BigQuery Handler

Learn how to use the Google BigQuery Handler, which streams change data capture data from source trail files into Google BigQuery.

BigQuery is a RESTful web service that enables interactive analysis of massive datasets, working in conjunction with Google Cloud Storage. For more information, see https://cloud.google.com/bigquery/.

Topics:

2.1 Detailing the Functionality

Topics:

2.1.1 Data Types

The BigQuery Handler supports most of the standard SQL data types in BigQuery. The handler converts each column value in the trail file to the Java type that represents the corresponding BigQuery column type.

The following data types are supported:

STRING
BYTES
INTEGER
FLOAT
NUMERIC
BOOLEAN
TIMESTAMP
DATE
TIME
DATETIME

The BigQuery Handler does not support complex data types, such as ARRAY and STRUCT.

2.1.2 Operation Modes

You can configure the BigQuery Handler in one of these two modes:

auditLogMode = true

When the handler is configured to run in audit log mode, data is pushed into Google BigQuery without a unique ID or primary key. As a result, Google BigQuery cannot merge different operations on the same row.

Also, the order in which the audit log is displayed in the BigQuery data set is not deterministic.

To overcome these limitations, specify optype and position in the meta columns template for the handler. This adds two columns with those names to the table schema in Google BigQuery. For example:

gg.handler.bigquery.metaColumnsTemplate = ${optype}, ${position}

The optype column identifies the operation type for each row in the audit log.

To view the audit log in the order in which operations were processed in the trail file, include position in the ORDER BY clause when querying the table in Google BigQuery. For example:

SELECT * FROM [projectId:datasetId.tableId] ORDER BY position
auditLogMode = false

This causes the handler to write data into Google BigQuery specifying a unique id and primary key for each row. As a result, Google BigQuery is able to merge different operations on the same row.

The trail source needs to have a full image of the records in order to merge correctly.

Google BigQuery processes every operation as an insert for each row. As a result, a deleted column is added to the schema for the table in this mode of operation. When the handler encounters a delete operation on a row, it inserts the row into Google BigQuery and sets the deleted column to true.

To view data in the BigQuery table as it would appear in an RDBMS, add a WHERE deleted = false clause when querying the table in Google BigQuery.
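For example, using the same placeholder table identifier as the earlier audit-log example, the following query returns only the current, non-deleted rows:

```sql
SELECT * FROM [projectId:datasetId.tableId] WHERE deleted = false
```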

2.1.3 Operation Processing Support

The BigQuery Handler pushes operations to Google BigQuery using a synchronous API. Insert, update, and delete operations are processed differently in BigQuery than in a traditional RDBMS.

The following explains how insert, update, and delete operations are interpreted by the handler depending on the mode of operation:

auditLogMode = true
  • insert – Inserts the record with optype as an insert operation in the BigQuery table.

  • update – Inserts the record with optype as an update operation in the BigQuery table.

  • delete – Inserts the record with optype as a delete operation in the BigQuery table.

  • pkUpdate – When the pkUpdateHandling property is configured as delete-insert, the handler sends out a delete operation followed by an insert operation. Both rows have the same position in the BigQuery table, which identifies it as a primary key operation rather than a separate delete and insert operation.

auditLogMode = false
  • insert – If the row does not already exist in Google BigQuery, then an insert operation is processed as an insert. If the row already exists in Google BigQuery, then an insert operation is processed as an update. The handler sets the deleted column to false.

  • update – If the row does not exist in Google BigQuery, then an update operation is processed as an insert. If the row already exists in Google BigQuery, then an update operation is processed as an update. The handler sets the deleted column to false.

  • delete – If the row does not exist in Google BigQuery, then a delete operation has no effect. If the row exists in Google BigQuery, then a delete operation is processed as a delete. The handler sets the deleted column to true.

  • pkUpdate – When the pkUpdateHandling property is configured as delete-insert, the handler sets the deleted column to true for the row whose primary key is updated. This is followed by a separate insert operation with the new primary key and the deleted column set to false.
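As an illustration, assume a table with a hypothetical primary key column id, and a source update that changes id from 1 to 2. With pkUpdateHandling set to delete-insert, the table then holds two rows for that operation, which a query such as the following would show (column names are illustrative, not part of the handler):

```sql
-- Hypothetical state after the primary key update from 1 to 2:
--   id = 1, deleted = true    (row with the old key, marked deleted)
--   id = 2, deleted = false   (row with the new key)
SELECT id, deleted FROM [projectId:datasetId.tableId] WHERE id IN (1, 2)
```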

Do not toggle the audit log mode, because doing so forces the BigQuery Handler to abend: Google BigQuery cannot alter the schema of an existing table. You must delete the existing table before switching audit log modes.
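For example, assuming the Google Cloud bq command-line tool is installed and configured for your project (the data set and table names below are placeholders), the existing table can be deleted with:

```shell
# Delete the existing target table before switching audit log modes
# (datasetId and tableId are placeholders for your own names):
bq rm -t datasetId.tableId
```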

Note:

The BigQuery Handler does not support the truncate operation. It abends when it encounters a truncate operation.

2.2 Setting Up and Running the BigQuery Handler

The BigQuery Client library does not ship with Oracle GoldenGate for Big Data. You must download the latest version of the Java Client library for BigQuery at:

https://developers.google.com/api-client-library/java/apis/bigquery/v2

You must configure the gg.classpath configuration property in the Java Adapter properties file to specify the JARs for the Java Client Library for BigQuery. The path to the dependency JARs must include the asterisk (*) wildcard character so that all of the JAR files in that directory are included in the associated classpath. Do not use *.jar. The following is an example of a correctly configured classpath:

gg.classpath= /path_to_repository/bigquery/libs/*:/path_to_repository/bigquery/*

Next, download the following JARs from Maven Central, and then include them in the classpath for the BigQuery Handler:

  • api-common-1.6.0.jar
  • gax-1.28.0.jar
  • gax-httpjson-0.45.0.jar
  • google-auth-library-credentials-0.9.1.jar
  • google-auth-library-oauth2-http-0.9.1.jar
  • google-cloud-bigquery-1.31.0.jar
  • google-cloud-core-1.35.0.jar
  • google-cloud-core-http-1.35.0.jar
  • google-http-client-jackson-1.23.0.jar
  • guava-25.1-jre.jar
  • threetenbp-1.3.6.jar
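If you resolve dependencies with Maven rather than downloading the JARs individually, the top-level client coordinate is sketched below; this is an assumption about your build setup, and the transitive dependencies it pulls in should correspond to the JARs listed above:

```xml
<!-- Sketch of the Maven dependency for the BigQuery client library.
     This single coordinate pulls in the transitive dependencies
     (gax, google-auth-library, guava, threetenbp, and so on). -->
<dependency>
  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-bigquery</artifactId>
  <version>1.31.0</version>
</dependency>
```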

2.2.1 Understanding the BigQuery Handler Configuration

The following are the configurable values for the BigQuery Handler. These properties are located in the Java Adapter properties file (not in the Replicat properties file).

To enable the selection of the BigQuery Handler, you must first configure the handler type by specifying gg.handler.name.type=bigquery and then set the other BigQuery properties as follows:

Table 2-1 BigQuery Handler Configuration Properties

gg.handlerlist
  Required/Optional: Required
  Legal values: Any string
  Default: None
  Explanation: Provides a name for the BigQuery Handler. The BigQuery Handler name then becomes part of the property names listed in this table.

gg.handler.name.type=bigquery
  Required/Optional: Required
  Legal values: bigquery
  Default: None
  Explanation: Selects the BigQuery Handler for streaming change data capture into Google BigQuery.

gg.handler.name.credentialsFile
  Required/Optional: Optional
  Legal values: Relative or absolute path to the credentials file
  Default: None
  Explanation: The credentials file downloaded from Google BigQuery for authentication. If you do not specify the path to the credentials file, you must set it as an environment variable; see Configuring Handler Authentication.

gg.handler.name.projectId
  Required/Optional: Required
  Legal values: Any string
  Default: None
  Explanation: The name of the project in Google BigQuery. The handler needs the project ID to connect to the Google BigQuery store.

gg.handler.name.datasetId
  Required/Optional: Optional
  Legal values: Any string
  Default: default_dataset
  Explanation: The name of the data set that the tables are stored in. If not specified, the handler creates a new data set named default_dataset and inserts the table into it.

gg.handler.name.batchSize
  Required/Optional: Optional
  Legal values: Any number
  Default: 500
  Explanation: The maximum number of operations to batch together. This applies to all target table batches.

gg.handler.name.batchFlushFrequency
  Required/Optional: Optional
  Legal values: Any number
  Default: 1000
  Explanation: The maximum amount of time, in milliseconds, to wait before executing the next batch of operations. This applies to all target table batches.

gg.handler.name.skipInvalidRows
  Required/Optional: Optional
  Legal values: true | false
  Default: false
  Explanation: Sets whether to insert all valid rows of a request, even if invalid rows exist. If not set, the entire insert request fails if it contains an invalid row.

gg.handler.name.ignoreUnknownValues
  Required/Optional: Optional
  Legal values: true | false
  Default: false
  Explanation: Sets whether to accept rows that contain values that do not match the schema. If not set, rows with unknown values are considered invalid.

gg.handler.name.connectionTimeout
  Required/Optional: Optional
  Legal values: Positive integer
  Default: 20000
  Explanation: The maximum amount of time, in milliseconds, to wait for the handler to establish a connection with Google BigQuery.

gg.handler.name.readTimeout
  Required/Optional: Optional
  Legal values: Positive integer
  Default: 30000
  Explanation: The maximum amount of time, in milliseconds, to wait for the handler to read data from an established connection.

gg.handler.name.metaColumnsTemplate
  Required/Optional: Optional
  Legal values: A legal string
  Default: None
  Explanation: A legal string specifying the metaColumns to be included. If you set auditLogMode to true, set this property so that you can view the operation type for each row inserted into the audit log.

gg.handler.name.auditLogMode
  Required/Optional: Optional
  Legal values: true | false
  Default: false
  Explanation: When set to true, the handler writes each record to the target without any primary key, and every operation is processed as an insert. When set to false, the handler merges incoming records into the target table when they have the same primary key; primary keys are required, and the trail source records must contain full image updates to merge correctly.

gg.handler.name.pkUpdateHandling
  Required/Optional: Optional
  Legal values: abend | delete-insert
  Default: abend
  Explanation: Sets how the handler handles update operations that change a primary key. Primary key operations can be problematic for the BigQuery Handler and require special consideration:
    • abend – the process abends.
    • delete-insert – the process treats the operation as a delete and an insert. The full before image is required for this option to work correctly; without full before and after row images, the insert data is incomplete. Oracle recommends this option.

2.2.2 Review a Sample Configuration

The following is a sample configuration for the BigQuery Handler from the Java Adapter properties file:

gg.handlerlist = bigquery

#The handler properties
gg.handler.bigquery.type = bigquery
gg.handler.bigquery.projectId = festive-athlete-201315
gg.handler.bigquery.datasetId = oggbd
gg.handler.bigquery.credentialsFile = credentials.json
gg.handler.bigquery.auditLogMode = true
gg.handler.bigquery.pkUpdateHandling = delete-insert

gg.handler.bigquery.metaColumnsTemplate = ${optype}, ${position}

2.2.3 Proxy Settings

To connect to BigQuery using a proxy server, you must configure the proxy host and the proxy port in the properties file as follows:

javawriter.bootoptions= -Dhttps.proxyHost=proxy_host_name -Dhttps.proxyPort=proxy_port_number

2.2.4 Configuring Handler Authentication

You must configure BigQuery Handler authentication using the credentials in the JSON file downloaded from Google BigQuery.

Download the credentials file:

  1. Log in to your Google account at cloud.google.com.

  2. Click Console, and then go to the Dashboard where you can select your project.

  3. From the navigation menu, click APIs & Services, and then select Credentials.

  4. From the Create Credentials menu, choose Service account key.

  5. Choose the JSON key type to download the JSON credentials file for your system.

Once you have the credentials file, you can authenticate the handler in one of these two ways:

  • Specify the path to the credentials file in the properties file with the gg.handler.name.credentialsFile configuration property.

    The path to the credentials file must not include a wildcard. If you include the * wildcard in the path to the credentials file, the file is not recognized.

    Or

  • Set the GOOGLE_APPLICATION_CREDENTIALS environment variable on your system. For example:

    export GOOGLE_APPLICATION_CREDENTIALS=credentials.json
    

    Then restart the Oracle GoldenGate manager process.