ORACLE_HDFS and ORACLE_HIVE Access Drivers

17 ORACLE_HDFS and ORACLE_HIVE Access Drivers

The properties used to create an external table that uses the ORACLE_HDFS or ORACLE_HIVE access drivers are specified in a SQL CREATE TABLE ORGANIZATION EXTERNAL statement, in the opaque_format_spec clause of ACCESS PARAMETERS.

The following topics are included:

Syntax Rules for Specifying Properties
The properties are set using keyword-value pairs in the SQL CREATE TABLE ACCESS PARAMETERS clause and in the configuration files.
ORACLE_HDFS Access Parameters
The access parameters for the ORACLE_HDFS access driver provide the metadata needed to locate the data in HDFS and generate a Hive table over it.
ORACLE_HIVE Access Parameters
ORACLE_HIVE retrieves metadata about external data sources from the Hive catalog.
Descriptions of com.oracle.bigdata Parameters
This topic provides descriptions of the com.oracle.bigdata parameters used by the ORACLE_HIVE and ORACLE_HDFS access drivers.

17.1 Syntax Rules for Specifying Properties

The properties are set using keyword-value pairs in the SQL CREATE TABLE ACCESS PARAMETERS clause and in the configuration files.

The syntax must obey these rules:

The format of each keyword-value pair is a keyword, a colon or equal sign, and a value. The following are valid keyword-value pairs:
```
keyword=value
keyword:value
```
The value is everything from the first non-whitespace character after the separator to the end of the line. Whitespace between the separator and the value is ignored. Trailing whitespace for the value is retained.
A property definition can be on one line or multiple lines.
A line terminator is a line feed, a carriage return, or a carriage return followed by line feeds.
When a property definition spans multiple lines, then precede the line terminators with a backslash (escape character), except on the last line. In this example, the value of the Keyword1 property is Value part 1 Value part 2 Value part 3.
```
Keyword1=  Value part 1 \
           Value part 2 \
           Value part 3
```
You can create a logical line by stripping each physical line of leading whitespace and concatenating the lines. The parser extracts the property names and values from the logical line.
You can embed special characters in a property name or property value by preceding a character with a backslash (escape character), indicating the substitution. Table 17-1 describes the special characters.

Table 17-1 Special Characters in Properties

Escape Sequence	Character
`\b`	Backspace (`\u0008`)
`\t`	Horizontal tab (`\u0009`)
`\n`	Line feed (`\u000a)`
`\f`	Form feed (`\u000c`)
`\r`	Carriage return (`\u000d`)
`\"`	Double quote (`\u0022`)
`\'`	Single quote (`\u0027`)
`\\`	Backslash (`\u005c`) When multiple backslashes are at the end of the line, the parser continues the value to the next line only for an odd number of backslashes.
`\uxxxx`	2-byte, big-endian, Unicode code point. When a character requires two code points (4 bytes), the parser expects `\u` for the second code point.

Parent topic: ORACLE_HDFS and ORACLE_HIVE Access Drivers

17.2 ORACLE_HDFS Access Parameters

The access parameters for the ORACLE_HDFS access driver provide the metadata needed to locate the data in HDFS and generate a Hive table over it.

Default Parameter Settings for ORACLE_HDFS
Describes default parameter settings for ORACLE_HDFS.
Optional Parameter Settings for ORACLE_HDFS
ORACLE_HDFS supports optional com.oracle.bigdata parameters, which you can specify in the opaque_format_spec clause.

Parent topic: ORACLE_HDFS and ORACLE_HIVE Access Drivers

17.2.1 Default Parameter Settings for ORACLE_HDFS

Describes default parameter settings for ORACLE_HDFS.

If you omit all access parameters from the CREATE TABLE statement, then ORACLE_HDFS uses the following default values:

com.oracle.bigdata.rowformat=DELIMITED
com.oracle.bigdata.fileformat=TEXTFILE
com.oracle.bigdata.overflow={"action":"error"}
com.oracle.bigdata.erroropt={"action":"setnull"}

Parent topic: ORACLE_HDFS Access Parameters

17.2.2 Optional Parameter Settings for ORACLE_HDFS

ORACLE_HDFS supports optional com.oracle.bigdata parameters, which you can specify in the opaque_format_spec clause.

Specifically:

The following example shows a CREATE TABLE statement in which multiple access parameters are set.

Example 17-1 Setting Multiple Access Parameters for ORACLE_HDFS

CREATE TABLE ORDER (CUST_NUM VARCHAR2(10), 
                    ORDER_NUM VARCHAR2(20), 
                    ORDER_DATE DATE,
                    ITEM_CNT NUMBER,
                    DESCRIPTION VARCHAR2(100),
                    ORDER_TOTAL (NUMBER8,2)) ORGANIZATION EXTERNAL 
             (TYPE ORACLE_HDFS
 ACCESS PARAMETERS (
  com.oracle.bigdata.fields: (CUST_NUM,              \
                              ORDER_NUM,             \
                              ORDER_DATE,            \
                              ORDER_LINE_ITEM_COUNT, \
                              DESCRIPTION,           \
                              ORDER_TOTAL)
  com.oracle.bigdata.colMap:     {"col":"item_cnt", \
                                  "field":"order_line_item_count"}
  com.oracle.bigdata.overflow:   {"action":"TRUNCATE", \
                                  "col":"DESCRIPTION"}
  com.oracle.bigdata.errorOpt:  [{"action":"replace", \
                                  "value":"INVALID NUM", \
                                  "col":["CUST_NUM","ORDER_NUM"]} , \
                                 {"action":"reject", \
                                  "col":"ORDER_TOTAL}]
 )
 LOCATION ("hdfs:/usr/cust/summary/*"));

Parent topic: ORACLE_HDFS Access Parameters

17.3 ORACLE_HIVE Access Parameters

ORACLE_HIVE retrieves metadata about external data sources from the Hive catalog.

The default mapping of Hive data to columns in the external table are usually appropriate. However, some circumstances require special parameter settings, or you might want to override the default values for reasons of your own.

Default Parameter Settings for ORACLE_HIVE
Describes the default parameter settings for ORACLE_HIVE.
Optional Parameter Settings for ORACLE_HIVE
ORACLE_HIVE supports optional com.oracle.bigdata parameters, which you can specify in the opaque_format_spec clause.

Parent topic: ORACLE_HDFS and ORACLE_HIVE Access Drivers

17.3.1 Default Parameter Settings for ORACLE_HIVE

Describes the default parameter settings for ORACLE_HIVE.

If you omit all access parameters from the CREATE TABLE statement, then ORACLE_HIVE uses the following default values:

com.oracle.bigdata.tablename=name of external table
com.oracle.bigdata.overflow={"action":"error"}
com.oracle.bigdata.erroropt={"action":"setnull"}

Parent topic: ORACLE_HIVE Access Parameters

17.3.2 Optional Parameter Settings for ORACLE_HIVE

ORACLE_HIVE supports optional com.oracle.bigdata parameters, which you can specify in the opaque_format_spec clause.

Specifically:

The following example shows a CREATE TABLE statement in which multiple access parameters are set.

Example 17-2 Setting Multiple Access Parameters for ORACLE_HIVE

CREATE TABLE ORDER (cust_num VARCHAR2(10), 
                    order_num VARCHAR2(20), 
                    order_date DATE,
                    item_cnt NUMBER,
                    description VARCHAR2(100),
                    order_total (NUMBER8,2)) ORGANIZATION EXTERNAL 
(TYPE oracle_hive
 ACCESS PARAMETERS (
  com.oracle.bigdata.tableName:  order_db.order_summary
  com.oracle.bigdata.colMap:     {"col":"ITEM_CNT", \
                                  "field":"order_line_item_count"}
  com.oracle.bigdata.overflow:   {"action":"ERROR", \
                                  "col":"DESCRIPTION"}
  com.oracle.bigdata.errorOpt:   [{"action":"replace", \
                                   "value":"INV_NUM" , \
                                   "col":["CUST_NUM","ORDER_NUM"]} ,\
                                  {"action":"reject", \
                                   "col":"ORDER_TOTAL}]
));

Parent topic: ORACLE_HIVE Access Parameters

17.4 Descriptions of com.oracle.bigdata Parameters

This topic provides descriptions of the com.oracle.bigdata parameters used by the ORACLE_HIVE and ORACLE_HDFS access drivers.

com.oracle.bigdata.colmap
com.oracle.bigdata.datamode
com.oracle.bigdata.erroropt
com.oracle.bigdata.fields
com.oracle.bigdata.fileformat
com.oracle.bigdata.log.exec
com.oracle.bigdata.log.qc
Specifies how the access driver generates log files for a query.
com.oracle.bigdata.overflow
com.oracle.bigdata.rowformat
com.oracle.bigdata.tablename
The Hive parameter com.oracle.bigdata.tablename identifies the Hive table that contains the source data.

Parent topic: ORACLE_HDFS and ORACLE_HIVE Access Drivers

17.4.1 com.oracle.bigdata.colmap

Maps a column in the source data to a column in the Oracle external table. You can define one or multiple pairs of column mappings. Use this property when the source field names exceed the maximum length of Oracle column names, or when you want to use different column names in the external table.

Default Value

A column in the external table with the same name as the Hive column

Syntax

A JSON document with the keyword-value pairs is shown in the following diagram:

colmap:

Description of the illustration colmap.eps

colmap_entry:

Description of the illustration colmap_entry.eps

Semantics

"col":name

"col": The keyword must be lowercase and enclosed in quotation marks.

name: The name of a column in the Oracle external table. It is case sensitive and must be enclosed in quotation marks.

"field":name

"field": The keyword must be lowercase and enclosed in quotation marks.

name: The name of a field in the data source. It is not case sensitive, but it must be enclosed in quotation marks. See Syntax Rules for Specifying Properties.

Examples

This example maps a Hive column named ORDER_LINE_ITEM_COUNT to an Oracle column named ITEM_CNT:

com.oracle.bigdata.colMap={"col":"ITEM_CNT", \
                           "field":"order_line_item_count"}

The following example shows the mapping of multiple columns.

com.oracle.bigdata.colmap:[{"col":"KOL1", "field":"PROJECT_NAME"},{ "col":"KOL2","field":"wsdl_name"},{"col":"KOL3", "field":"method"}]

Parent topic: Descriptions of com.oracle.bigdata Parameters

17.4.2 com.oracle.bigdata.datamode

Specifies the method that SmartScan uses to scan a Hadoop data source. The method can make a significant difference in performance.

Default Value

automatic

Syntax

A JSON document with the keyword-value pairs shown in the following diagram:

datamode:

Description of the illustration datamode.eps

Semantics

automatic

Automatically selects the appropriate mode, based on the metadata. It selects c mode if possible, or java mode if the data contains formats that are not supported by c mode.

Uses Java to read the file buffers, but C code to process the data and convert it to Oracle format. Specify this mode for delimited data.

If the data contains formats that the C code does not support, then it returns an error.

java

Uses the Java SerDes and InputFormats to process the data and convert it to Oracle format. Specify this mode for Parquet, RCFile, and other data formats that require a SerDe.

Parent topic: Descriptions of com.oracle.bigdata Parameters

17.4.3 com.oracle.bigdata.erroropt

Describes how to handle errors that occur while the value of a column is calculated.

Default Value

{"action":"setnull"}

Syntax

A JSON document with the keyword-value pairs is shown in the following diagram:

erroropt:

Description of the illustration erroropt.eps

error_element:

Description of erroropt_element.eps follows

Description of the illustration erroropt_element.eps

Semantics

The "action", "reject", "setnull", "replace", "value", and "col" keywords must be lowercase and enclosed in quotation marks. See Syntax Rules for Specifying Properties.

"action":value

value: One of these keywords:

"reject": Does not load any rows.
"setnull": Sets the column to NULL.
"replace": Sets the column to the specified value.

"value":string

string: Replaces a bad value in the external table. It must be enclosed in quotation marks.

"col":name

name: Identifies a column in an external table. The column name is case sensitive, must be enclosed in quotation marks, and can be listed only once.

Example

This example sets the value of the CUST_NUM or ORDER_NUM columns to INVALID if the Hive value causes an error. For any other columns, an error just causes the Hive value to be rejected.

com.oracle.bigdata.errorOpt: {"action":"replace",\
                              "value":"INVALID", \
                              "col":["CUST_NUM","ORDER_NUM"]

Parent topic: Descriptions of com.oracle.bigdata Parameters

17.4.4 com.oracle.bigdata.fields

Lists the field names and data types of the data source.

Default Value

Not defined

Syntax

A JSON document with the keyword-value pairs is shown in the following diagram:

fields:

Description of the illustration fields.eps

data_type:

Description of the illustration data_type.eps

primitive_type:

Description of the illustration primitive_type.eps

Semantics

The syntax is the same as a field list for a Hive table. If you split the field list across multiple lines, you must use a backslash to escape the new line characters.

field_name

The name of the Hive field. Use only alphanumeric characters and underscores (_). The maximum length is 128 characters. Field names are case-insensitive.

data_type

The data type of the Hive field. Optional; the default is STRING. The character set must be UTF8.

The data type can be complex or primitive:

Hive Complex Data Types

ARRAY: Indexable list
MAP: Key-value tuples
STRUCT: List of elements
UNIONTYPE: Multiple data types

Hive Primitive Data Types

INT: 4 byte integer
BIGINT: 8 byte integer
SMALLINT: 2 byte integer
TINYINT: 1 byte integer
BOOLEAN: TRUE or FALSE
FLOAT: single precision
DOUBLE: double precision
STRING: character sequence

17.4.5 com.oracle.bigdata.fileformat

Describes the row format of the data source, based on the ROW FORMAT clause for a Hive table generated by ORACLE_HDFS.

Default Value

TEXTFILE

Syntax

A JSON document with the keyword-value pairs is shown in the following diagram.

fileformat:

Description of the illustration fileformat.eps

Semantics

ORC

Optimized row columnar file format

PARQUET

Column-oriented, binary file format

RCFILE

Record columnar file format

SEQUENCEFILE

Compressed file format

TEXTFILE

Plain text file format

INPUTFORMAT

Identifies a Java class that can extract records from the data file.

OUTPUTFORMAT

Identifies a Java class that can format the output records in the desired format

Parent topic: Descriptions of com.oracle.bigdata Parameters

17.4.6 com.oracle.bigdata.log.exec

Specifies how the access driver generates log files generated by the C code for a query, when it is running as parallel processes on CDH.

The access driver does not create or write log files when executing on a Hadoop cluster node; the parallel query processes write them. The log files from the Java code are controlled by log4j properties, which are specified in the configuration file or the access parameters.

Default Value

Not defined (no logging)

Syntax

[directory_object:]file_name_template

Semantics

directory_object

The Oracle directory object for the HDFS path on the Hadoop cluster where the log file is created.

file_name_template

A string used to generate file names. The following table describes the optional variables that you can use in the template.

Table 17-2 Variables for com.oracle.bigdata.log.exec

Variable	Value
`%p`	Operating system process identifier (PID)
`%a`	A number that uniquely identifies the process
`%%`	A percent sign (`%`)

Example

The following example generates log file names that include the PID and a unique number, such as xtlogp_hive14_3413_57:

com.oracle.bigdata.log.exec= xtlogp_hive14_%p_%a

Parent topic: Descriptions of com.oracle.bigdata Parameters

17.4.7 com.oracle.bigdata.log.qc

Specifies how the access driver generates log files for a query.

Default Value

Not defined (no logging)

Syntax

[directory_object:]file_name_template

Semantics

directory_object

Name of an Oracle directory object that points to the path where the log files are written. If this value is omitted, then the logs are written to the default directory for the external table.

file_name_template

A string used to generate file names. Table 17-3 describes the optional variables that you can use in the string.

Table 17-3 Variables for com.oracle.bigdata.log.qc

Variable	Value
`%p`	Operating system process identifier (PID)
`%%`	A percent sign (`%`)

Example

This example creates log file names that include the PID and a percent sign, such as xtlogp_hive213459_%:

com.oracle.bigdata.log.qc= xtlogp_hive21%p_%%

Parent topic: Descriptions of com.oracle.bigdata Parameters

17.4.8 com.oracle.bigdata.overflow

Describes how to handle string data that is too long for the columns in the external table. The data source can be character or binary. For Hive, the data source can also be STRUCT, UNIONTYPES, MAP, or ARRAY.

Default Value

{"action":"error"}

Syntax

A JSON document with the keyword-value pairs is shown in the following diagram:

overflow ::=

Description of the illustration overflow.eps

overflow_element ::=

Description of the illustration overflow_element.eps

Semantics

The "action", "truncate", "error", and "col" tags must be lowercase and enclosed in quotation marks. See Syntax Rules for Specifying Properties.

"action":value

The value of "action" can be one of the following keywords:

truncate: Shortens the data to fit the column.
error: Throws an error. The com.oracle.bigdata.erroropt property controls the result of the error.

"col":name

name: Identifies a column in the external table. The name is case sensitive and must be enclosed in quotation marks.

Example

This example truncates the source data for the DESCRIPTION column, if it exceeds the column width:

com.oracle.bigdata.overflow={"action":"truncate", \
                             "col":"DESCRIPTION"}

Parent topic: Descriptions of com.oracle.bigdata Parameters

17.4.9 com.oracle.bigdata.rowformat

Provides the information the access driver needs to extract fields from the records in a file.

Important:

The com.oracle.bigdata.rowformat is unrelated to the access parameter syntax of traditional external tables that use "type ORACLE_LOADER." There are keywords such as FIELDS, TERMINATED, and others that appear in both clauses, but the commonality in naming is coincidental and does not imply common functionality. The com.oracle.bigdata.rowformat access parameter is passed without change to the default Hive serde. The Hive serde to extract columns from rows is deliberately limited. Complex cases are handled by specialized serdes.

Default Value

DELIMITED

Syntax

A JSON document with the keyword-value pairs is shown in the following diagram.

rowformat:

Description of the illustration rowformat.eps

Semantics

DELIMITED

Describes the characters used to delimit the fields in a record:

FIELDS TERMINATED BY: The character that delimits every field in the record. The optional ESCAPED BY character precedes the delimit character when it appears within a field value.
COLLECTION ITEMS TERMINATED BY: The character that marks the end of an array element. Used when a column is a collection or a nested record. In this case the resulting value will be a JSON array.
MAP KEYS TERMINATED BY: The character that marks the end of an entry in a MAP field. Used when a column is a collection or a nested record. The resulting value is a JSON object.
LINES TERMINATED BY: The character that marks the end of a record.
NULL DEFINED AS: The character that indicates a null value.

SERDE

Identifies a SerDe that can parse the data and any properties of the SerDe that the access driver might need.

Example

This example specifies a SerDe for an Avro container file:

com.oracle.bigdata.rowformat:
   SERDE'org.apache.hadoop.hive.serde2.avro.AvroSerDe'

The next example specifies a SerDe for a file containing regular expressions:

com.oracle.bigdata.rowformat=\
    SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' \
    WITH SERDEPROPERTIES \ 
    ("input.regex" = "(\\\\d{6}) (\\\\d{5}) (.{29}) .*")

Parent topic: Descriptions of com.oracle.bigdata Parameters

17.4.10 com.oracle.bigdata.tablename

The Hive parameter com.oracle.bigdata.tablename identifies the Hive table that contains the source data.

Default Value

DEFAULT.external_table_name

Syntax

[hive_database_name.]table_name

Semantics

The maximum length of hive_database_name and table_name is 128 UTF-8 characters (512 bytes).

hive_database_name: The Hive database where the source data resides. DEFAULT is the name of the initial Hive database.

table_name: The Hive table with the data. If you omit table_name, then ORACLE_HIVE searches for a Hive table with the same name as the external table. Table names are case-insensitive.

Example

This setting indicates that the source data is in a table named ORDER_SUMMARY in the Hive ORDER_DB database:

com.oracle.bigdata.tablename ORDER_DB.ORDER_SUMMARY

Parent topic: Descriptions of com.oracle.bigdata Parameters