```java
public class OrcInputFormat
extends Object
implements org.apache.hadoop.mapred.InputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>,
           InputFormatChecker, VectorizedInputFormatInterface,
           AcidInputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>,
           CombineHiveInputFormat.AvoidSplitCombination
```
This class implements both the classic InputFormat, which stores the rows directly, and AcidInputFormat, which stores a series of events with the following schema:
```java
class AcidEvent<ROW> {
    enum ACTION {INSERT, UPDATE, DELETE}
    ACTION operation;
    long originalTransaction;
    int bucket;
    long rowId;
    long currentTransaction;
    ROW row;
}
```
Each AcidEvent object corresponds to a single row-level event. Together, the
originalTransaction, bucket, and rowId form the unique identifier for the row.
The operation and currentTransaction record which operation produced the event
and which transaction added it. Insert and update events include the entire
row, while delete events have null for row.
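To make the identity and versioning rules concrete, here is a minimal, illustrative sketch; the `RowVersion` class and its comparator are invented for exposition and are not part of the Hive API:

```java
import java.util.Comparator;

// Illustrative only: models the rules described above, not a Hive class.
class RowVersion {
    long originalTransaction;   // together with bucket and rowId:
    int bucket;                 //   the unique identifier of the row
    long rowId;
    long currentTransaction;    // the transaction that added this event

    // Events sharing (originalTransaction, bucket, rowId) are versions of the
    // same row; among them, the highest currentTransaction is the latest.
    static final Comparator<RowVersion> ORDER =
        Comparator.<RowVersion>comparingLong(v -> v.originalTransaction)
                  .thenComparingInt(v -> v.bucket)
                  .thenComparingLong(v -> v.rowId)
                  .thenComparingLong(v -> v.currentTransaction);
}
```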
| Modifier and Type | Class and Description |
|---|---|
| `static class` | `OrcInputFormat.NullKeyRecordReader` - Return a RecordReader that is compatible with the Hive 0.12 reader with NullWritable for the key instead of RecordIdentifier. |
Nested classes/interfaces inherited from interface AcidInputFormat: `AcidInputFormat.AcidRecordReader<K,V>`, `AcidInputFormat.Options`, `AcidInputFormat.RawReader<V>`, `AcidInputFormat.RowReader<V>`

| Constructor and Description |
|---|
| `OrcInputFormat()` |
| Modifier and Type | Method and Description |
|---|---|
| `static RecordReader` | `createReaderFromFile(Reader file, org.apache.hadoop.conf.Configuration conf, long offset, long length)` |
| `static boolean[]` | `genIncludedColumns(List<OrcProto.Type> types, org.apache.hadoop.conf.Configuration conf, boolean isOriginal)` - Take the configuration and figure out which columns we need to include. |
| `static boolean[]` | `genIncludedColumns(List<OrcProto.Type> types, List<Integer> included, boolean isOriginal)` |
| `AcidInputFormat.RawReader<OrcStruct>` | `getRawReader(org.apache.hadoop.conf.Configuration conf, boolean collapseEvents, int bucket, ValidTxnList validTxnList, org.apache.hadoop.fs.Path baseDirectory, org.apache.hadoop.fs.Path[] deltaDirectory)` - Get a reader that returns the raw ACID events (insert, update, delete). |
| `AcidInputFormat.RowReader<OrcStruct>` | `getReader(org.apache.hadoop.mapred.InputSplit inputSplit, AcidInputFormat.Options options)` - Get a record reader that provides the user-facing view of the data after it has been merged together. |
| `org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.NullWritable,OrcStruct>` | `getRecordReader(org.apache.hadoop.mapred.InputSplit inputSplit, org.apache.hadoop.mapred.JobConf conf, org.apache.hadoop.mapred.Reporter reporter)` |
| `static String[]` | `getSargColumnNames(String[] originalColumnNames, List<OrcProto.Type> types, boolean[] includedColumns, boolean isOriginal)` |
| `org.apache.hadoop.mapred.InputSplit[]` | `getSplits(org.apache.hadoop.mapred.JobConf job, int numSplits)` |
| `static boolean` | `isOriginal(Reader file)` |
| `boolean` | `shouldSkipCombine(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration conf)` |
| `boolean` | `validateInput(org.apache.hadoop.fs.FileSystem fs, HiveConf conf, ArrayList<org.apache.hadoop.fs.FileStatus> files)` - This method is used to validate the input files. |
**shouldSkipCombine**

```java
public boolean shouldSkipCombine(org.apache.hadoop.fs.Path path,
                                 org.apache.hadoop.conf.Configuration conf)
                          throws IOException
```

Specified by: `shouldSkipCombine` in interface `CombineHiveInputFormat.AvoidSplitCombination`
Throws: `IOException`

**createReaderFromFile**

```java
public static RecordReader createReaderFromFile(Reader file,
                                                org.apache.hadoop.conf.Configuration conf,
                                                long offset,
                                                long length)
                                         throws IOException
```

Throws: `IOException`

**isOriginal**

```java
public static boolean isOriginal(Reader file)
```
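As a sketch of these static helpers (the class name `ReadOrcFile` and the input path are hypothetical, and the `org.apache.hadoop.hive.ql.io.orc` package locations are assumed for a Hive 1.x/2.x build), a standalone program might open a file-level `Reader` via `OrcFile`, check `isOriginal`, and then scan a byte range with `createReaderFromFile`:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
import org.apache.hadoop.hive.ql.io.orc.Reader;
import org.apache.hadoop.hive.ql.io.orc.RecordReader;

public class ReadOrcFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/example.orc");   // hypothetical input file

        // File-level metadata reader.
        Reader file = OrcFile.createReader(path, OrcFile.readerOptions(conf));

        // True for pre-ACID ("original") files, false for ACID event files.
        System.out.println("original layout: " + OrcInputFormat.isOriginal(file));

        // Scan the byte range [offset, offset + length); Long.MAX_VALUE covers
        // the whole file.
        RecordReader rows =
            OrcInputFormat.createReaderFromFile(file, conf, 0L, Long.MAX_VALUE);
        Object row = null;
        while (rows.hasNext()) {
            row = rows.next(row);   // rows come back as OrcStruct
            System.out.println(row);
        }
        rows.close();
    }
}
```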
**genIncludedColumns**

```java
public static boolean[] genIncludedColumns(List<OrcProto.Type> types,
                                           List<Integer> included,
                                           boolean isOriginal)
```

**genIncludedColumns**

```java
public static boolean[] genIncludedColumns(List<OrcProto.Type> types,
                                           org.apache.hadoop.conf.Configuration conf,
                                           boolean isOriginal)
```

Take the configuration and figure out which columns we need to include.
Parameters:
types - the types for the file
conf - the configuration
isOriginal - is the file in the original format?

**getSargColumnNames**

```java
public static String[] getSargColumnNames(String[] originalColumnNames,
                                          List<OrcProto.Type> types,
                                          boolean[] includedColumns,
                                          boolean isOriginal)
```
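A small sketch of column projection with the list-based overload; `ProjectColumns` and the column choice are illustrative, and the type list would normally come from an ORC `Reader` (e.g. `Reader.getTypes()`):

```java
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
import org.apache.hadoop.hive.ql.io.orc.OrcProto;  // relocated to org.apache.orc in later versions

public class ProjectColumns {
    // types: the flattened ORC type tree, e.g. from Reader.getTypes().
    static void demo(List<OrcProto.Type> types) {
        // Select top-level columns 0 and 2 of an original-format file; the
        // returned array has one flag per type id, with the nested sub-types
        // of each selected column marked as well.
        boolean[] include =
            OrcInputFormat.genIncludedColumns(types, Arrays.asList(0, 2), true);
        System.out.println(Arrays.toString(include));
    }
}
```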
**validateInput**

```java
public boolean validateInput(org.apache.hadoop.fs.FileSystem fs,
                             HiveConf conf,
                             ArrayList<org.apache.hadoop.fs.FileStatus> files)
                      throws IOException
```

Description copied from interface: `InputFormatChecker`
This method is used to validate the input files.
Specified by: `validateInput` in interface `InputFormatChecker`
Throws: `IOException`

**getSplits**

```java
public org.apache.hadoop.mapred.InputSplit[] getSplits(org.apache.hadoop.mapred.JobConf job,
                                                       int numSplits)
                                                throws IOException
```

Specified by: `getSplits` in interface `org.apache.hadoop.mapred.InputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>`
Throws: `IOException`

**getRecordReader**

```java
public org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.NullWritable,OrcStruct> getRecordReader(org.apache.hadoop.mapred.InputSplit inputSplit,
                                                                                                           org.apache.hadoop.mapred.JobConf conf,
                                                                                                           org.apache.hadoop.mapred.Reporter reporter)
                                                                                                    throws IOException
```

Specified by: `getRecordReader` in interface `org.apache.hadoop.mapred.InputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>`
Throws: `IOException`
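In the classic (non-ACID) path these two methods drive the class like any mapred `InputFormat`. A minimal whole-table scan, assuming a hypothetical input directory and the illustrative class name `ScanOrcTable`, might look like:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class ScanOrcTable {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf();
        FileInputFormat.setInputPaths(job, new Path("/warehouse/t"));  // hypothetical

        OrcInputFormat format = new OrcInputFormat();
        // Split the input, then read each split with NullWritable keys.
        for (InputSplit split : format.getSplits(job, 1)) {
            RecordReader<NullWritable, OrcStruct> reader =
                format.getRecordReader(split, job, Reporter.NULL);
            NullWritable key = reader.createKey();
            OrcStruct value = reader.createValue();
            while (reader.next(key, value)) {
                System.out.println(value);
            }
            reader.close();
        }
    }
}
```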
**getReader**

```java
public AcidInputFormat.RowReader<OrcStruct> getReader(org.apache.hadoop.mapred.InputSplit inputSplit,
                                                      AcidInputFormat.Options options)
                                               throws IOException
```

Description copied from interface: `AcidInputFormat`
Get a record reader that provides the user-facing view of the data after it has been merged together.
Specified by: `getReader` in interface `AcidInputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>`
Parameters:
inputSplit - the split to read
options - the options to read with
Throws: `IOException`
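A sketch of the merged, user-facing read path; the `MergedScan` class is illustrative, and the `AcidInputFormat.Options(conf)` constructor with its `reporter(...)` builder method reflects my reading of `AcidInputFormat.Options` and should be treated as an assumption:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.ql.io.AcidInputFormat;
import org.apache.hadoop.hive.ql.io.RecordIdentifier;
import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.Reporter;

public class MergedScan {
    // Reads one split's merged view: deltas already applied, deleted rows gone.
    static void scan(OrcInputFormat format, InputSplit split, Configuration conf)
            throws Exception {
        AcidInputFormat.RowReader<OrcStruct> rows = format.getReader(
            split,
            new AcidInputFormat.Options(conf).reporter(Reporter.NULL)); // assumed builder style
        RecordIdentifier id = rows.createKey();  // RowReader keys are RecordIdentifiers
        OrcStruct row = rows.createValue();
        while (rows.next(id, row)) {
            System.out.println(id + " -> " + row);
        }
        rows.close();
    }
}
```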
**getRawReader**

```java
public AcidInputFormat.RawReader<OrcStruct> getRawReader(org.apache.hadoop.conf.Configuration conf,
                                                         boolean collapseEvents,
                                                         int bucket,
                                                         ValidTxnList validTxnList,
                                                         org.apache.hadoop.fs.Path baseDirectory,
                                                         org.apache.hadoop.fs.Path[] deltaDirectory)
                                                  throws IOException
```

Description copied from interface: `AcidInputFormat`
Get a reader that returns the raw ACID events (insert, update, delete).
Specified by: `getRawReader` in interface `AcidInputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>`
Parameters:
conf - the configuration
collapseEvents - should the ACID events be collapsed so that only the last version of the row is kept.
bucket - the bucket to read
validTxnList - the list of valid transactions to use
baseDirectory - the base directory to read or the root directory for old style files
deltaDirectory - a list of delta files to include in the merge
Throws: `IOException`
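For compaction-style consumers the raw event stream can be read directly. A sketch, where `DumpAcidEvents` and the base/delta directory names are hypothetical, and `ValidReadTxnList`'s no-argument constructor (which treats all transactions as valid) is used for brevity:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.common.ValidReadTxnList;
import org.apache.hadoop.hive.ql.io.AcidInputFormat;
import org.apache.hadoop.hive.ql.io.RecordIdentifier;
import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
import org.apache.hadoop.hive.ql.io.orc.OrcStruct;

public class DumpAcidEvents {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical ACID table layout: one base directory plus deltas.
        Path base = new Path("/warehouse/t/base_0000005");
        Path[] deltas = { new Path("/warehouse/t/delta_0000006_0000007") };

        AcidInputFormat.RawReader<OrcStruct> reader = new OrcInputFormat()
            .getRawReader(conf,
                          true,                   // keep only the last version of each row
                          0,                      // bucket 0
                          new ValidReadTxnList(), // treat all transactions as valid
                          base, deltas);

        RecordIdentifier id = reader.createKey();
        OrcStruct event = reader.createValue();
        while (reader.next(id, event)) {
            System.out.println(id + " -> " + event);
        }
        reader.close();
    }
}
```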