```java
public class OrcInputFormat
extends Object
implements org.apache.hadoop.mapred.InputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>,
           InputFormatChecker, VectorizedInputFormatInterface,
           AcidInputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>,
           CombineHiveInputFormat.AvoidSplitCombination
```
This class implements both the classic InputFormat, which stores the rows directly, and AcidInputFormat, which stores a series of events with the following schema:
```java
class AcidEvent<ROW> {
    enum ACTION {INSERT, UPDATE, DELETE}
    ACTION operation;
    long originalTransaction;
    int bucket;
    long rowId;
    long currentTransaction;
    ROW row;
}
```
Each AcidEvent object corresponds to a single row-level event. Together, the
originalTransaction, bucket, and rowId form the unique identifier for the row.
The operation and currentTransaction record which operation produced the event
and which transaction added it. Insert and update events include the entire
row, while delete events have null for row.
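To make the identity and versioning rules concrete, here is a minimal, illustrative sketch; the `RowVersion` class and its comparator are invented for exposition and are not part of the Hive API:

```java
import java.util.Comparator;

// Illustrative only: models the rules described above, not a Hive class.
class RowVersion {
    long originalTransaction;   // together with bucket and rowId:
    int bucket;                 //   the unique identifier of the row
    long rowId;
    long currentTransaction;    // the transaction that added this event

    // Events sharing (originalTransaction, bucket, rowId) are versions of the
    // same row; among them, the highest currentTransaction is the latest.
    static final Comparator<RowVersion> ORDER =
        Comparator.<RowVersion>comparingLong(v -> v.originalTransaction)
                  .thenComparingInt(v -> v.bucket)
                  .thenComparingLong(v -> v.rowId)
                  .thenComparingLong(v -> v.currentTransaction);
}
```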
| Modifier and Type | Class and Description |
|---|---|
| `static class` | `OrcInputFormat.NullKeyRecordReader` - Return a RecordReader that is compatible with the Hive 0.12 reader with NullWritable for the key instead of RecordIdentifier. |
Nested classes/interfaces inherited from interface AcidInputFormat: `AcidInputFormat.AcidRecordReader<K,V>`, `AcidInputFormat.Options`, `AcidInputFormat.RawReader<V>`, `AcidInputFormat.RowReader<V>`

| Constructor and Description |
|---|
| `OrcInputFormat()` |
| Modifier and Type | Method and Description |
|---|---|
| `static RecordReader` | `createReaderFromFile(Reader file, org.apache.hadoop.conf.Configuration conf, long offset, long length)` |
| `static boolean[]` | `genIncludedColumns(List<OrcProto.Type> types, org.apache.hadoop.conf.Configuration conf, boolean isOriginal)` - Take the configuration and figure out which columns we need to include. |
| `static boolean[]` | `genIncludedColumns(List<OrcProto.Type> types, List<Integer> included, boolean isOriginal)` |
| `AcidInputFormat.RawReader<OrcStruct>` | `getRawReader(org.apache.hadoop.conf.Configuration conf, boolean collapseEvents, int bucket, ValidTxnList validTxnList, org.apache.hadoop.fs.Path baseDirectory, org.apache.hadoop.fs.Path[] deltaDirectory)` - Get a reader that returns the raw ACID events (insert, update, delete). |
| `AcidInputFormat.RowReader<OrcStruct>` | `getReader(org.apache.hadoop.mapred.InputSplit inputSplit, AcidInputFormat.Options options)` - Get a record reader that provides the user-facing view of the data after it has been merged together. |
| `org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.NullWritable,OrcStruct>` | `getRecordReader(org.apache.hadoop.mapred.InputSplit inputSplit, org.apache.hadoop.mapred.JobConf conf, org.apache.hadoop.mapred.Reporter reporter)` |
| `static String[]` | `getSargColumnNames(String[] originalColumnNames, List<OrcProto.Type> types, boolean[] includedColumns, boolean isOriginal)` |
| `org.apache.hadoop.mapred.InputSplit[]` | `getSplits(org.apache.hadoop.mapred.JobConf job, int numSplits)` |
| `static boolean` | `isOriginal(Reader file)` |
| `boolean` | `shouldSkipCombine(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration conf)` |
| `boolean` | `validateInput(org.apache.hadoop.fs.FileSystem fs, HiveConf conf, ArrayList<org.apache.hadoop.fs.FileStatus> files)` - This method is used to validate the input files. |
**shouldSkipCombine**

```java
public boolean shouldSkipCombine(org.apache.hadoop.fs.Path path,
                                 org.apache.hadoop.conf.Configuration conf)
                          throws IOException
```

Specified by: `shouldSkipCombine` in interface `CombineHiveInputFormat.AvoidSplitCombination`
Throws: `IOException`

**createReaderFromFile**

```java
public static RecordReader createReaderFromFile(Reader file,
                                                org.apache.hadoop.conf.Configuration conf,
                                                long offset,
                                                long length)
                                         throws IOException
```

Throws: `IOException`

**isOriginal**

```java
public static boolean isOriginal(Reader file)
```
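As a sketch of these static helpers (the class name `ReadOrcFile` and the input path are hypothetical, and the `org.apache.hadoop.hive.ql.io.orc` package locations are assumed for a Hive 1.x/2.x build), a standalone program might open a file-level `Reader` via `OrcFile`, check `isOriginal`, and then scan a byte range with `createReaderFromFile`:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
import org.apache.hadoop.hive.ql.io.orc.Reader;
import org.apache.hadoop.hive.ql.io.orc.RecordReader;

public class ReadOrcFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/example.orc");   // hypothetical input file

        // File-level metadata reader.
        Reader file = OrcFile.createReader(path, OrcFile.readerOptions(conf));

        // True for pre-ACID ("original") files, false for ACID event files.
        System.out.println("original layout: " + OrcInputFormat.isOriginal(file));

        // Scan the byte range [offset, offset + length); Long.MAX_VALUE covers
        // the whole file.
        RecordReader rows =
            OrcInputFormat.createReaderFromFile(file, conf, 0L, Long.MAX_VALUE);
        Object row = null;
        while (rows.hasNext()) {
            row = rows.next(row);   // rows come back as OrcStruct
            System.out.println(row);
        }
        rows.close();
    }
}
```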
**genIncludedColumns**

```java
public static boolean[] genIncludedColumns(List<OrcProto.Type> types,
                                           List<Integer> included,
                                           boolean isOriginal)
```

**genIncludedColumns**

```java
public static boolean[] genIncludedColumns(List<OrcProto.Type> types,
                                           org.apache.hadoop.conf.Configuration conf,
                                           boolean isOriginal)
```

Take the configuration and figure out which columns we need to include.
Parameters:
types - the types for the file
conf - the configuration
isOriginal - is the file in the original format?

**getSargColumnNames**

```java
public static String[] getSargColumnNames(String[] originalColumnNames,
                                          List<OrcProto.Type> types,
                                          boolean[] includedColumns,
                                          boolean isOriginal)
```
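A small sketch of column projection with the list-based overload; `ProjectColumns` and the column choice are illustrative, and the type list would normally come from an ORC `Reader` (e.g. `Reader.getTypes()`):

```java
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
import org.apache.hadoop.hive.ql.io.orc.OrcProto;  // relocated to org.apache.orc in later versions

public class ProjectColumns {
    // types: the flattened ORC type tree, e.g. from Reader.getTypes().
    static void demo(List<OrcProto.Type> types) {
        // Select top-level columns 0 and 2 of an original-format file; the
        // returned array has one flag per type id, with the nested sub-types
        // of each selected column marked as well.
        boolean[] include =
            OrcInputFormat.genIncludedColumns(types, Arrays.asList(0, 2), true);
        System.out.println(Arrays.toString(include));
    }
}
```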
**validateInput**

```java
public boolean validateInput(org.apache.hadoop.fs.FileSystem fs,
                             HiveConf conf,
                             ArrayList<org.apache.hadoop.fs.FileStatus> files)
                      throws IOException
```

Description copied from interface: `InputFormatChecker`
This method is used to validate the input files.
Specified by: `validateInput` in interface `InputFormatChecker`
Throws: `IOException`

**getSplits**

```java
public org.apache.hadoop.mapred.InputSplit[] getSplits(org.apache.hadoop.mapred.JobConf job,
                                                       int numSplits)
                                                throws IOException
```

Specified by: `getSplits` in interface `org.apache.hadoop.mapred.InputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>`
Throws: `IOException`

**getRecordReader**

```java
public org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.NullWritable,OrcStruct> getRecordReader(org.apache.hadoop.mapred.InputSplit inputSplit,
                                                                                                           org.apache.hadoop.mapred.JobConf conf,
                                                                                                           org.apache.hadoop.mapred.Reporter reporter)
                                                                                                    throws IOException
```

Specified by: `getRecordReader` in interface `org.apache.hadoop.mapred.InputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>`
Throws: `IOException`
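In the classic (non-ACID) path these two methods drive the class like any mapred `InputFormat`. A minimal whole-table scan, assuming a hypothetical input directory and the illustrative class name `ScanOrcTable`, might look like:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class ScanOrcTable {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf();
        FileInputFormat.setInputPaths(job, new Path("/warehouse/t"));  // hypothetical

        OrcInputFormat format = new OrcInputFormat();
        // Split the input, then read each split with NullWritable keys.
        for (InputSplit split : format.getSplits(job, 1)) {
            RecordReader<NullWritable, OrcStruct> reader =
                format.getRecordReader(split, job, Reporter.NULL);
            NullWritable key = reader.createKey();
            OrcStruct value = reader.createValue();
            while (reader.next(key, value)) {
                System.out.println(value);
            }
            reader.close();
        }
    }
}
```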
**getReader**

```java
public AcidInputFormat.RowReader<OrcStruct> getReader(org.apache.hadoop.mapred.InputSplit inputSplit,
                                                      AcidInputFormat.Options options)
                                               throws IOException
```

Description copied from interface: `AcidInputFormat`
Get a record reader that provides the user-facing view of the data after it has been merged together.
Specified by: `getReader` in interface `AcidInputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>`
Parameters:
inputSplit - the split to read
options - the options to read with
Throws: `IOException`
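A sketch of the merged, user-facing read path; the `MergedScan` class is illustrative, and the `AcidInputFormat.Options(conf)` constructor with its `reporter(...)` builder method reflects my reading of `AcidInputFormat.Options` and should be treated as an assumption:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.ql.io.AcidInputFormat;
import org.apache.hadoop.hive.ql.io.RecordIdentifier;
import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.Reporter;

public class MergedScan {
    // Reads one split's merged view: deltas already applied, deleted rows gone.
    static void scan(OrcInputFormat format, InputSplit split, Configuration conf)
            throws Exception {
        AcidInputFormat.RowReader<OrcStruct> rows = format.getReader(
            split,
            new AcidInputFormat.Options(conf).reporter(Reporter.NULL)); // assumed builder style
        RecordIdentifier id = rows.createKey();  // RowReader keys are RecordIdentifiers
        OrcStruct row = rows.createValue();
        while (rows.next(id, row)) {
            System.out.println(id + " -> " + row);
        }
        rows.close();
    }
}
```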
**getRawReader**

```java
public AcidInputFormat.RawReader<OrcStruct> getRawReader(org.apache.hadoop.conf.Configuration conf,
                                                         boolean collapseEvents,
                                                         int bucket,
                                                         ValidTxnList validTxnList,
                                                         org.apache.hadoop.fs.Path baseDirectory,
                                                         org.apache.hadoop.fs.Path[] deltaDirectory)
                                                  throws IOException
```

Description copied from interface: `AcidInputFormat`
Get a reader that returns the raw ACID events (insert, update, delete).
Specified by: `getRawReader` in interface `AcidInputFormat<org.apache.hadoop.io.NullWritable,OrcStruct>`
Parameters:
conf - the configuration
collapseEvents - should the ACID events be collapsed so that only the last version of the row is kept.
bucket - the bucket to read
validTxnList - the list of valid transactions to use
baseDirectory - the base directory to read or the root directory for old style files
deltaDirectory - a list of delta files to include in the merge
Throws: `IOException`
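For compaction-style consumers the raw event stream can be read directly. A sketch, where `DumpAcidEvents` and the base/delta directory names are hypothetical, and `ValidReadTxnList`'s no-argument constructor (which treats all transactions as valid) is used for brevity:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.common.ValidReadTxnList;
import org.apache.hadoop.hive.ql.io.AcidInputFormat;
import org.apache.hadoop.hive.ql.io.RecordIdentifier;
import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
import org.apache.hadoop.hive.ql.io.orc.OrcStruct;

public class DumpAcidEvents {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical ACID table layout: one base directory plus deltas.
        Path base = new Path("/warehouse/t/base_0000005");
        Path[] deltas = { new Path("/warehouse/t/delta_0000006_0000007") };

        AcidInputFormat.RawReader<OrcStruct> reader = new OrcInputFormat()
            .getRawReader(conf,
                          true,                   // keep only the last version of each row
                          0,                      // bucket 0
                          new ValidReadTxnList(), // treat all transactions as valid
                          base, deltas);

        RecordIdentifier id = reader.createKey();
        OrcStruct event = reader.createValue();
        while (reader.next(id, event)) {
            System.out.println(id + " -> " + event);
        }
        reader.close();
    }
}
```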