public class HadoopMapReduceCommitProtocol extends FileCommitProtocol implements scala.Serializable, Logging
FileCommitProtocol implementation backed by an underlying Hadoop OutputCommitter
(from the newer mapreduce API, not the old mapred API).
Unlike Hadoop's OutputCommitter, this implementation is serializable.
param: jobId the job's or stage's id param: path the job's output path, or null if committer acts as a noop param: dynamicPartitionOverwrite If true, Spark will overwrite partition directories at runtime dynamically, i.e., we first write files under a staging directory with partition path, e.g. /path/to/staging/a=1/b=1/xxx.parquet. When committing the job, we first clean up the corresponding partition directories at destination path, e.g. /path/to/destination/a=1/b=1, and move files from staging directory to the corresponding partition directories under destination path.
FileCommitProtocol.EmptyTaskCommitMessage$, FileCommitProtocol.TaskCommitMessage| Constructor and Description |
|---|
HadoopMapReduceCommitProtocol(String jobId,
String path,
boolean dynamicPartitionOverwrite) |
| Modifier and Type | Method and Description |
|---|---|
void |
abortJob(org.apache.hadoop.mapreduce.JobContext jobContext)
Aborts a job after the writes fail.
|
void |
abortTask(org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
Aborts a task after the writes have failed.
|
void |
commitJob(org.apache.hadoop.mapreduce.JobContext jobContext,
scala.collection.Seq<FileCommitProtocol.TaskCommitMessage> taskCommits)
Commits a job after the writes succeed.
|
FileCommitProtocol.TaskCommitMessage |
commitTask(org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
Commits a task after the writes succeed.
|
static boolean |
deleteWithJob(org.apache.hadoop.fs.FileSystem fs,
org.apache.hadoop.fs.Path path,
boolean recursive) |
String |
newTaskTempFile(org.apache.hadoop.mapreduce.TaskAttemptContext taskContext,
scala.Option<String> dir,
String ext)
Notifies the commit protocol to add a new file, and gets back the full path that should be
used.
|
String |
newTaskTempFileAbsPath(org.apache.hadoop.mapreduce.TaskAttemptContext taskContext,
String absoluteDir,
String ext)
Similar to newTaskTempFile(), but allows files to committed to an absolute output location.
|
static void |
onTaskCommit(FileCommitProtocol.TaskCommitMessage taskCommit) |
void |
setupJob(org.apache.hadoop.mapreduce.JobContext jobContext)
Setups up a job.
|
void |
setupTask(org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
Sets up a task within a job.
|
deleteWithJob, instantiate, onTaskCommitequals, getClass, hashCode, notify, notifyAll, toString, wait, wait, waitinitializeLogging, initializeLogIfNecessary, initializeLogIfNecessary, isTraceEnabled, log_, log, logDebug, logDebug, logError, logError, logInfo, logInfo, logName, logTrace, logTrace, logWarning, logWarningpublic HadoopMapReduceCommitProtocol(String jobId,
String path,
boolean dynamicPartitionOverwrite)
public static boolean deleteWithJob(org.apache.hadoop.fs.FileSystem fs,
org.apache.hadoop.fs.Path path,
boolean recursive)
public static void onTaskCommit(FileCommitProtocol.TaskCommitMessage taskCommit)
public String newTaskTempFile(org.apache.hadoop.mapreduce.TaskAttemptContext taskContext,
scala.Option<String> dir,
String ext)
FileCommitProtocolNote that the returned temp file may have an arbitrary path. The commit protocol only promises that the file will be at the location specified by the arguments after job commit.
A full file path consists of the following parts: 1. the base path 2. some sub-directory within the base path, used to specify partitioning 3. file prefix, usually some unique job id with the task id 4. bucket id 5. source specific file extension, e.g. ".snappy.parquet"
The "dir" parameter specifies 2, and "ext" parameter specifies both 4 and 5, and the rest are left to the commit protocol implementation to decide.
Important: it is the caller's responsibility to add uniquely identifying content to "ext" if a task is going to write out multiple files to the same dir. The file commit protocol only guarantees that files written by different tasks will not conflict.
newTaskTempFile in class FileCommitProtocoltaskContext - (undocumented)dir - (undocumented)ext - (undocumented)public String newTaskTempFileAbsPath(org.apache.hadoop.mapreduce.TaskAttemptContext taskContext,
String absoluteDir,
String ext)
FileCommitProtocolImportant: it is the caller's responsibility to add uniquely identifying content to "ext" if a task is going to write out multiple files to the same dir. The file commit protocol only guarantees that files written by different tasks will not conflict.
newTaskTempFileAbsPath in class FileCommitProtocoltaskContext - (undocumented)absoluteDir - (undocumented)ext - (undocumented)public void setupJob(org.apache.hadoop.mapreduce.JobContext jobContext)
FileCommitProtocolsetupJob in class FileCommitProtocoljobContext - (undocumented)public void commitJob(org.apache.hadoop.mapreduce.JobContext jobContext,
scala.collection.Seq<FileCommitProtocol.TaskCommitMessage> taskCommits)
FileCommitProtocolcommitJob in class FileCommitProtocoljobContext - (undocumented)taskCommits - (undocumented)public void abortJob(org.apache.hadoop.mapreduce.JobContext jobContext)
FileCommitProtocolCalling this function is a best-effort attempt, because it is possible that the driver just crashes (or killed) before it can call abort.
abortJob in class FileCommitProtocoljobContext - (undocumented)public void setupTask(org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
FileCommitProtocolsetupTask in class FileCommitProtocoltaskContext - (undocumented)public FileCommitProtocol.TaskCommitMessage commitTask(org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
FileCommitProtocolcommitTask in class FileCommitProtocoltaskContext - (undocumented)public void abortTask(org.apache.hadoop.mapreduce.TaskAttemptContext taskContext)
FileCommitProtocolCalling this function is a best-effort attempt, because it is possible that the executor just crashes (or killed) before it can call abort.
abortTask in class FileCommitProtocoltaskContext - (undocumented)