Package org.apache.hadoop.mapred.join

Given a set of sorted datasets keyed with the same class and yielding equal partitions, it is possible to effect a join of those datasets prior to the map. This could save costs in re-partitioning, sorting, shuffling, and writing out data required in the general case.
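The reason sorted, identically partitioned inputs make this possible can be illustrated with a plain merge join. The following is only a minimal sketch in standard Java, not the package's implementation: the class name `SortedMergeJoinSketch`, the array-based inputs, and the assumption of unique keys per input are all hypothetical simplifications.

```java
import java.util.ArrayList;
import java.util.List;

final class SortedMergeJoinSketch {
    /**
     * Inner-joins two inputs that are already sorted by key.
     * Assumes (for brevity) that each key appears at most once per input.
     * Each input is scanned once, front to back; no re-sort or shuffle is needed.
     */
    static List<String> join(int[] leftKeys, String[] leftVals,
                             int[] rightKeys, String[] rightVals) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < leftKeys.length && j < rightKeys.length) {
            int cmp = Integer.compare(leftKeys[i], rightKeys[j]);
            if (cmp < 0) {
                i++;                       // key only on the left: skip it
            } else if (cmp > 0) {
                j++;                       // key only on the right: skip it
            } else {
                // Same key on both sides: emit one joined tuple.
                out.add(leftKeys[i] + ":[" + leftVals[i] + "," + rightVals[j] + "]");
                i++;
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> joined = join(
                new int[] {1, 3, 5}, new String[] {"a", "b", "c"},
                new int[] {3, 4, 5}, new String[] {"x", "y", "z"});
        System.out.println(joined);
    }
}
```

Because both inputs advance monotonically, the cost is a single linear pass per partition, which is where the savings over the general re-partition and sort come from.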

Interface

The classes in this package offer the following interface to their users.

property                     required  value
mapred.join.expr             yes       Join expression to effect over input data
mapred.join.keycomparator    no        WritableComparator class to use for comparing keys
mapred.join.define.<ident>   no        Class mapped to identifier in join expression

The join expression understands the following grammar:

func ::= <ident>([<func>,]*<func>)
func ::= tbl(<class>,"<path>");
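To make the grammar concrete, here is a small recursive-descent checker written in standard Java. It is an illustrative sketch only and is not the parser used by this package: the class name `JoinExprSketch` and the method `countTables` are hypothetical, whitespace between tokens is assumed to be insignificant, and the trailing `;` shown in the second production is assumed not to be part of the expression.

```java
final class JoinExprSketch {
    private final String s;
    private int pos = 0;
    private int tables = 0;

    private JoinExprSketch(String s) { this.s = s; }

    /** Returns the number of tbl(...) leaves, or throws if the text does not match. */
    static int countTables(String expr) {
        JoinExprSketch p = new JoinExprSketch(expr);
        p.func();
        if (p.peek() != '\0') throw p.error("unexpected trailing input");
        return p.tables;
    }

    // func ::= <ident>([<func>,]*<func>)   |   tbl(<class>,"<path>")
    private void func() {
        String ident = token(false);
        expect('(');
        if (ident.equals("tbl")) {
            token(true);                       // <class>: dotted name
            expect(',');
            expect('"');
            int end = s.indexOf('"', pos);     // <path>: everything up to the closing quote
            if (end < 0) throw error("unterminated path");
            pos = end + 1;
            tables++;
        } else {
            func();                            // at least one nested func is required
            while (peek() == ',') {
                pos++;
                func();
            }
        }
        expect(')');
    }

    /** Reads an identifier; dots are accepted only for class names. */
    private String token(boolean allowDots) {
        peek();                                // skips leading whitespace
        int start = pos;
        while (pos < s.length()
                && (Character.isJavaIdentifierPart(s.charAt(pos))
                    || (allowDots && s.charAt(pos) == '.'))) {
            pos++;
        }
        if (pos == start) throw error("expected a name");
        return s.substring(start, pos);
    }

    /** Skips whitespace and returns the next character, or '\0' at end of input. */
    private char peek() {
        while (pos < s.length() && Character.isWhitespace(s.charAt(pos))) pos++;
        return pos < s.length() ? s.charAt(pos) : '\0';
    }

    private void expect(char c) {
        if (peek() != c) throw error("expected '" + c + "'");
        pos++;
    }

    private IllegalArgumentException error(String msg) {
        return new IllegalArgumentException(msg + " at offset " + pos);
    }

    public static void main(String[] args) {
        String expr = "inner(tbl(a.B.class,\"hdfs://host:8020/foo/bar\"),"
                    + "tbl(a.B.class,\"hdfs://host:8020/foo/baz\"))";
        System.out.println(countTables(expr));
    }
}
```

Note that an identifier with an empty argument list, such as `inner()`, is rejected, since the first production requires at least one nested `func`.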

Operations included in this package fall into one of two types: join operations, which emit tuples, and "multi-filter" operations, which emit a single value from (but not necessarily included in) a set of input values. For a given key, each operation will consider the cross product of all values for all sources at that node.

Identifiers supported by default:

identifier  type         description
inner       Join         Full inner join
outer       Join         Full outer join
override    MultiFilter  For a given key, prefer values from the rightmost source
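The behavior of the three default identifiers for a single key can be approximated in plain Java as follows. This is a hedged illustration of the semantics described above, not the package's code: `JoinSemanticsSketch`, `cross`, and `override` are hypothetical names, values are modeled as strings, and a source that lacks the key is modeled as an empty list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class JoinSemanticsSketch {
    /**
     * Cross product of the values each source holds for one key.
     * inner (outer == false): no output unless every source has the key.
     * outer (outer == true): a missing source contributes a null slot.
     */
    static List<List<String>> cross(List<List<String>> sources, boolean outer) {
        List<List<String>> tuples = new ArrayList<>();
        tuples.add(new ArrayList<String>());
        boolean anyPresent = false;
        for (List<String> src : sources) {
            List<String> slot = src;
            if (src.isEmpty()) {
                if (!outer) {
                    return new ArrayList<>();          // inner join: key is dropped
                }
                slot = Collections.<String>singletonList(null);
            } else {
                anyPresent = true;
            }
            List<List<String>> next = new ArrayList<>();
            for (List<String> partial : tuples) {
                for (String v : slot) {
                    List<String> extended = new ArrayList<>(partial);
                    extended.add(v);
                    next.add(extended);
                }
            }
            tuples = next;
        }
        if (!anyPresent) {
            return new ArrayList<>();                  // no source holds the key
        }
        return tuples;
    }

    /** override: the values of the rightmost source that has the key. */
    static List<String> override(List<List<String>> sources) {
        for (int i = sources.size() - 1; i >= 0; i--) {
            if (!sources.get(i).isEmpty()) {
                return sources.get(i);
            }
        }
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        List<List<String>> sources = Arrays.asList(
                Arrays.asList("a1", "a2"),
                Collections.<String>emptyList(),
                Arrays.asList("c1"));
        System.out.println("inner:    " + cross(sources, false));
        System.out.println("outer:    " + cross(sources, true));
        System.out.println("override: " + override(sources));
    }
}
```

In this sketch the join operations produce one tuple per combination of values, while the multi-filter operation produces plain values rather than tuples.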

A user of this class must set the InputFormat for the job to CompositeInputFormat and define a join expression accepted by the preceding grammar. For example, both of the following are acceptable:

inner(tbl(org.apache.hadoop.mapred.SequenceFileInputFormat.class,
          "hdfs://host:8020/foo/bar"),
      tbl(org.apache.hadoop.mapred.SequenceFileInputFormat.class,
          "hdfs://host:8020/foo/baz"))

outer(override(tbl(org.apache.hadoop.mapred.SequenceFileInputFormat.class,
                   "hdfs://host:8020/foo/bar"),
               tbl(org.apache.hadoop.mapred.SequenceFileInputFormat.class,
                   "hdfs://host:8020/foo/baz")),
      tbl(org.apache.hadoop.mapred.SequenceFileInputFormat.class,
          "hdfs://host:8020/foo/rab"))

CompositeInputFormat includes a handful of convenience methods to aid construction of these verbose statements.
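To show how such expressions can be assembled programmatically, the sketch below builds them with two small string helpers in standard Java. These helpers are hypothetical stand-ins written for illustration; they are not the convenience methods of CompositeInputFormat, and the names `JoinExprBuilderSketch`, `tbl`, and `op` are assumptions.

```java
final class JoinExprBuilderSketch {
    /** Builds a leaf: tbl(<class>,"<path>"). The class token is passed through verbatim. */
    static String tbl(String classToken, String path) {
        return "tbl(" + classToken + ",\"" + path + "\")";
    }

    /** Builds <ident>(<func>,...,<func>); the grammar requires at least one argument. */
    static String op(String ident, String... funcs) {
        if (funcs.length == 0) {
            throw new IllegalArgumentException("an operation needs at least one argument");
        }
        return ident + "(" + String.join(",", funcs) + ")";
    }

    public static void main(String[] args) {
        String fmt = "org.apache.hadoop.mapred.SequenceFileInputFormat.class";
        // Nested expression equivalent in structure to the second example above.
        String expr = op("outer",
                op("override",
                        tbl(fmt, "hdfs://host:8020/foo/bar"),
                        tbl(fmt, "hdfs://host:8020/foo/baz")),
                tbl(fmt, "hdfs://host:8020/foo/rab"));
        System.out.println(expr);
    }
}
```

The resulting string would then be supplied as the value of the mapred.join.expr property.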

As in the second example, joins may be nested. Users may provide a comparator class in the mapred.join.keycomparator property to specify the ordering of their keys, or accept the default comparator as returned by WritableComparator.get(keyclass).

Users can specify their own join operations, typically by overriding JoinRecordReader or MultiFilterRecordReader and mapping that class to an identifier in the join expression using the mapred.join.define.ident property, where ident is the identifier appearing in the join expression. Users may elect to emit or modify values passing through their join operation. Consulting the existing operations for guidance is recommended. Adding arguments is considerably more complex (and only partially supported), as one must also add a Node type to the parse tree. One is probably better off extending RecordReader in most cases.

Copyright © 2009 The Apache Software Foundation