A transform that performs equijoins across multiple schema
PCollections.
This transform is similar to CoGroupByKey, but it works on PCollections that
have schemas, which lets users simply specify the schema fields to join on. The
output type of the transform is a KV<Row, Row>: the key represents the fields
that were joined on, and the value contains one field for every input
PCollection. By default the cross product is not expanded, so all fields in the
value row are array fields.
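The shape of this output can be illustrated without Beam. The following is a minimal plain-Java sketch of the grouping semantics only, not Beam's implementation: rows are modeled as string maps, and the class name CoGroupSketch and method coGroup are hypothetical. Each distinct combination of key-field values maps to one list of rows per input, and the cross product is left unexpanded.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CoGroupSketch {
  /**
   * Groups rows from several inputs by the values of the given key fields.
   * The result maps each key (the key-field values, in order) to one list of
   * rows per input; an input with no matching rows contributes an empty list.
   */
  public static Map<List<String>, List<List<Map<String, String>>>> coGroup(
      List<List<Map<String, String>>> inputs, String... keyFields) {
    Map<List<String>, List<List<Map<String, String>>>> grouped = new LinkedHashMap<>();
    for (int i = 0; i < inputs.size(); i++) {
      for (Map<String, String> row : inputs.get(i)) {
        // Build the join key from the requested fields of this row.
        List<String> key = new ArrayList<>();
        for (String field : keyFields) {
          key.add(row.get(field));
        }
        // Lazily create one empty slot per input the first time a key is seen.
        List<List<Map<String, String>>> slots =
            grouped.computeIfAbsent(
                key,
                k -> {
                  List<List<Map<String, String>>> empty = new ArrayList<>();
                  for (int j = 0; j < inputs.size(); j++) {
                    empty.add(new ArrayList<>());
                  }
                  return empty;
                });
        slots.get(i).add(row);
      }
    }
    return grouped;
  }
}
```

In this sketch the per-input lists play the role of the array fields in the value row: nothing is paired up yet, so callers remain free to decide how to combine them.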
For example, the following demonstrates joining three PCollections on the "user" and "country"
fields.
TupleTag input1Tag = new TupleTag<>("input1");
In the above case, the key schema will contain the two string fields "user" and "country", so
the schemas for Input1, Input2, and Input3 must all have fields named "user" and
"country". The value schema will contain three array-of-Row fields named "input1", "input2", and
"input3". The value Row contains all elements that arrived on any of the inputs for that key.
Standard join types (inner join, outer join, etc.) can be accomplished by expanding the cross
product of these arrays in various ways.
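To make "expanding the cross product" concrete, here is a minimal plain-Java sketch for two inputs. It is an illustration under stated assumptions rather than Beam's own expansion logic: the class ExpandJoinSketch and its methods are hypothetical, and each argument stands for one of the per-key arrays in the value row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ExpandJoinSketch {
  /**
   * Inner join for a single key: emits every (left, right) pair.
   * If either side is empty, the result is empty.
   */
  public static <L, R> List<List<Object>> innerJoin(List<L> left, List<R> right) {
    List<List<Object>> out = new ArrayList<>();
    for (L l : left) {
      for (R r : right) {
        out.add(Arrays.<Object>asList(l, r));
      }
    }
    return out;
  }

  /**
   * Left outer join for a single key: same as innerJoin, except that an empty
   * right side is treated as a single null so every left element is still emitted.
   */
  public static <L, R> List<List<Object>> leftOuterJoin(List<L> left, List<R> right) {
    List<R> padded = right.isEmpty() ? Collections.<R>singletonList(null) : right;
    return innerJoin(left, padded);
  }
}
```

The only difference between the two join types in this sketch is how an empty array is handled, which is why leaving the cross product unexpanded keeps all of them available.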
In other words, the key schema is convertible to the following POJO:
@DefaultSchema(JavaFieldSchema.class)
public class JoinedKey {
  public String user;
  public String country;
}

PCollection<JoinedKey> keys = joined
    .apply(Keys.create())
    .apply(Convert.to(JoinedKey.class));
The value schema is convertible to the following POJO:
@DefaultSchema(JavaFieldSchema.class)
public class JoinedValue {
  // The lists below contain all values from each of the three inputs that match on the given
  // key.
  public List input1;
  public List input2;
  public List input3;
}

PCollection<JoinedValue> values = joined
    .apply(Values.create())
    .apply(Convert.to(JoinedValue.class));
It's also possible to join between different fields in two inputs, as long as the types of
those fields match. In this case, fields must be specified for every input PCollection. For
example:
PCollection<KV<Row, Row>> joined = PCollectionTuple