Execution planner used by
AbstractPartitionCollapsingIncrementalJob and its derived classes.
This creates a plan to process partitioned input data and collapse the partitions into a single output.
To use this class, the input and output paths must be specified. In addition, the desired input date
range can be specified through several methods. Then
#createPlan() can be called and the
execution plan will be created. The inputs to process will be available from
#getInputsToProcess(),
the number of reducers to use will be available from
#getNumReducers(), and the input schemas
will be available from
#getInputSchemas().
Previous output may be reused by calling
#setReusePreviousOutput(boolean). If previous output exists
and it is to be reused then it will be available from
#getPreviousOutputToProcess(). New input data
to process that is after the previous output time range is available from
#getNewInputsToProcess().
Old input data to process that is before the previous output time range and should be subtracted from the
previous output is available from
#getOldInputsToProcess().
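The split between new and old inputs can be illustrated with a small standalone sketch. It is not the planner's actual implementation: the class SlidingWindowPlanSketch and its fields are hypothetical names, and it assumes daily partitions, a previous output that overlaps the new window, and that the old inputs are the days the previous output covered that fall before the start of the new window (one reading of the description above).

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of the planner API. */
final class SlidingWindowPlanSketch {
    /** Days after the previous output's range, to be added to it. */
    final List<LocalDate> newInputs = new ArrayList<>();
    /** Days the previous output covered that precede the new window, to be subtracted. */
    final List<LocalDate> oldInputs = new ArrayList<>();

    SlidingWindowPlanSketch(LocalDate prevStart, LocalDate prevEnd,
                            LocalDate start, LocalDate end) {
        // Old inputs: inside the previous range but before the new window's start.
        for (LocalDate d = prevStart; d.isBefore(start) && !d.isAfter(prevEnd); d = d.plusDays(1)) {
            oldInputs.add(d);
        }
        // New inputs: from the day after the previous range (or the window start,
        // whichever is later) through the end of the new window.
        LocalDate dayAfterPrev = prevEnd.plusDays(1);
        LocalDate firstNew = start.isAfter(dayAfterPrev) ? start : dayAfterPrev;
        for (LocalDate d = firstNew; !d.isAfter(end); d = d.plusDays(1)) {
            newInputs.add(d);
        }
    }
}
```

For example, with a previous output covering 2013-01-01 through 2013-01-30 and a new window of 2013-01-03 through 2013-02-01, the sketch yields two old days (January 1 and 2) and two new days (January 31 and February 1), so only four days of input need to be read instead of thirty.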
Configuration properties are used to configure a
ReduceEstimator instance. This is used to
calculate how many reducers should be used.
The number of reducers to use is based on the input data size and the
num.reducers.bytes.per.reducer property. This setting can be controlled more granularly
through num.reducers.input.bytes.per.reducer and num.reducers.previous.bytes.per.reducer.
See
ReduceEstimator for more details on how the properties are used.
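As a rough sketch of how such properties could drive the reducer count, the following standalone example divides each data size by its bytes-per-reducer setting and rounds the sum up. The class ReducerEstimateSketch, the fallback order, the 256 MB default, and the minimum of one reducer are assumptions made for illustration; ReduceEstimator remains the authority on the actual behavior.

```java
import java.util.Properties;

/** Hypothetical illustration only; not the actual ReduceEstimator. */
final class ReducerEstimateSketch {
    // Assumed default, used only when no property is set.
    static final long DEFAULT_BYTES_PER_REDUCER = 256L * 1024 * 1024;

    /** Looks up the tag-specific setting, then the general one, then the assumed default. */
    static long bytesPerReducer(Properties props, String tag) {
        String specific = props.getProperty("num.reducers." + tag + ".bytes.per.reducer");
        if (specific != null) {
            return Long.parseLong(specific);
        }
        String general = props.getProperty("num.reducers.bytes.per.reducer");
        return general != null ? Long.parseLong(general) : DEFAULT_BYTES_PER_REDUCER;
    }

    /** Sums the fractional reducers needed for each kind of data and rounds up. */
    static int estimate(Properties props, long inputBytes, long previousBytes) {
        double reducers = (double) inputBytes / bytesPerReducer(props, "input")
                        + (double) previousBytes / bytesPerReducer(props, "previous");
        // Assumed floor of one reducer so the job always has somewhere to write.
        return Math.max(1, (int) Math.ceil(reducers));
    }
}
```

Under these assumptions, setting num.reducers.bytes.per.reducer to 100 and num.reducers.previous.bytes.per.reducer to 1000 gives three reducers for 250 bytes of input and 500 bytes of previous output (2.5 + 0.5).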