The AggregateFunction is a flexible aggregation function, characterized by the
following features:
- The aggregates may use different types for input values, intermediate aggregates,
and result type, to support a wide range of aggregation types.
- Support for distributive aggregations: Different intermediate aggregates can be
merged together, to allow for pre-aggregation/final-aggregation optimizations.
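
The two features above can be illustrated with a small, self-contained sketch. It does not use the real interface: SimpleAggregate is a hypothetical stand-in that mirrors the method names used in the example at the end of this section, and DistinctCount is an illustrative aggregate chosen because its input type (String), accumulator type (Set<String>), and result type (Integer) all differ. The partitioning scheme is also an assumption made for the illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DistinctCountDemo {

    // Hypothetical stand-in, not the real interface:
    // IN = input type, ACC = accumulator type, OUT = result type.
    interface SimpleAggregate<IN, ACC, OUT> {
        ACC createAccumulator();
        void add(IN value, ACC acc);
        ACC merge(ACC a, ACC b);
        OUT getResult(ACC acc);
    }

    // Input is String, intermediate state is a Set<String>, result is Integer.
    static class DistinctCount implements SimpleAggregate<String, Set<String>, Integer> {
        public Set<String> createAccumulator() { return new HashSet<>(); }
        public void add(String value, Set<String> acc) { acc.add(value); }
        public Set<String> merge(Set<String> a, Set<String> b) { a.addAll(b); return a; }
        public Integer getResult(Set<String> acc) { return acc.size(); }
    }

    // Pre-aggregates each partition on its own, then merges the partial sets.
    static int distinctAcrossPartitions(List<List<String>> partitions) {
        DistinctCount fn = new DistinctCount();
        Set<String> total = fn.createAccumulator();
        for (List<String> partition : partitions) {
            Set<String> partial = fn.createAccumulator();
            for (String value : partition) {
                fn.add(value, partial);
            }
            total = fn.merge(total, partial);
        }
        return fn.getResult(total);
    }

    public static void main(String[] args) {
        List<List<String>> partitions =
                Arrays.asList(Arrays.asList("a", "b", "a"), Arrays.asList("b", "c"));
        System.out.println(distinctAcrossPartitions(partitions)); // prints 3
    }
}
```

The result could not be obtained by adding the per-partition counts (2 + 2), which is why the partial aggregates keep the full set and only the final step reduces it to a number.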
The AggregateFunction's intermediate aggregate (in-progress aggregation state)
is called the accumulator. Values are added to the accumulator, and final aggregates are
obtained by finalizing the accumulator state. This supports aggregation functions where the
intermediate state needs to be different from both the aggregated values and the final result
type, for example an average, which typically keeps a count and a sum.
Merging intermediate aggregates (partial aggregates) means merging the accumulators.
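
As a rough sketch of this lifecycle, the following standalone program creates accumulators, adds values, merges two partial aggregates, and finalizes the result. It uses plain static methods rather than the real interface; the class name AccumulatorLifecycleDemo and the helper partial are illustrative assumptions, and the accumulator is assumed to hold a long count and a long sum.

```java
class AccumulatorLifecycleDemo {

    // Accumulator for an average: a running count and a running sum.
    static class AverageAccumulator {
        long count;
        long sum;
    }

    static AverageAccumulator createAccumulator() {
        return new AverageAccumulator();
    }

    static void add(int value, AverageAccumulator acc) {
        acc.sum += value;
        acc.count++;
    }

    // Merging partial aggregates means combining their accumulators.
    static AverageAccumulator merge(AverageAccumulator a, AverageAccumulator b) {
        a.count += b.count;
        a.sum += b.sum;
        return a;
    }

    // Finalizes the accumulator state into the result.
    static double getResult(AverageAccumulator acc) {
        return acc.sum / (double) acc.count;
    }

    // Aggregates a batch of values into a fresh accumulator (a partial aggregate).
    static AverageAccumulator partial(int... values) {
        AverageAccumulator acc = createAccumulator();
        for (int v : values) {
            add(v, acc);
        }
        return acc;
    }

    public static void main(String[] args) {
        double merged = getResult(merge(partial(1, 2, 3), partial(4, 10)));
        double direct = getResult(partial(1, 2, 3, 4, 10));
        System.out.println(merged + " " + direct); // prints 4.0 4.0
    }
}
```

Averaging the two finalized partial results instead (2.0 and 7.0) would give 4.5, which is why merging operates on accumulators rather than on results.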
The AggregateFunction itself is stateless. To allow a single AggregateFunction
instance to maintain multiple aggregates (such as one aggregate per key), the
AggregateFunction creates a new accumulator whenever a new aggregation is started.
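
One way to picture this is a caller that keeps one accumulator per key in a map while sharing a single function instance. The sketch below is an assumption about how such a caller might look, not a description of the actual runtime; SumFunction, SumAccumulator, and sumPerKey are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

class PerKeyAggregationDemo {

    // Hypothetical accumulator: all aggregation state lives here.
    static class SumAccumulator {
        long sum;
    }

    // Hypothetical stateless function: it has no fields of its own.
    static class SumFunction {
        SumAccumulator createAccumulator() { return new SumAccumulator(); }
        void add(int value, SumAccumulator acc) { acc.sum += value; }
        long getResult(SumAccumulator acc) { return acc.sum; }
    }

    // Uses one function instance and creates a new accumulator for each new key.
    static Map<String, Long> sumPerKey(String[] keys, int[] values) {
        SumFunction fn = new SumFunction();
        Map<String, SumAccumulator> accumulators = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            SumAccumulator acc =
                    accumulators.computeIfAbsent(keys[i], k -> fn.createAccumulator());
            fn.add(values[i], acc);
        }
        Map<String, Long> results = new HashMap<>();
        for (Map.Entry<String, SumAccumulator> entry : accumulators.entrySet()) {
            results.put(entry.getKey(), fn.getResult(entry.getValue()));
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, Long> sums =
                sumPerKey(new String[] {"a", "b", "a"}, new int[] {1, 10, 2});
        System.out.println(sums.get("a") + " " + sums.get("b")); // prints 3 10
    }
}
```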
Aggregation functions must be Serializable because they are sent
between distributed processes during distributed execution.
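
The requirement can be checked locally with a serialization round trip. The sketch below assumes plain Java serialization (java.io.Serializable with ObjectOutputStream and ObjectInputStream); CountFunction and the helper methods are hypothetical and stand in for a real aggregation function.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializableFunctionDemo {

    // Hypothetical stateless function; Serializable lets it be copied to another process.
    static class CountFunction implements Serializable {
        private static final long serialVersionUID = 1L;

        long[] createAccumulator() { return new long[1]; }
        void add(Object value, long[] acc) { acc[0]++; }
        long getResult(long[] acc) { return acc[0]; }
    }

    // Serializes the object to bytes and reads it back, as a stand-in for shipping it.
    static CountFunction roundTrip(CountFunction function) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(function);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (CountFunction) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("function is not serializable", e);
        }
    }

    // Counts the given values using the deserialized copy of the function.
    static long countAfterRoundTrip(String... values) {
        CountFunction copy = roundTrip(new CountFunction());
        long[] acc = copy.createAccumulator();
        for (String value : values) {
            copy.add(value, acc);
        }
        return copy.getResult(acc);
    }

    public static void main(String[] args) {
        System.out.println(countAfterRoundTrip("x", "y")); // prints 2
    }
}
```

A function that holds a non-serializable field would fail in roundTrip with a NotSerializableException, which is a cheap way to catch the problem before distributed execution.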
Example: Average and Weighted Average
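// the accumulator, which holds the state of the in-flight aggregate
public class AverageAccumulator {
    public long count;
    public long sum;
}

// implementation of an aggregation function for an 'average'
// type parameters: input value, accumulator, result
public class Average implements AggregateFunction<Integer, AverageAccumulator, Double> {

    // starts a new aggregate with an empty accumulator
    public AverageAccumulator createAccumulator() {
        return new AverageAccumulator();
    }

    // merges the partial aggregate b into a
    public AverageAccumulator merge(AverageAccumulator a, AverageAccumulator b) {
        a.count += b.count;
        a.sum += b.sum;
        return a;
    }

    // adds one input value to the accumulator
    public void add(Integer value, AverageAccumulator acc) {
        acc.sum += value;
        acc.count++;
    }

    // finalizes the accumulator state into the average
    public Double getResult(AverageAccumulator acc) {
        return acc.sum / (double) acc.count;
    }
}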

// implementation of a weighted average
// this reuses the same accumulator type as the aggregate function for 'average'
// (Datum is the input type; it is assumed to expose numeric getValue() and getWeight())
public class WeightedAverage implements AggregateFunction<Datum, AverageAccumulator, Double> {

    public AverageAccumulator createAccumulator() {
        return new AverageAccumulator();
    }

    public AverageAccumulator merge(AverageAccumulator a, AverageAccumulator b) {
        a.count += b.count;
        a.sum += b.sum;
        return a;
    }

    // count holds the total weight and sum holds the weighted values,
    // so that getResult() yields sum(weight * value) / sum(weight)
    public void add(Datum value, AverageAccumulator acc) {
        acc.count += value.getWeight();
        acc.sum += value.getValue() * value.getWeight();
    }

    public Double getResult(AverageAccumulator acc) {
        return acc.sum / (double) acc.count;
    }
}