Base class for Hadoop jobs.
This class defines a set of common methods and configuration shared by Hadoop jobs.
Jobs can be configured either by providing properties or by calling setters.
Each property has a corresponding setter.
This class recognizes the following properties:
- input.path - Input path job will read from
- output.path - Output path job will write to
- temp.path - Temporary path under which intermediate files are stored
- retention.count - Number of days to retain in output directory
- num.reducers - Number of reducers to use
- use.combiner - Whether to use a combiner or not
- counters.path - Path to store job counters in
The input.path property may be a comma-separated list of paths. More than one path
implies that a join is to be performed. Alternatively, each path may be given in its
own property. For example, input.path.first and input.path.second define two separate
input paths.
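As a rough illustration of the two forms of input.path, the sketch below collects input paths from a java.util.Properties object. The class InputPathsSketch, the method inputPaths, and the /data/... paths are hypothetical and not part of this class. Sorting the named keys is an assumption made here only to get a deterministic order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of the job classes.
final class InputPathsSketch {
    static final String KEY = "input.path";

    // Collects input paths given in either form described above.
    static List<String> inputPaths(Properties props) {
        List<String> paths = new ArrayList<>();

        // Form 1: a single property holding a comma-separated list.
        String combined = props.getProperty(KEY);
        if (combined != null) {
            for (String part : combined.split(",")) {
                String trimmed = part.trim();
                if (!trimmed.isEmpty()) {
                    paths.add(trimmed);
                }
            }
        }

        // Form 2: one property per input, such as input.path.first.
        // Iterating in sorted key order is an assumption of this sketch.
        for (String key : new TreeSet<>(props.stringPropertyNames())) {
            if (key.startsWith(KEY + ".")) {
                paths.add(props.getProperty(key).trim());
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("input.path.first", "/data/a");   // hypothetical path
        props.setProperty("input.path.second", "/data/b");  // hypothetical path
        System.out.println(inputPaths(props));
    }
}
```

Either form yields a list of paths. Per the description above, a list with more than one entry signals a join.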
The num.reducers property fixes the number of reducers. When it is not set, the number
of reducers is computed from the input size.
The temp.path property defines the parent directory for temporary paths, not the
temporary path itself. Temporary paths are created under this directory and are named
with the prefix hourglass- followed by a GUID.
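The naming scheme for temporary paths can be sketched as follows. This is only an illustration: the class TempPathSketch and the method newTempPath are hypothetical, and using java.util.UUID to produce the GUID is an assumption, not a statement about how this class generates it.

```java
import java.util.UUID;

// Hypothetical helper for illustration only; not part of the job classes.
final class TempPathSketch {
    // Builds a path of the form <temp.path>/hourglass-<GUID>.
    static String newTempPath(String tempParent) {
        // Drop a trailing slash so the result has exactly one separator.
        String parent = tempParent.endsWith("/")
                ? tempParent.substring(0, tempParent.length() - 1)
                : tempParent;
        // Assumption: a random UUID stands in for the GUID.
        return parent + "/hourglass-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        // "/tmp/jobs" is a hypothetical value for temp.path.
        System.out.println(newTempPath("/tmp/jobs"));
    }
}
```

Because each call produces a fresh GUID, every temporary path gets its own directory under the shared parent.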
The input and output paths are the only required parameters. The rest are optional.
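A minimal sketch of a property set and a check for the two required parameters is shown below. The class RequiredPropsSketch, the method missingRequired, and all values are hypothetical. The check treats a named input such as input.path.first as satisfying the input requirement, which is an assumption based on the description of input.path above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of the job classes.
final class RequiredPropsSketch {
    // Returns the names of required properties that are absent or blank.
    static List<String> missingRequired(Properties props) {
        List<String> missing = new ArrayList<>();

        // Assumption: input may be given as input.path or as input.path.<name>.
        boolean hasInput = isSet(props, "input.path");
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith("input.path.") && isSet(props, key)) {
                hasInput = true;
            }
        }
        if (!hasInput) {
            missing.add("input.path");
        }
        if (!isSet(props, "output.path")) {
            missing.add("output.path");
        }
        return missing;
    }

    private static boolean isSet(Properties props, String key) {
        String value = props.getProperty(key);
        return value != null && !value.trim().isEmpty();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("input.path", "/data/in");    // required, hypothetical value
        props.setProperty("output.path", "/data/out");  // required, hypothetical value
        props.setProperty("retention.count", "30");     // optional
        props.setProperty("num.reducers", "10");        // optional
        System.out.println("Missing: " + missingRequired(props));
    }
}
```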
Hadoop configuration may be provided by setting a property whose name is the Hadoop
configuration key prefixed with hadoop-conf. (including the trailing dot).
For example, mapred.min.split.size can be configured by setting the property
hadoop-conf.mapred.min.split.size to the desired value.
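The prefix convention can be illustrated with a small sketch that separates Hadoop settings from the other job properties. The class HadoopConfPrefixSketch and the method hadoopSettings are hypothetical, and the sketch returns a plain map rather than touching any Hadoop API. The split size value is made up for the example.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of the job classes.
final class HadoopConfPrefixSketch {
    static final String PREFIX = "hadoop-conf.";

    // Returns the Hadoop settings found in props, with the prefix removed.
    static Map<String, String> hadoopSettings(Properties props) {
        Map<String, String> settings = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            // Skip keys without the prefix, and the bare prefix itself.
            if (key.startsWith(PREFIX) && key.length() > PREFIX.length()) {
                settings.put(key.substring(PREFIX.length()), props.getProperty(key));
            }
        }
        return settings;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("input.path", "/data/in");  // hypothetical value
        props.setProperty("hadoop-conf.mapred.min.split.size", "134217728");  // made-up value
        System.out.println(hadoopSettings(props));
    }
}
```

In this sketch only the prefixed entry is returned, under the key mapred.min.split.size. Job properties such as input.path are left out.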