A Java interface for Spark programs to implement. It provides access to the
JavaSparkExecutionContext for interacting with CDAP. For example:
public class JavaSparkTest implements JavaSparkMain {

  @Override
  public void run(JavaSparkExecutionContext sec) throws Exception {
    JavaSparkContext sc = new JavaSparkContext();

    // Create an RDD from the "input" dataset, with each value decoded as a UTF-8 String
    JavaRDD<String> inputRDD = sec.fromDataset("input").values();

    // Create a pair RDD from the "lookup" dataset, which represents a lookup table from String to Long
    JavaPairRDD<String, Long> lookupRDD = sec.fromDataset("lookup");

    // Join the "input" RDD with the "lookup" RDD and save the result to the "output" dataset
    JavaPairRDD<String, Long> resultRDD = inputRDD
      .mapToPair(new PairFunction<String, String, String>() {
        @Override
        public Tuple2<String, String> call(String s) throws Exception {
          return new Tuple2<>(s, s);
        }
      })
      .join(lookupRDD)
      .mapValues(new Function<Tuple2<String, Long>, Long>() {
        @Override
        public Long call(Tuple2<String, Long> v1) throws Exception {
          return v1._2;
        }
      });

    sec.saveAsDataset(resultRDD, "output");
  }
}
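The mapToPair/join/mapValues pipeline above computes a plain inner join keyed on the input string itself: only inputs that have an entry in the lookup table survive, and each is paired with its looked-up Long. The same computation can be sketched with plain Java collections, independent of Spark and CDAP (JoinSketch and joinWithLookup are hypothetical names, not part of either API):

```java
import java.util.*;

public class JoinSketch {

    // Inner-join each input string against a lookup table, keeping the
    // matched Long values. Mirrors mapToPair(s -> (s, s)).join(lookup)
    // .mapValues(pair -> pair._2) from the Spark example.
    static Map<String, Long> joinWithLookup(List<String> input, Map<String, Long> lookup) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (String s : input) {
            // join() is an inner join: keys present on only one side are dropped
            Long value = lookup.get(s);
            if (value != null) {
                result.put(s, value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> lookup = new HashMap<>();
        lookup.put("a", 1L);
        lookup.put("b", 2L);
        // "c" has no lookup entry, so it is dropped by the inner join
        System.out.println(joinWithLookup(Arrays.asList("a", "b", "c"), lookup));
        // prints {a=1, b=2}
    }
}
```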
This interface extends Serializable because the closures are anonymous classes in Java, and Spark
serializes the closures before sending them to worker nodes. Serializing an inner anonymous class
requires the outer containing class to be serializable as well; otherwise a
NotSerializableException is thrown. Having this interface extend Serializable therefore gives a
neater API, since implementing classes do not need to declare Serializable themselves.
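This behavior can be demonstrated with plain Java serialization, without Spark (a minimal sketch; SerializationDemo, Fn, PlainOuter, and SerializableOuter are hypothetical names for illustration):

```java
import java.io.*;

public class SerializationDemo {

    // A serializable "closure" interface, analogous to Spark's function interfaces.
    interface Fn extends Serializable {
        String call(String s);
    }

    // Outer class that does NOT implement Serializable.
    static class PlainOuter {
        final String prefix = "p-";

        Fn make() {
            // Anonymous class in an instance method: it references an outer
            // field, so it holds a reference to PlainOuter.this.
            return new Fn() {
                @Override
                public String call(String s) {
                    return prefix + s;
                }
            };
        }
    }

    // Same anonymous class, but the outer class IS serializable.
    static class SerializableOuter implements Serializable {
        final String prefix = "s-";

        Fn make() {
            return new Fn() {
                @Override
                public String call(String s) {
                    return prefix + s;
                }
            };
        }
    }

    // Attempt to serialize an object; return false on NotSerializableException.
    static boolean serializes(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The closure from the non-serializable outer class fails to serialize,
        // because serializing it pulls in the outer instance.
        System.out.println(serializes(new PlainOuter().make()));        // prints false
        System.out.println(serializes(new SerializableOuter().make())); // prints true
    }
}
```

Because JavaSparkMain itself extends Serializable, classes like JavaSparkTest above fall into the second case automatically.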