Returns a fluent-API builder with which you can create a
StreamSource for a Jet pipeline. The source can
emit items with native timestamps, which you can enable by calling
StreamSourceStage#withNativeTimestamps. It will use
non-cooperative processors (see Processor#isCooperative()).
Each parallel processor that drives your source has its own private
instance of a state object, which it gets from the given createFn. To get
the data items to emit to the pipeline, the processor repeatedly calls your
fillBufferFn with the state object and a buffer object. The buffer's
SourceBuilder.TimestampedSourceBuffer#add method
takes two arguments: the item and the timestamp in milliseconds.
Your function should add some items to the buffer, ideally those it has
ready without having to block. A hundred items at a time is enough to
eliminate any per-call overheads within Jet. If it doesn't have any
items ready, it may also return without adding anything. In any case the
function should not take more than a second or so to complete, otherwise
you risk interfering with Jet's coordination mechanisms and getting bad
performance.
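The batching advice above can be sketched without any Jet dependency. In this stand-alone model, the Event class, the MAX_BATCH constant and the BiConsumer standing in for the buffer's add method are all assumptions made for illustration; they are not part of Jet's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.BiConsumer;

/** Stand-alone model of a fill-buffer function; it does not use Jet's API. */
final class BatchFill {
    /** Hypothetical event type: a payload plus its timestamp in milliseconds. */
    static final class Event {
        final String payload;
        final long timestampMillis;

        Event(String payload, long timestampMillis) {
            this.payload = payload;
            this.timestampMillis = timestampMillis;
        }
    }

    /** About a hundred items per call, as the text suggests. */
    static final int MAX_BATCH = 100;

    /**
     * Moves at most MAX_BATCH ready events from the queue to the buffer and
     * returns how many were moved. poll() never blocks, so an empty queue
     * makes the method return immediately with 0.
     */
    static int fillBuffer(Queue<Event> ready, BiConsumer<String, Long> bufferAdd) {
        int added = 0;
        Event e;
        while (added < MAX_BATCH && (e = ready.poll()) != null) {
            bufferAdd.accept(e.payload, e.timestampMillis);
            added++;
        }
        return added;
    }

    public static void main(String[] args) {
        Queue<Event> ready = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < 250; i++) {
            ready.add(new Event("item-" + i, 1_000L + i));
        }
        List<String> emitted = new ArrayList<>();
        BiConsumer<String, Long> bufferAdd = (item, ts) -> emitted.add(ts + ":" + item);
        // Three calls drain 100, 100 and 50 items; the fourth adds nothing.
        for (int call = 1; call <= 4; call++) {
            System.out.println("call " + call + " added " + fillBuffer(ready, bufferAdd));
        }
    }
}
```

In a real source the queue would be fed by whatever client your state object wraps; the point is only that each call returns promptly, with or without items.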
Unless you call
SourceBuilder.TimestampedStream#distributed(int), Jet will create just a single processor that
should emit all the data. If you do call it, make sure your distributed
source takes care of splitting the data between processors. Your createFn
should consult Context#totalParallelism() and
Context#globalProcessorIndex(). Jet calls it exactly once with each
globalProcessorIndex from 0 to totalParallelism - 1, and each of the
resulting state objects must emit its unique slice of the total source data.
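One possible way to give each state object a unique slice is round-robin assignment over the partitions of the external system. The sketch below is plain Java with no Jet dependency; the partitionsFor helper and the notion of a fixed partitionCount are assumptions for illustration. In a real createFn the two index arguments would come from Context#globalProcessorIndex() and Context#totalParallelism().

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-alone sketch of slicing source data between parallel processors. */
final class SourceSlicing {
    /**
     * Returns the partition ids owned by one processor: every partition p
     * with p % totalParallelism == globalProcessorIndex.
     */
    static List<Integer> partitionsFor(int globalProcessorIndex,
                                       int totalParallelism,
                                       int partitionCount) {
        if (totalParallelism <= 0
                || globalProcessorIndex < 0
                || globalProcessorIndex >= totalParallelism) {
            throw new IllegalArgumentException("index must be in [0, totalParallelism)");
        }
        List<Integer> owned = new ArrayList<>();
        for (int p = globalProcessorIndex; p < partitionCount; p += totalParallelism) {
            owned.add(p);
        }
        return owned;
    }

    public static void main(String[] args) {
        int totalParallelism = 4;
        int partitionCount = 10;
        // Prints [0, 4, 8], [1, 5, 9], [2, 6] and [3, 7] for processors 0 to 3.
        for (int index = 0; index < totalParallelism; index++) {
            System.out.println("processor " + index + " -> "
                    + partitionsFor(index, totalParallelism, partitionCount));
        }
    }
}
```

Every partition lands in exactly one slice, so no item is emitted twice and none is skipped, as long as the partition count stays fixed for the lifetime of the job.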
Here's an example that builds a simple, non-distributed source that
polls a URL and emits all the lines it gets in the response,
interpreting the first 9 characters as the timestamp.
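StreamSource<String> socketSource = SourceBuilder
    .timestampedStream("http-source", ctx -> HttpClients.createDefault())
    .<String>fillBufferFn((httpc, buf) -> {
        // Sketch: Apache HttpClient is assumed and the URL is a placeholder.
        new BufferedReader(new InputStreamReader(httpc
                .execute(new HttpGet("http://localhost:8008")).getEntity().getContent()))
            .lines()
            .forEach(line -> buf.add(line.substring(9),
                    Long.parseLong(line.substring(0, 9))));
    })
    .destroyFn(Closeable::close)
    .build();
Pipeline p = Pipeline.create();
StreamStage<String> srcStage = p.drawFrom(socketSource)
    .withNativeTimestamps(SECONDS.toMillis(5));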
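The line format used by the example (the first 9 characters are the timestamp, the rest is the item) can be isolated in a small helper. This is a stand-alone sketch; the TimestampedLine class and its field names are hypothetical, and it assumes the 9 characters are decimal digits.

```java
/** Hypothetical holder for one parsed line: a timestamp and the remaining text. */
final class TimestampedLine {
    /** Number of leading characters that hold the timestamp. */
    static final int TIMESTAMP_WIDTH = 9;

    final long timestampMillis;
    final String item;

    private TimestampedLine(long timestampMillis, String item) {
        this.timestampMillis = timestampMillis;
        this.item = item;
    }

    /** Splits a line into its 9-character numeric prefix and the item after it. */
    static TimestampedLine parse(String line) {
        if (line.length() < TIMESTAMP_WIDTH) {
            throw new IllegalArgumentException("line shorter than timestamp prefix: " + line);
        }
        long timestamp = Long.parseLong(line.substring(0, TIMESTAMP_WIDTH));
        return new TimestampedLine(timestamp, line.substring(TIMESTAMP_WIDTH));
    }

    public static void main(String[] args) {
        TimestampedLine parsed = parse("000012345hello");
        // Prints: 12345 hello
        System.out.println(parsed.timestampMillis + " " + parsed.item);
    }
}
```

A fillBufferFn could then pass parsed.item and parsed.timestampMillis to the buffer's add method. Failing fast on short or non-numeric lines is a design choice of this sketch; a production source might prefer to log and skip them.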
NOTE 1: the source you build with this builder is not
fault-tolerant. You shouldn't use it in jobs that require a processing
guarantee. Use
Sources#streamFromProcessorWithWatermarks(String,Function) if you need fault tolerance.
NOTE 2: if the data source you're adapting to Jet is
partitioned, you may run into issues with event skew between partitions
assigned to a single parallel processor. The timestamp you get from one
partition may be significantly behind the timestamp you already got from
another partition. If the skew is more than the allowed lag you
configured with StreamSourceStage#withNativeTimestamps(long),
you risk that the events will be dropped. Use
Sources#streamFromProcessorWithWatermarks(String,Function) if you need to coalesce watermarks from multiple partitions.
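To see why skew matters, consider a deliberately simplified model in which the watermark is the highest timestamp seen so far minus the allowed lag, and any event behind the watermark counts as late. This is an assumption made for illustration, not Jet's actual watermark implementation; the SkewModel class and the sample timestamps are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of late events under a single, non-coalesced watermark. */
final class SkewModel {
    /**
     * Walks the timestamps in arrival order and returns those that arrive
     * behind the watermark (highest timestamp seen so far minus allowed lag).
     */
    static List<Long> lateEvents(long[] timestamps, long allowedLagMillis) {
        List<Long> late = new ArrayList<>();
        long watermark = Long.MIN_VALUE;
        for (long ts : timestamps) {
            if (ts < watermark) {
                late.add(ts);
            }
            watermark = Math.max(watermark, ts - allowedLagMillis);
        }
        return late;
    }

    public static void main(String[] args) {
        // Partition A is at about 10 s, partition B lags 7 s behind it;
        // one processor reads both and interleaves their events.
        long[] interleaved = {10_000L, 3_000L, 10_100L, 3_100L};
        // Lag of 5 s is smaller than the skew: prints [3000, 3100].
        System.out.println(lateEvents(interleaved, 5_000L));
        // Lag of 8 s covers the skew: prints [].
        System.out.println(lateEvents(interleaved, 8_000L));
    }
}
```

Raising the allowed lag hides the skew at the cost of higher latency, which is why coalescing watermarks per partition is the more robust option for partitioned sources.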