Tracer implementation based on the Google Dapper paper:
http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36356.pdf
Instead using a data store to record the tracing information, SLF4J is used. The results will be logged and a separate process on the machine can be used to extract the
tracing information from the logs and send it to a collector/data store/log aggregator/etc (if desired). This is consistent with the behavior described in the Google Dapper
paper. Valid distributed trace spans will be logged to a SLF4J logger named
#VALID_WINGTIPS_SPAN_LOGGER_NAME so you can pipe them to their own log file if desired.
Similarly, invalid spans (where this class has detected incorrect usage and knows that the spans have invalid timing info, for example) will be logged to a SLF4J
logger named
#INVALID_WINGTIPS_SPAN_LOGGER_NAME. These specially-named loggers will not be used for any other purpose.
Sampling is determined using
#rootSpanSamplingStrategy which defaults to sampling everything. You can override this by calling
#setRootSpanSamplingStrategy(RootSpanSamplingStrategy).
You can be notified of span lifecycle events (i.e. for metrics counting) by adding a listener to
#addSpanLifecycleListener(SpanLifecycleListener).
NOTE: It's important that any
SpanLifecycleListener you add is extremely lightweight or you risk distributed tracing becoming a major bottleneck for
high throughput services. If any expensive work needs to be done in a
SpanLifecycleListener then it should be done asynchronously on a thread or
threadpool separate from the application worker threads.
The format of the logging output when a span is completed is determined by
#spanLoggingRepresentation, which can be set by calling
#setSpanLoggingRepresentation(SpanLoggingRepresentation). The default is
SpanLoggingRepresentation#JSON, which causes the
log messages to use
Span#toJSON() to represent the span.
The span information is associated with a thread and is modeled as a stack, so it's possible to have nested spans inside an overall request span. These nested spans are
referred to as "sub-spans" in this class. Sub-spans are started via
#startSubSpan(String,SpanPurpose) and completed via
#completeSubSpan().
See the recommended usage section below for more information.
RECOMMENDED USAGE: In a typical usage scenario you'll want to call one of the
startRequest...() methods as soon as possible when a request enters your
application, and you'll want to call
#completeRequestSpan() as late as possible in the request/response cycle (ideally after the last of the response has been
sent to the user, although some frameworks don't let you hook in that late). In between these two calls the span that was started (the "overall-request-span")
is considered the "current span" for this thread and can be retrieved if necessary by calling
#getCurrentSpan().
NOTE: Given the thread-local nature of this class you'll want to make sure the completion
call is in a finally block or otherwise guaranteed to be called no matter what (even if the request fails with an error) to prevent problems when subsequent requests
are processed on the same thread. This class does its best to recover from incorrect thread usage scenarios and log information about what happened
but the best solution is to prevent the problems from occurring in the first place.
ALSO NOTE:
Spans support Java try-with-resources statements to help guarantee proper usage in
blocking/non-async scenarios (for asynchronous scenarios please refer to
the asynchronous usage section of the Wingtips
readme). Here are some examples.
Overall request span using try-with-resources:
try(Span requestSpan = Tracer.getInstance().startRequestWith*(...)) {
// Traced blocking code for overall request (not asynchronous) goes here ...
}
// No finally block needed to properly complete the overall request span
Subspan using try-with-resources:
try (Span subspan = Tracer.getInstance().startSubSpan(...)) {
// Traced blocking code for subspan (not asynchronous) goes here ...
}
// No finally block needed to properly complete the subspan
The "request span" described above is intended to track the work done for the overall request. If you have work inside the request that you want tracked as a
separate nested span with the overall-request-span as its parent you can do so via the "sub-span" methods:
#startSubSpan(String,SpanPurpose) and
#completeSubSpan(). These nested sub-spans are pushed onto the span stack associated with the current thread and you can have them as deeply nested as you want,
but just like with the overall-request-span you'll want to make sure the completion method is called in a finally block or otherwise guaranteed to be executed even if
an error occurs. Each call to
#startSubSpan(String,SpanPurpose) causes the "current span" to be the new sub-span's parent, and causes the new sub-span to
become the "current span" by pushing it onto the span stack. Each call to
#completeSubSpan() does the reverse by popping the current span off the span stack
and completing and logging it, thus causing its parent to become the current span again.
One common use case for sub-spans is to track downstream calls separately from the overall request (e.g. HTTP calls to another service, database calls, or any other
call that crosses network or application boundaries). Start a sub-span immediately before a downstream call and complete it immediately after the downstream call
returns. You can inspect the sub-span to see how long the downstream call took from this application's point of view. If you do this around all your downstream calls
you can subtract the total time spent for all downstream calls from the time spent for the overall-request-span to determine time spent in this application vs. time
spent waiting for downstream calls to finish. And if the downstream service also performs distributed tracing and has an overall-request-span for its service call
then you can subtract the downstream service's request-span time from this application's sub-span time around the downstream call to determine how much time was
lost to network lag or any other bottlenecks between the services.
In addition to the logs this class outputs for spans it puts the trace ID and span JSON for the "current" span into the SLF4J
MDC so that all your logs can be tagged with the current span's trace ID and/or full JSON. To utilize this you
would need to add
%X{traceId}} and/or
%X{spanJson}} to your log pattern (NOTE: this only works with SLF4J frameworks that support MDC, e.g. logback
and log4j). This causes *all* log messages, including ones that come from third party libraries and have no knowledge of distributed tracing, to be output with the
current span's tracing information.
NOTE: Due to the thread-local nature of this class it is more effort to integrate with reactive (asynchronous non-blocking) frameworks like Netty or actor frameworks
than with thread-per-request frameworks. But it is not terribly difficult and the benefit of having all your log messages tagged with tracing information is worth
the effort. This class provides the following methods to help integrate with reactive frameworks:
-
#registerWithThread(Deque)
-
#unregisterFromThread()
-
#getCurrentSpanStackCopy()
See the javadocs on those methods for more detailed information, but the general pattern would be to call the
#registerWithThread(Deque) with the request's
span stack whenever a thread starts to do some chunk of work for that request, and call
#unregisterFromThread() when that chunk of work is done and the
thread is about to be freed up to work on a different request. The span stack would need to follow the request no matter what thread was processing it,
but assuming you can solve that problem in a reactive framework then the general pattern should work fine.
The asynchronous usage section of the Wingtips
readme contains further details on asynchronous Wingtips usage, including helper classes and methods to
automate or ease the handling of these scenarios. Please refer to that section of the readme if you have any
asynchronous use cases.