This is a stochastic streaming sketch that enables near-real-time analysis of the
approximate distribution of real values from a very large stream in a single pass.
The analysis is obtained using the getQuantiles(*) function or its inverse functions: the
Probability Mass Function from getPMF(*) and the Cumulative Distribution Function from getCDF(*).
Consider a large stream of one million values such as packet sizes coming into a network node.
The absolute rank of any specific size value is simply its index in the hypothetical sorted
array of values.
The normalized rank (or fractional rank) is the absolute rank divided by the stream size,
in this case one million.
The value corresponding to the normalized rank of 0.5 represents the 50th percentile or median
value of the distribution, or getQuantile(0.5). Similarly, the 95th percentile is obtained from
getQuantile(0.95). Calling getQuantiles(0.0, 1.0) returns the min and max values seen by
the sketch.
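The rank and quantile definitions above can be illustrated with an exact, non-streaming computation over a small sorted array. This is only a sketch of the definitions: the class and method names below (ExactRankExample, absoluteRank, normalizedRank, quantile) are hypothetical and are not part of the sketch API, and the sketch itself never holds the full sorted array. Clamping the index so that a fraction of 1.0 maps to the max value is an assumption made for this illustration.

```java
import java.util.Arrays;

/** Exact illustration of ranks and quantiles; hypothetical helper, not the sketch API. */
final class ExactRankExample {

  /** Absolute rank: the index v would occupy in the sorted array (count of values < v). */
  static int absoluteRank(double[] sorted, double v) {
    int rank = 0;
    while (rank < sorted.length && sorted[rank] < v) {
      rank++;
    }
    return rank;
  }

  /** Normalized rank: the absolute rank divided by the stream size. */
  static double normalizedRank(double[] sorted, double v) {
    return (double) absoluteRank(sorted, v) / sorted.length;
  }

  /** Value at the given normalized rank; the index is clamped so 1.0 maps to the max (assumption). */
  static double quantile(double[] sorted, double fraction) {
    int index = (int) (fraction * sorted.length);
    return sorted[Math.min(index, sorted.length - 1)];
  }

  public static void main(String[] args) {
    // Ten made-up packet sizes standing in for the one-million-value stream.
    double[] sizes = {1500, 64, 576, 1000, 128, 40, 1200, 256, 512, 900};
    Arrays.sort(sizes);
    System.out.println("absolute rank of 512:   " + absoluteRank(sizes, 512.0));
    System.out.println("normalized rank of 512: " + normalizedRank(sizes, 512.0));
    System.out.println("median:                 " + quantile(sizes, 0.5));
    System.out.println("min, max:               " + quantile(sizes, 0.0) + ", " + quantile(sizes, 1.0));
  }
}
```

With a real stream the sorted array is only hypothetical, which is why the sketch returns approximate answers with a bounded rank error rather than these exact values.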
Given min and max values of, for example, 1 and 1000 bytes,
you can obtain the PMF from getPMF(100, 500, 900), which returns an array of
4 fractional values such as {.4, .3, .2, .1}. This means that
- 40% of the values were < 100,
- 30% of the values were ≥ 100 and < 500,
- 20% of the values were ≥ 500 and < 900, and
- 10% of the values were ≥ 900.
A frequency histogram can be obtained by simply multiplying these fractions by getN(),
which is the total count of values received.
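The interval semantics of the split points and the conversion to a frequency histogram can be sketched as follows. The code computes an exact PMF over a small array purely to show which interval each value falls into, then scales fractions by the stream size. The names PmfHistogramExample, exactPmf, and toHistogram are hypothetical and not part of the sketch API; rounding to the nearest whole count is an assumption of this illustration.

```java
/** Illustrates split-point intervals and PMF-to-histogram scaling; hypothetical helper. */
final class PmfHistogramExample {

  /**
   * Exact PMF for ascending split points: bucket 0 holds values below the first split point,
   * bucket i holds values that are >= splitPoints[i-1] and < splitPoints[i],
   * and the last bucket holds values >= the last split point.
   */
  static double[] exactPmf(double[] values, double[] splitPoints) {
    long[] counts = new long[splitPoints.length + 1];
    for (double v : values) {
      int bucket = 0;
      while (bucket < splitPoints.length && v >= splitPoints[bucket]) {
        bucket++;
      }
      counts[bucket]++;
    }
    double[] pmf = new double[counts.length];
    for (int i = 0; i < counts.length; i++) {
      pmf[i] = (double) counts[i] / values.length;
    }
    return pmf;
  }

  /** Scales PMF fractions by the stream size n to get approximate per-interval counts. */
  static long[] toHistogram(double[] pmf, long n) {
    long[] histogram = new long[pmf.length];
    for (int i = 0; i < pmf.length; i++) {
      histogram[i] = Math.round(pmf[i] * n);
    }
    return histogram;
  }

  public static void main(String[] args) {
    double[] pmf = {0.4, 0.3, 0.2, 0.1};   // the example fractions from the text
    long n = 1_000_000L;                   // stands in for the value returned by getN()
    for (long count : toHistogram(pmf, n)) {
      System.out.println(count);
    }
  }
}
```

Because the fractions returned by the sketch are approximate, the resulting counts are estimates and may not sum exactly to the stream size after rounding.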
getCDF(*) works similarly, but produces the cumulative distribution instead.
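The relationship between the two outputs can be sketched as a running sum: each cumulative entry is the total mass of all intervals up to and including that one. The helper below is a hypothetical illustration of that relationship, not the sketch's implementation, and it assumes the cumulative array has the same length as the PMF array with a final entry of 1.0.

```java
/** Illustrates the PMF-to-CDF relationship as a running sum; hypothetical helper. */
final class CdfFromPmfExample {

  /** Returns the running sum of the PMF fractions. */
  static double[] cumulative(double[] pmf) {
    double[] cdf = new double[pmf.length];
    double sum = 0.0;
    for (int i = 0; i < pmf.length; i++) {
      sum += pmf[i];
      cdf[i] = sum;
    }
    return cdf;
  }

  public static void main(String[] args) {
    // PMF fractions for split points 100, 500, 900 from the example in the text.
    double[] cdf = cumulative(new double[] {0.4, 0.3, 0.2, 0.1});
    // Approximately 0.4, 0.7, 0.9, 1.0: the fractions of values below 100, 500, 900, and overall.
    for (double c : cdf) {
      System.out.printf("%.2f%n", c);
    }
  }
}
```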
The accuracy of this sketch is a function of the configured value k, which also affects
the overall size of the sketch. Accuracy of this quantile sketch is always with respect to
the normalized rank. A k of 128 produces a normalized rank error of about 1.7%.
For example, the median value returned from getQuantile(0.5) will lie between the actual values
from the hypothetical sorted array of input values at normalized ranks of 0.483 and 0.517, with
a confidence of about 99%.
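The arithmetic behind the 0.483 and 0.517 bounds is simply the queried normalized rank plus or minus the rank error. The sketch below shows that calculation and converts the bounds to absolute ranks for a stream of one million values. The names RankBoundsExample, rankBounds, and toAbsoluteRank are hypothetical, and the 0.017 error is the rounded figure quoted above for k = 128; clamping the bounds to the range 0 to 1 is an assumption of this illustration.

```java
/** Illustrates the rank interval implied by a normalized rank error; hypothetical helper. */
final class RankBoundsExample {

  /** Returns {lower, upper} normalized-rank bounds, clamped to the range [0, 1]. */
  static double[] rankBounds(double normalizedRank, double rankError) {
    double lower = Math.max(0.0, normalizedRank - rankError);
    double upper = Math.min(1.0, normalizedRank + rankError);
    return new double[] {lower, upper};
  }

  /** Converts a normalized rank to an absolute rank for a stream of n values. */
  static long toAbsoluteRank(double normalizedRank, long n) {
    return Math.round(normalizedRank * n);
  }

  public static void main(String[] args) {
    double[] bounds = rankBounds(0.5, 0.017);   // median query, k = 128
    long n = 1_000_000L;
    System.out.printf("normalized ranks: %.3f to %.3f%n", bounds[0], bounds[1]);
    System.out.println("absolute ranks:   " + toAbsoluteRank(bounds[0], n)
        + " to " + toAbsoluteRank(bounds[1], n));
  }
}
```

Note that the error is expressed in rank, not in value: the returned median may differ from the true median by a large or small amount in bytes, depending on how densely values are packed around the middle of the distribution.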
Table Guide for DoublesSketch Size in Bytes and Approximate Error:
K => | 16 32 64 128 256 512 1,024
~ Error => | 12.145% 6.359% 3.317% 1.725% 0.894% 0.463% 0.239%
N | Size in Bytes ->
------------------------------------------------------------------------
0 | 8 8 8 8 8 8 8
1 | 72 72 72 72 72 72 72
3 | 72 72 72 72 72 72 72
7 | 104 104 104 104 104 104 104
15 | 168 168 168 168 168 168 168
31 | 296 296 296 296 296 296 296
63 | 424 552 552 552 552 552 552
127 | 552 808 1,064 1,064 1,064 1,064 1,064
255 | 680 1,064 1,576 2,088 2,088 2,088 2,088
511 | 808 1,320 2,088 3,112 4,136 4,136 4,136
1,023 | 936 1,576 2,600 4,136 6,184 8,232 8,232
2,047 | 1,064 1,832 3,112 5,160 8,232 12,328 16,424
4,095 | 1,192 2,088 3,624 6,184 10,280 16,424 24,616
8,191 | 1,320 2,344 4,136 7,208 12,328 20,520 32,808
16,383 | 1,448 2,600 4,648 8,232 14,376 24,616 41,000
32,767 | 1,576 2,856 5,160 9,256 16,424 28,712 49,192
65,535 | 1,704 3,112 5,672 10,280 18,472 32,808 57,384
131,071 | 1,832 3,368 6,184 11,304 20,520 36,904 65,576
262,143 | 1,960 3,624 6,696 12,328 22,568 41,000 73,768
524,287 | 2,088 3,880 7,208 13,352 24,616 45,096 81,960
1,048,575 | 2,216 4,136 7,720 14,376 26,664 49,192 90,152
2,097,151 | 2,344 4,392 8,232 15,400 28,712 53,288 98,344
4,194,303 | 2,472 4,648 8,744 16,424 30,760 57,384 106,536
8,388,607 | 2,600 4,904 9,256 17,448 32,808 61,480 114,728
16,777,215 | 2,728 5,160 9,768 18,472 34,856 65,576 122,920
33,554,431 | 2,856 5,416 10,280 19,496 36,904 69,672 131,112
67,108,863 | 2,984 5,672 10,792 20,520 38,952 73,768 139,304
134,217,727 | 3,112 5,928 11,304 21,544 41,000 77,864 147,496
268,435,455 | 3,240 6,184 11,816 22,568 43,048 81,960 155,688
536,870,911 | 3,368 6,440 12,328 23,592 45,096 86,056 163,880
1,073,741,823 | 3,496 6,696 12,840 24,616 47,144 90,152 172,072
2,147,483,647 | 3,624 6,952 13,352 25,640 49,192 94,248 180,264
4,294,967,295 | 3,752 7,208 13,864 26,664 51,240 98,344 188,456
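The growth pattern in the table can be reproduced with a short formula. The sketch below is an empirical fit inferred from the tabulated values only, not code taken from the library: it assumes a 40-byte header for a non-empty sketch, a buffer that grows in powers of two from 4 items up to 2k items, and one additional block of k items for each bit needed to represent N / (2k). The class name, constants, and method name are all hypothetical.

```java
/** Empirical fit to the size table above; an assumption-based estimate, not the library's sizing code. */
final class SketchSizeEstimate {

  static final int EMPTY_BYTES = 8;        // size of an empty sketch, from the N = 0 row
  static final int HEADER_BYTES = 40;      // inferred: non-empty size minus 8 bytes per retained slot
  static final int MIN_BUFFER_ITEMS = 4;   // inferred from the N = 1 and N = 3 rows

  /** Estimated size in bytes for parameter k and stream length n, matching the tabulated rows. */
  static long estimatedBytes(int k, long n) {
    if (n == 0) {
      return EMPTY_BYTES;
    }
    long twoK = 2L * k;
    long bufferItems;
    if (n < twoK) {
      // Smallest power of two that is >= n, but never below the minimum buffer size.
      long pow2 = (n <= 1) ? 1 : Long.highestOneBit(n - 1) << 1;
      bufferItems = Math.min(twoK, Math.max(MIN_BUFFER_ITEMS, pow2));
    } else {
      bufferItems = twoK;
    }
    // Number of bits needed to represent n / (2k); zero while n < 2k.
    int levels = 64 - Long.numberOfLeadingZeros(n / twoK);
    return HEADER_BYTES + 8L * (bufferItems + (long) levels * k);
  }

  public static void main(String[] args) {
    System.out.println(estimatedBytes(128, 1_048_575L));      // table entry: 14,376
    System.out.println(estimatedBytes(1024, 4_294_967_295L)); // table entry: 188,456
  }
}
```

Under this fit, size grows logarithmically with N once the buffer is full: each doubling of the stream length adds one block of k items, or 8k bytes, which matches the constant row-to-row increments in the lower part of the table.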
There is more documentation available on
DataSketches.GitHub.io.
This is an implementation of the Low Discrepancy Mergeable Quantiles Sketch, using double
values, described in section 3.2 of the journal version of the paper "Mergeable Summaries"
by Agarwal, Cormode, Huang, Phillips, Wei, and Yi.
This algorithm is independent of the distribution of values, which can be anywhere in the
range of the IEEE-754 64-bit doubles.
This algorithm intentionally inserts randomness into the sampling process for values that
ultimately get retained in the sketch. The results produced by this algorithm are not
deterministic. For example, if the same stream is inserted into two different instances of this
sketch, the answers obtained from the two sketches may not be identical.
Similarly, there may be directional inconsistencies. For example, if the array of
values returned by getQuantiles(fractions[]) is used as input to the reverse directional query
getPMF(splitPoints[]), the result may not reproduce the original fractional values.