I'm reading about percentile aggregations in the Elasticsearch documentation.
When discussing "TDigest" -- the algorithm used to calculate percentiles -- it says:
Accuracy is proportional to q(1-q).
I assume "q" refers to the percentile we want.
I forgot what the graph of q*(1-q) looks like, so I did a simple test in Python. I used 100 instead of 1 because the q values I passed were in [1, 100] (this could be my first mistake! See below).
>>> list(map(lambda q: q*(100-q), range(1, 100)))
[99, 196, 291, 384, 475, 564, 651, 736, 819, 900, 979, 1056, 1131, 1204, 1275,
 1344, 1411, 1476, 1539, 1600, 1659, 1716, 1771, 1824, 1875, 1924, 1971, 2016,
 2059, 2100, 2139, 2176, 2211, 2244, 2275, 2304, 2331, 2356, 2379, 2400, 2419,
 2436, 2451, 2464, 2475, 2484, 2491, 2496, 2499, 2500, 2499, 2496, 2491, 2484,
 2475, 2464, 2451, 2436, 2419, 2400, 2379, 2356, 2331, 2304, 2275, 2244, 2211,
 2176, 2139, 2100, 2059, 2016, 1971, 1924, 1875, 1824, 1771, 1716, 1659, 1600,
 1539, 1476, 1411, 1344, 1275, 1204, 1131, 1056, 979, 900, 819, 736, 651, 564,
 475, 384, 291, 196, 99]
Maybe this is easier to understand:
>>> list(map(lambda q: q*(100-q), range(1, 100, 10)))
[99, 979, 1659, 2139, 2419, 2499, 2379, 2059, 1539, 819]
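To double-check the shape visually, here is a minimal plotting sketch of the same values (this assumes matplotlib is installed):

import matplotlib.pyplot as plt  # assumption: matplotlib is available

qs = list(range(1, 100))
plt.plot(qs, [q * (100 - q) for q in qs])  # same values as printed above
plt.xlabel("q (percentile, 1-99)")
plt.ylabel("q * (100 - q)")
plt.show()  # an upside-down parabola peaking at q = 50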
In both listings, the value of q*(1-q) is larger when q is closer to the "middle". But this contradicts the next sentence in the Elasticsearch documentation (emphasis added):
This means that extreme percentiles (e.g. 99%) are more accurate than less extreme percentiles, such as the median.
Can someone help me understand?
Note: if I use 1 instead of 100, the results of q*(1-q) become negative (presumably "lower" accuracy?), so maybe that explains it:
>>> list(map(lambda q: q*(1-q), range(1, 100, 10)))
[0, -110, -420, -930, -1640, -2550, -3660, -4970, -6480, -8190]
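For comparison, here is a sketch with q treated as a fraction in [0, 1] instead of a whole-number percentile -- just my guess at what the documentation means by q:

# Guess: q as a fraction in [0, 1] rather than a percentile in [1, 100].
qs = [x / 100.0 for x in range(1, 100, 10)]   # 0.01, 0.11, ..., 0.91
print([round(q * (1 - q), 4) for q in qs])
# [0.0099, 0.0979, 0.1659, 0.2139, 0.2419, 0.2499, 0.2379, 0.2059, 0.1539, 0.0819]

Either way the curve still peaks in the middle (q = 0.5 here), so I still don't see how the extreme percentiles end up more accurate.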