Percentile Aggregation: Help me understand "Accuracy is proportional to q(1-q)."

I'm reading about percentile aggregations:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-percentile-aggregation.html

When discussing "TDigest" -- the algorithm used to calculate percentiles -- it says:

Accuracy is proportional to q(1-q).

I assume "q" refers to the percentile we want.

I forgot what the graph q*(1-q) looks like, so I did a simple test (in Python). I used 100 instead of 1 because the values I passed were in [1, 100] (this could be my first mistake! See below)

>>> map(lambda q: q*(100-q), range(1,100))
[99, 196, 291, 384, 475, 564, 651, 736, 819, 900, 979, 1056, 1131, 1204, 1275, 1
344, 1411, 1476, 1539, 1600, 1659, 1716, 1771, 1824, 1875, 1924, 1971, 2016, 205
9, 2100, 2139, 2176, 2211, 2244, 2275, 2304, 2331, 2356, 2379, 2400, 2419, 2436,
 2451, 2464, 2475, 2484, 2491, 2496, 2499, 2500, 2499, 2496, 2491, 2484, 2475, 2
464, 2451, 2436, 2419, 2400, 2379, 2356, 2331, 2304, 2275, 2244, 2211, 2176, 213
9, 2100, 2059, 2016, 1971, 1924, 1875, 1824, 1771, 1716, 1659, 1600, 1539, 1476,
 1411, 1344, 1275, 1204, 1131, 1056, 979, 900, 819, 736, 651, 564, 475, 384, 291
, 196, 99]

Maybe this is easier to understand:

>>> map(lambda q: q*(100-q), range(1,100, 10))
[99, 979, 1659, 2139, 2419, 2499, 2379, 2059, 1539, 819]

In both cases, when q is closer to the "middle", the value of q*(1-q) is larger. But this contradicts the next sentence in Elasticsearch documentation (emphasis added)

This means that extreme percentiles (e.g. 99%) are more accurate than less extreme percentiles, such as the median.

Can someone help me understand?

Note, if I used 1, the results of q*(1-q) become negative, (presumably a "lower" accuracy?), so maybe that explains it:

>>> map(lambda q: q*(1-q), range(1,100, 10))
[0, -110, -420, -930, -1640, -2550, -3660, -4970, -6480, -8190]

The value of q is in [0, 1], as percentiles are represented as a fraction of 1. Note that the value of q(1 - q) is small when q is near 0 or q is near 1. Its extreme value occurs when the derivative of q -> q(1 - q) is zero. In this case, that derivative is q -> 1 - 2q which has a single zero at q = 1/2, representing the median. We can see this is a maximum since the derivative is positive when q < 1/2 and the derivative is negative when q > 1/2 (so the values of q -> q(1 - q) increase as q increases to 1/2 and the values of q -> q(1 - q) decrease as q increases away from 1/2).

Rewrite it as q(1 - q) = q - q^2 = -q^2 + q. From the last form we can see the graph is an upside down parabola (the coefficient on q^2 is -1) and from the first form we can see it has zeros at q = 0 and q = 1. Thus, the graph over [0, 1] is an upside-down parabola with zeros at q = 0 and q = 1, and a maximum value at q = 1/2.

Thank you. So here's the chart you're describing:

And as you said, the value of q*(1-q) "is a maximum" at q = 0.5; i.e. when q is nearest to the median (i.e. less extreme)...

But that contradicts what the documentation says:

that extreme percentiles (e.g. 99%) are more accurate than less extreme percentiles

So shouldn't it say:

Accuracy is inversely proportional to q(1-q).

In other words, they lead me to believe that the y-axis of my graph is proportional to accuracy, but that would lead me to say that less extreme percentiles (i.e. 0.5) are more accurate (since my graph reaches maximum at 0.5) -- and I think that's wrong.

The terminology is confusing, but it is correct here. Accuracy relative to q(1 - q) means that the error is bounded by a constant times q(1 - q). That is, there is a positive constant C so that |td(q) - f(q)| <= Cq(1 - q) where q -> td(q) is the t-digest estimate, and q -> f(q) is the actual value.

The intuition for why the extreme percentiles are the most accurate is due to the fact that at the extremes the t-digest algorithm uses small clusters and towards the median, larger clusters are used (the size of the clusters is proportional in size to (q(1 - q))^(1/2)).

I like what you said:

The error is bound by a constant [multiplied by] q(1-q)

In my overly-simplistic takeaway:

The error is proportional to q(1-q)

I don't propose to correct documentation, only improve my understanding.

And I also appreciate your intuitive explanation; smaller clusters = more accurate = at extremes.

I will learn more about this expression |td(q) - f(q)| <= Cq(1 - q) if/when I improve my understanding of algorithms in general and T-Digest specifically. I can explain it in plain English thanks to you, but I still need to learn "why".

Thanks again!

Sure, can you elaborate on the "why" question that you're trying to understand?

Maybe "how" is better than "why"
I'm new to algorithms, so in the same way someone can understand how nested for loops might result in O(n2^), I want to see how T-Digest results in q(1-q)

[I know am probably confusing my concepts here, you never mentioned "big O"... I think you understand what I mean-- how code can result in functions of accuracy or time or whatever]

Maybe "grok" is better than "understand"
You've explained "why/how" pretty nicely in plain-English, but I tend to dive deep into source-code itself to really "prove" of the origin of q(1-q)...

So that's what I mean :slight_smile: I doubt my journey in grokking will produce any better summary than what you've provided, though. I'll only followup here if it produces new questions.

Thanks for all your help, @jasontedor

Maybe it's a convention to talk about "accuracy" or "more accuracy" when really referring to an error function. (i.e. "flip it"; "less erorr" -> "more accurate")
Kind of like measuring "getting colder" even though we measure energy/heat. (i.e. "flip it"; "less heat" -> "more cold")

Even if it's not, that's how I'll remember it! :smiley:

It's effectively by design of t-digest. The basic idea behind t-digest is to perform k-means clustering on the data points. However, there is a nifty idea which is to only retain clusters that have size proportional to q(1 - q)n where n is the number of points (the constant of proportionality further dictates the accuracy). Because of this strict bound on the size of the clusters at the extreme, the error is kept small at the extremes and has size proportional to q(1 - q). I hope that helps at least guide your intuition when you read the code?

You are very much welcome. :smile: