Percentiles aggregation does not behave as expected


I've been experimenting with percentile aggregations on a dataset after getting some odd results in Elasticsearch 6.3.2.

My dataset consists of 467 floating point numbers, which can be found here. To rule out differences in how the values are stored, I ingested the dataset (one document per value) into two indices: one that maps the field as a float, and one that maps it as a scaled_float with "scaling_factor": 1000. The requested percentiles are "percents": [50, 90, 95, 99, 99.99]. In both cases the results are very similar to each other, yet very different from the expected percentiles (computed with Python's NumPy).
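For readers who want to reproduce the setup, the two mappings and the percentiles request could be sketched roughly as follows. This is only a sketch: the field name `value` and the aggregation name `value_percentiles` are hypothetical, since the post does not name them, and the dictionaries would still need to be sent to the cluster with whatever client is at hand.

```python
import json

# Hypothetical field name; the original post does not say what the field is called.
FIELD = "value"

# Properties for the index that stores the field as a plain float.
float_mapping = {"properties": {FIELD: {"type": "float"}}}

# Properties for the index that stores the field as a scaled_float.
scaled_mapping = {
    "properties": {FIELD: {"type": "scaled_float", "scaling_factor": 1000}}
}

# Search body requesting the percentiles listed in the post.
# "size": 0 suppresses the hits so only the aggregation is returned.
percentiles_request = {
    "size": 0,
    "aggs": {
        "value_percentiles": {  # hypothetical aggregation name
            "percentiles": {
                "field": FIELD,
                "percents": [50, 90, 95, 99, 99.99],
            }
        }
    },
}

print(json.dumps(percentiles_request, indent=2))
```

The same request body can be run against both indices, so any difference in the results would come from the field type alone.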

I know from the documentation that percentiles are approximate. However, the approximation seems to behave in exactly the opposite way to what the documentation describes:

  • Accuracy is proportional to q(1-q). This means that extreme percentiles (e.g. 99%) are more accurate than less extreme percentiles, such as the median.
  • For small sets of values, percentiles are highly accurate (and potentially 100% accurate if the data is small enough).
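To make the first quoted point concrete, the q(1-q) term can be evaluated for the requested percentiles. This is a small illustration that assumes, as the quoted documentation states, that the approximation error scales with this term; the helper name `q_term` is made up for this example.

```python
# Percentiles requested in the post.
percents = [50, 90, 95, 99, 99.99]


def q_term(percent):
    """Return q * (1 - q) for a percentile given on a 0-100 scale."""
    q = percent / 100.0
    return q * (1.0 - q)


# The term is largest at the median and shrinks toward the extremes,
# so the documented error should shrink toward the extremes as well.
for p in percents:
    print(f"{p:>6}: q(1-q) = {q_term(p):.6f}")
```

Under that assumption the median should be the least accurate of the five requested percentiles and 99.99 the most accurate.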

My dataset is pretty small (only 467 samples), yet the accuracy improves as we approach the median and gets worse toward the extreme percentiles. How can this be explained?
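One way to quantify this is to compare each returned percentile against an exact reference. The sketch below computes exact percentiles in plain Python using linear interpolation between the closest ranks, which is the rule `numpy.percentile` applies by default. The function names and the five-value sample are made up for illustration and are not the real dataset.

```python
import math


def exact_percentile(values, percent):
    """Exact percentile using linear interpolation between closest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    # Fractional position of the requested percentile in the sorted list.
    rank = (percent / 100.0) * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def relative_error(approx, exact):
    """Relative deviation of an approximate value from the exact one."""
    if exact == 0:
        raise ValueError("relative error is undefined for an exact value of 0")
    return abs(approx - exact) / abs(exact)


# Made-up sample standing in for the real 467 values.
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
for p in [50, 90, 95, 99, 99.99]:
    print(p, exact_percentile(sample, p))
```

Feeding the real values and the percentiles returned by Elasticsearch into `relative_error` would show, per percentile, where the approximation drifts furthest from the exact result.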

How many shards does your index have? If you are using the default 5 primary shards, can you try with a single primary shard as well?

Hey @Christian_Dahlqvist, in both cases it was tested on a single shard.

Hey @Christian_Dahlqvist
What else can we check on the issue?

Have you tried the settings to balance memory usage and accuracy?

Yep, I've experimented with different values of the compression parameter (up to 2,000), but it does not seem to have any effect (maybe the sample size is small enough that additional 'nodes' do not add accuracy).
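For reference, a sweep over compression values might be set up along these lines. The `tdigest.compression` option is the setting the percentiles aggregation exposes for this trade-off; the field name `value`, the aggregation name `pcts`, and the specific values 100 and 1000 are assumptions for illustration, as the post only mentions going up to 2,000.

```python
def percentiles_body(field, percents, compression):
    """Build a search body for a percentiles aggregation with a given
    t-digest compression value."""
    return {
        "size": 0,  # only the aggregation result is needed
        "aggs": {
            "pcts": {  # hypothetical aggregation name
                "percentiles": {
                    "field": field,
                    "percents": percents,
                    "tdigest": {"compression": compression},
                }
            }
        },
    }


percents = [50, 90, 95, 99, 99.99]

# Illustrative compression values; only the upper bound of 2,000 comes from the post.
bodies = {c: percentiles_body("value", percents, c) for c in (100, 1000, 2000)}

for compression, body in bodies.items():
    print(compression, body["aggs"]["pcts"]["percentiles"]["tdigest"])
```

Running each body against the same index and comparing the returned percentiles would confirm whether compression changes the result at all for a dataset this small.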

I've opened an issue on GitHub on the subject -
