Background count in significant_terms not consistent?

It's a general issue with distributed analytics :slight_smile:

I always say it comes down to a pick-2-of-3 conundrum between Fast, Accurate, and Big. It's about trade-offs.

  • If you wanted BA (Big and Accurate) you'd use something like Hadoop and stream all term frequencies in a series of map-reduce phases until you got your answer. Accurate, but it takes a while and may end up being the answer to yesterday's question.
  • We do BF (Big and Fast), which means we do things quickly but often using sketches of the data from each shard, which can lose accuracy.
  • FA (Fast and Accurate) is the pre-big-data era of computing, when we had the luxury of doing everything on one machine.

So significant_terms is one of these analytical functions that can suffer in accuracy because of how the data is distributed. Perhaps a worst-case scenario is one where you have time-based indices and you query across them all to find the most significant IP address visiting your website today. Shards outside of the chosen time range will match nothing, and so they are not going to report background stats for every IP address that they have seen. The results therefore don't see the whole picture.
This is an extreme example where the physical distribution of the data is not conducive to the particular analysis you want to perform, so you would need to consider an alternative approach. A sketch of the kind of request that hits this worst case is below.
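
For concreteness, here's a minimal sketch of that worst case, assuming the elasticsearch-py client, an illustrative `logs-*` daily index pattern, and hypothetical `@timestamp` / `clientip` (keyword) field names; none of these come from the original question, and the `body=` form is just one way to send the request:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Foreground: only today's documents. Shards that hold older daily indices
# match nothing here, so they contribute no background term frequencies for
# the IP addresses they have seen, which is the accuracy gap described above.
response = es.search(
    index="logs-*",  # spans all of the time-based indices
    body={
        "size": 0,
        "query": {
            "range": {"@timestamp": {"gte": "now/d"}}
        },
        "aggs": {
            "significant_ips": {
                "significant_terms": {"field": "clientip"}
            }
        },
    },
)

for bucket in response["aggregations"]["significant_ips"]["buckets"]:
    print(bucket["key"], bucket["score"])
```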
