Background count in significant_terms not consistent?

It's a general issue with distributed analytics :slight_smile:

I always say it comes down to a pick-2-of-3 conundrum between Fast, Accurate, and Big. It's about trade-offs.

  • If you wanted BA (Big and Accurate) you'd use something like Hadoop and stream all term frequencies in a series of map-reduce phases until you got your answer. Accurate, but it takes a while and may end up being the answer to yesterday's question.
  • We do BF (Big and Fast), which means we do things quickly but often using sketches of the data from each shard, which can lose accuracy.
  • FA (Fast and Accurate) is the pre-big-data era of computing, when we had the luxury of doing everything on one machine.

So significant_terms is one of these analytical functions that can suffer in accuracy because of how the data is distributed. Perhaps a worst-case scenario is one where you have time-based indices and you query across them all to find the most significant IP address visiting your website today. Shards outside of the chosen time range will match nothing, and so they are not going to report background stats for every IP address that they have seen. The results therefore don't see the whole picture.
This is an extreme example where the physical distribution of the data is not conducive to the particular analysis you want to perform, so you would need to consider an alternative approach. A sketch of the kind of request that hits this worst case is below.
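
For concreteness, here's a minimal sketch of that worst case, assuming the elasticsearch-py client, an illustrative `logs-*` daily index pattern, and hypothetical `@timestamp` / `clientip` (keyword) field names; none of these come from the original question, and the `body=` form is just one way to send the request:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Foreground: only today's documents. Shards that hold older daily indices
# match nothing here, so they contribute no background term frequencies for
# the IP addresses they have seen, which is the accuracy gap described above.
response = es.search(
    index="logs-*",  # spans all of the time-based indices
    body={
        "size": 0,
        "query": {
            "range": {"@timestamp": {"gte": "now/d"}}
        },
        "aggs": {
            "significant_ips": {
                "significant_terms": {"field": "clientip"}
            }
        },
    },
)

for bucket in response["aggregations"]["significant_ips"]["buckets"]:
    print(bucket["key"], bucket["score"])
```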
