Significant terms buggy due to wrong bg_count

Yonatan_Omer · February 17, 2020, 2:13pm

We use significant terms to find services related to specified group of users.
each user is a document, having a field 'services' which is an array of strings.

The query bellow returns false report of anomalies due to totally buggy bg_count:

  "aggregations" : {
    "anomalies" : {
      "doc_count" : 15828,
      "bg_count" : 1734146,
      "buckets" : [
        {
          "key" : "QWAVE: 1",
          "doc_count" : 15512,
          "score" : 21.78548921172707,
          "bg_count" : 73163    **// in reality the count is 1500000!!!**
        }

The query:

POST /winqual_stats2/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"term": {
"productVersion": "20.1"
}
}
]
}
},
"aggregations": {
"anomalies": {
"significant_terms": {
"field": "serviceList",
"size": 10
}
}
}
}

Mark_Harwood · February 17, 2020, 5:33pm

Have you read the docs on the reasons for inaccuracies?
If you can put all your content in one shard that helps. If not, increasing shard_size can help. You can always do a follow-up query too to flesh out true numbers

Yonatan_Omer · February 18, 2020, 10:44am

Hi, yes I have read the docs :

the returned list can be slightly off and not accurate (it could be that the term counts are slightly off

We can tolerate 20% inaccuracies, but in our case the real count is 20 times higher.
Even in such settings the inaccuracy is huge:

size : 20, shard_size: 100

I tried many other relations between size and shard_size, all return max 8% of the real count.
the query respond with :

"_shards" : {
"total" : 160,
"successful" : 160,
"skipped" : 0,
"failed" : 0

}

I would appreciate advise about this , thank you!

Mark_Harwood · February 18, 2020, 11:06am

It depends a lot on the distribution of data. In order of accuracy:

If you have all data in one place (1 index + 1 shard) the algos can see everything and it's all accurate
If you have data spread across a handful of shards and no bias in document routing (e.g. NO time-based indices) then each shard sees roughly the same randomised subset of the original data. The algos running on each shard should narrow in on the same set of top N terms they discover and return. Setting the shard_size value to a larger multiple of the size will increase the chances that all shards will contribute all the counts for the finally selected terms in the results fusion
If you have multiple shards and unevenly distributed data (e.g. using time based indices) then each shard has a biased view of the world and some questions may be impossible to answer at all. The worst case scenario is the "trending topics" question where you are asking what is special about today Vs all history and yet today's content is held on a machine away from the rest of history - the data is not physically organised in a way that allows the question to be answered.

To use an analogy - imagine shards are like continents in ancient times. You could ask each continent what they saw as the difference between, say adults and children and they'd each have enough local examples to draw valid and similar conclusions (weight, height, hair) etc. Now imagine asking continents what the difference is between eskimos and nomadic tribes of the Sahara. If you've never seen an eskimo you don't even know where to start. Physical distribution matters when it comes to this diffing.

How is your data distributed? How many shards? Using time-based indices? Using custom routing?

Yonatan_Omer · February 19, 2020, 12:47pm

Thank you, this is much more clear now!
We have 160 shards, data should be evenly distributed by date and
we don't use it to find anomalies per time-window.

Mark_Harwood · February 19, 2020, 1:51pm

Another consideration is that each shard will not volunteer background stats for all the terms that are not in your search results (there could be billions of terms unconnected to your search results). If your query matches only a few or no documents on a shard then it may not hit on terms that are found in other shards. That will give an incomplete figure.
160 shards sounds a lot - how many documents do you have in each shard?

Yonatan_Omer · February 19, 2020, 3:00pm

I our cluster is used for various indexes, some of which huge, my case is smaller:
2M documents on 160 shards. each document represent a user, and containing a list of thousands events and other stats.
It's used to find anomalies at groups of users matching a multi term query which may include a match_phrase.

We use background_filter to compare the matched users to a subset of the entire population.

Mark_Harwood · February 19, 2020, 3:03pm

That seems a massive number of shards for 2m documents?

system · March 18, 2020, 3:03pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Significant terms aggregation returns incorrect bg_count value when querying index with nested objects in version 8.3.3 Elasticsearch docker	3	247	May 16, 2023
Bg_counts in nested significant_terms aggregation Elasticsearch	3	1276	July 5, 2017
Detail questions about significant_terms aggregation Elasticsearch	1	322	July 6, 2017
Background count in significant_terms not consistent? Elasticsearch	8	3583	July 5, 2017
Sum_other_doc_count higher than total docs Elasticsearch	3	1663	August 16, 2018

Significant terms buggy due to wrong bg_count

Related topics