Significant terms buggy due to wrong bg_count

We use significant terms to find services related to specified group of users.
each user is a document, having a field 'services' which is an array of strings.

The query bellow returns false report of anomalies due to totally buggy bg_count:

  "aggregations" : {
    "anomalies" : {
      "doc_count" : 15828,
      "bg_count" : 1734146,
      "buckets" : [
        {
          "key" : "QWAVE: 1",
          "doc_count" : 15512,
          "score" : 21.78548921172707,
          "bg_count" : 73163    **// in reality the count is 1500000!!!**
        }

The query:

POST /winqual_stats2/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"term": {
"productVersion": "20.1"
}
}
]
}
},
"aggregations": {
"anomalies": {
"significant_terms": {
"field": "serviceList",
"size": 10
}
}
}
}

Have you read the docs on the reasons for inaccuracies?
If you can put all your content in one shard that helps. If not, increasing shard_size can help. You can always do a follow-up query too to flesh out true numbers

Hi, yes I have read the docs :

the returned list can be slightly off and not accurate (it could be that the term counts are slightly off

We can tolerate 20% inaccuracies, but in our case the real count is 20 times higher.
Even in such settings the inaccuracy is huge:

size : 20, shard_size: 100

I tried many other relations between size and shard_size, all return max 8% of the real count.
the query respond with :

"_shards" : {
"total" : 160,
"successful" : 160,
"skipped" : 0,
"failed" : 0

}

I would appreciate advise about this , thank you!

It depends a lot on the distribution of data. In order of accuracy:

  • If you have all data in one place (1 index + 1 shard) the algos can see everything and it's all accurate
  • If you have data spread across a handful of shards and no bias in document routing (e.g. NO time-based indices) then each shard sees roughly the same randomised subset of the original data. The algos running on each shard should narrow in on the same set of top N terms they discover and return. Setting the shard_size value to a larger multiple of the size will increase the chances that all shards will contribute all the counts for the finally selected terms in the results fusion
  • If you have multiple shards and unevenly distributed data (e.g. using time based indices) then each shard has a biased view of the world and some questions may be impossible to answer at all. The worst case scenario is the "trending topics" question where you are asking what is special about today Vs all history and yet today's content is held on a machine away from the rest of history - the data is not physically organised in a way that allows the question to be answered.

To use an analogy - imagine shards are like continents in ancient times. You could ask each continent what they saw as the difference between, say adults and children and they'd each have enough local examples to draw valid and similar conclusions (weight, height, hair) etc. Now imagine asking continents what the difference is between eskimos and nomadic tribes of the Sahara. If you've never seen an eskimo you don't even know where to start. Physical distribution matters when it comes to this diffing.

How is your data distributed? How many shards? Using time-based indices? Using custom routing?

1 Like

Thank you, this is much more clear now!
We have 160 shards, data should be evenly distributed by date and
we don't use it to find anomalies per time-window.

Another consideration is that each shard will not volunteer background stats for all the terms that are not in your search results (there could be billions of terms unconnected to your search results). If your query matches only a few or no documents on a shard then it may not hit on terms that are found in other shards. That will give an incomplete figure.
160 shards sounds a lot - how many documents do you have in each shard?

I our cluster is used for various indexes, some of which huge, my case is smaller:
2M documents on 160 shards. each document represent a user, and containing a list of thousands events and other stats.
It's used to find anomalies at groups of users matching a multi term query which may include a match_phrase.

We use background_filter to compare the matched users to a subset of the entire population.

That seems a massive number of shards for 2m documents?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.