Hi,
I'm wondering if there is any reason why the background count of a significant terms aggregation would be inconsistent depending on the query filter (not a background filter).
I've run the following aggregations/queries:
Background count: 1033 (which matches the number in the entire index)
{'query': {'match_all': {}}, 'aggs': {'sig': {'significant_terms': {'field': 'tokens', 'include': ['describ']}}}}
Background count: 223
{'query': {'match': {'tokens': 'nucleotid'}}, 'aggs': {'sig': {'significant_terms': {'field': 'tokens', 'include': ['describ']}}}}
As you can see, all I did in the second scenario was swap the match_all for a match query. From my understanding, the background count should not change, since it is the count over documents in the entire index, but there is a huge difference here. Is there a reason why this would happen? This index has a parent-child mapping, and everything in the query operates on the children.
Curious. How many shards and indices are you querying across?
(Note also that a significant_terms agg on a match_all does not make sense - it's asking "what is the difference between X and X?" because the foreground and background sets are the same - but I assume this query is only for the purpose of testing...)
Yes the query here is only for testing/demonstrating the problem.
I'm querying across 1 index with 6 primary shards, each with 1 replica, on a 3-node cluster.
I ran some quick tests, and this problem does seem to be related to the number of shards and perhaps to the number of nodes in my cluster. The background counts stay consistent and correct in my tests up to 3 primary shards (1 replica each) on my 3-node cluster. Once I get to 4+ primary shards, the background counts start getting quite inaccurate, at least relatively speaking on my small dataset. It also seems to be unrelated to the mapping of the fields, and changing the shard_size on the aggregation does not seem to help either.
After running some more tests, it seems (although I'm not certain) that the problems tend to occur mostly when the query retrieves low numbers of documents (anywhere from 1-50 documents).
I suspect the issue is that background stats on a shard are only reported for terms that are present in the foreground set on that shard.
This is an optimisation to avoid always returning the background frequency of every term from every shard, which would clearly be impractical in large systems. Arguably, when the client provides include clauses, the scope of terms to be reported on is smaller and we could look to always provide background stats for the terms listed, but include clauses can also be a regular expression that might expand to match many terms.
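To make that concrete, here is a rough sketch in plain Python (not actual Elasticsearch internals, and with made-up per-shard numbers) of how the merged background count shrinks when only some shards see the term among their foreground hits:

    # Hypothetical per-shard stats for the term 'describ' on a 4-shard index.
    # bg_freq: docs on the shard containing the term (its background frequency)
    # in_foreground: did any doc matching the query on this shard contain it?
    shards = [
        {"bg_freq": 310, "in_foreground": False},
        {"bg_freq": 280, "in_foreground": False},
        {"bg_freq": 220, "in_foreground": False},
        {"bg_freq": 223, "in_foreground": True},
    ]

    # With match_all, every shard has the term in its foreground, so all report.
    true_background = sum(s["bg_freq"] for s in shards)

    # With a narrow query, shards whose foreground hits never contain the term
    # stay silent, and the coordinating node can only sum what it was sent.
    reported_background = sum(s["bg_freq"] for s in shards if s["in_foreground"])

    print(true_background)      # 1033 - what the match_all query showed
    print(reported_background)  # 223  - what the narrow query showed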
Significant_terms is primarily a tool for discovering new terms, not for reporting accurate counts for client-provided terms, and the accuracy concerns are covered in the documentation [1]. If, as in your example, you want accurate counts for a single term, there are perhaps better ways of doing this, e.g.
GET ofac/_search
{
  "query": {
    "match": {
      "_all": "farm"
    }
  },
  "size": 0,
  "aggs": {
    "all": {
      "global": {},
      "aggs": {
        "background": {
          "terms": {
            "field": "country.keyword",
            "include": [
              "Zimbabwe"
            ]
          }
        }
      }
    },
    "foreground": {
      "terms": {
        "field": "country.keyword",
        "include": [
          "Zimbabwe"
        ]
      }
    }
  }
}
If you are analysing modest volumes of relatively static data, using a single shard will always help avoid any distributed counting issues.
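For example (a sketch only, with hypothetical index names, using elasticsearch-py 7.x-style body= calls), you could create a single-shard copy and _reindex into it:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()  # hypothetical local cluster

    # Destination index with a single primary shard, so all term statistics
    # live together (mappings omitted here; copy yours across as needed).
    es.indices.create(
        index="docs_single_shard",
        body={"settings": {"number_of_shards": 1, "number_of_replicas": 1}},
    )

    # Copy the existing documents across with the _reindex API.
    es.reindex(
        body={"source": {"index": "docs"}, "dest": {"index": "docs_single_shard"}},
        wait_for_completion=True,
    )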
Cheers
Mark
[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#_approximate_counts
Thanks Mark. My concern is not actually about the background set counts per se, since, as you mentioned, there are many ways to get counts. My concern is that the significant_terms aggregation ranks terms by various metrics that rely on the background counts. So when the background counts are significantly different, the rankings and scores turn out to be significantly different as well, which seems to be a bigger problem if one wants to use this aggregation in a meaningful way in production.
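For instance, with the default JLH heuristic (roughly (foreground% - background%) * foreground% / background%, as described in the significance heuristics docs), a quick sketch with made-up counts shows how much an undercounted background frequency can inflate a term's score:

    def jlh(fg_freq, fg_size, bg_freq, bg_size):
        # Rough form of the default JLH significance score.
        fg_pct = fg_freq / fg_size
        bg_pct = bg_freq / bg_size
        return (fg_pct - bg_pct) * (fg_pct / bg_pct)

    # Same foreground stats for a term, two different background frequencies
    # (all numbers are hypothetical):
    print(jlh(fg_freq=20, fg_size=50, bg_freq=200, bg_size=1033))  # ~0.43
    print(jlh(fg_freq=20, fg_size=50, bg_freq=50, bg_size=1033))   # ~2.91 - same term, scored far higher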
It's a general issue with distributed analytics.
I always say it comes down to a pick-2-of-3 conundrum between Fast, Accurate, and Big. It's about trade-offs.
- If you wanted BA (Big and Accurate) you'd use something like Hadoop and stream all term frequencies through a series of map-reduce phases until you got your answer. Accurate, but it takes a while and may end up being the answer to yesterday's question.
- We do BF (Big and Fast), which means we do things fast but often using sketches of data from shards, which can lose accuracy.
- FA (Fast and Accurate) is the pre-big-data phase of computing, when we had the luxury of doing everything on one machine.
So significant_terms is one of these analytical functions that can suffer accuracy problems because of how the data is distributed. Perhaps a worst-case scenario is one where you have time-based indices and you query across them all to find the most significant IP address visiting your website today. Shards outside the chosen time range will match nothing and so will not report background stats for any of the IP addresses they have seen. The results then do not reflect the whole picture.
This is an extreme example where the physical distribution of the data is not conducive to the particular analysis you want to perform and so you would need to consider an alternative approach.
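As a made-up illustration of one such alternative (hypothetical index and field names, elasticsearch-py-style calls): if only today's index can contain foreground matches, point the request at that index directly rather than at the whole wildcard, so every shard involved holds both foreground and background data for the period in question:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    body = {
        "query": {"range": {"@timestamp": {"gte": "now/d"}}},
        "size": 0,
        # "clientip" is assumed to be a keyword field here.
        "aggs": {"sig_ips": {"significant_terms": {"field": "clientip"}}},
    }

    # Across all time-based indices: shards outside the time range match nothing
    # and so report no background stats for the IPs they hold.
    es.search(index="weblogs-*", body=body)

    # Against only the index covering the period: foreground and background stats
    # come from the same shards (the background is then scoped to that index).
    es.search(index="weblogs-2017.06.01", body=body)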