Count query (filter term) is returning inaccurate results - */_count
We also tried getting the count from a regular search query with "size":0,"track_total_hits":true, and the count is still inaccurate.
Both queries return the same (incorrect) count.
In both cases, the _shards section of the response is "_shards":{"total":1,"successful":1,"skipped":0,"failed":0}
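For reference, the two requests being compared would look roughly like the sketch below. The actual filter is not shown in this thread, so the field name `status` and the value `active` are placeholders, and the helper names are made up for illustration.

```python
import json


def count_body(field, value):
    """Body for POST /<index>/_count with a single term filter."""
    return {"query": {"bool": {"filter": [{"term": {field: value}}]}}}


def search_total_body(field, value):
    """Body for POST /<index>/_search that returns no hits, only an exact total."""
    body = count_body(field, value)
    body["size"] = 0                # do not return any documents
    body["track_total_hits"] = True  # ask for an exact hits.total.value
    return body


if __name__ == "__main__":
    # "status" / "active" are placeholder values, not taken from the thread.
    print(json.dumps(count_body("status", "active")))
    print(json.dumps(search_total_body("status", "active")))
```

The first body is read back from the `count` field of the response, the second from `hits.total.value`.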
We are running v7.13 on Elastic Cloud. Our index holds about 35m documents. We run a Staging and a Production server on the Cloud (with the same data), and the counts are wrong on both servers.
We don't do any real-time or continuous updates to the index: no updates, no deletes. We add all ~35m documents initially via bulk updates from background jobs, and make no further changes after that.
How do we know the counts are inaccurate? We were so confused by the discrepancies that we looped through every document using the scroll API and counted them manually. Our index and data look fine, but Elasticsearch is returning the wrong counts.
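A manual count of that kind could be sketched as follows. This is not the poster's actual script: `start_search` and `next_page` are hypothetical wrappers around whatever HTTP client is in use, standing in for POST /<index>/_search?scroll=1m and POST /_search/scroll respectively.

```python
def count_with_scroll(start_search, next_page):
    """Count documents by walking every scroll page.

    start_search()       -> first response dict (search opened with ?scroll=...)
    next_page(scroll_id) -> next response dict from the scroll endpoint
    Both callables are assumptions made for this sketch.
    """
    response = start_search()
    total = 0
    while True:
        hits = response["hits"]["hits"]
        if not hits:
            # An empty page means the scroll is exhausted.
            break
        total += len(hits)
        # Always use the most recent scroll id returned by the server.
        response = next_page(response["_scroll_id"])
    return total
```

Passing the same term filter in the initial search body makes the result directly comparable with the _count response.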
What does the output of the cat indices API look like? Have you refreshed the index after you finished bulk loading, especially if you optimized by changing the refresh interval during loading? What do the full queries and results look like?
What do you get if you run GET /<PUT INDEX PATTERN HERE>/_count?q=*:* ?
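The checks asked for above map onto three requests. A small sketch that builds them for a given index pattern is shown here; the helper name and the example pattern `my-index-*` are made up for illustration.

```python
from urllib.parse import quote


def diagnostic_requests(index_pattern):
    """Return (method, path) pairs for refresh, cat indices, and a match-all count."""
    # Keep wildcard and comma characters readable in the path.
    target = quote(index_pattern, safe="*,")
    return [
        ("POST", f"/{target}/_refresh"),        # make all bulk-loaded docs searchable
        ("GET", f"/_cat/indices/{target}?v"),   # docs.count / docs.deleted per index
        ("GET", f"/{target}/_count?q=*:*"),     # unfiltered count for comparison
    ]


if __name__ == "__main__":
    for method, path in diagnostic_requests("my-index-*"):
        print(method, path)
```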
I vaguely recall that an optimization was introduced recently that improved performance by cutting the query short, which could result in inaccurate counts. I haven't found the blog post yet, though, so I could very well be mistaken. I would recommend trying an aggregation with a filter to see if that gives accurate results.
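A filter aggregation along those lines might look like the sketch below. The field and value are placeholders because the original filter is not shown, and the aggregation name `matching` is arbitrary.

```python
def filter_agg_body(field, value, agg_name="matching"):
    """Body for POST /<index>/_search that counts matches via a filter aggregation."""
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "aggs": {agg_name: {"filter": {"term": {field: value}}}},
    }


def read_filter_agg_count(response, agg_name="matching"):
    """Pull the document count for the filter aggregation out of a search response."""
    return response["aggregations"][agg_name]["doc_count"]
```

If `doc_count` here disagrees with the _count result for the same term, that would point at the count path rather than at the data.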
I wonder if the track_total_hits parameter mentioned here makes a difference. Can you try setting this to true?