In elasticsearch v5/6 - there is a field on a search response hits.total. Is this field intended to be completely accurate (i.e. with a healthy cluster + no changing data should we always be able to retrieve exactly that number of hits with the relevant query)?
I can see that in version 7 of elasticsearch there is a new feature/option (track total hits) which can be enabled and may change the accuracy of the total hit count for performance reasons (but will tell you if you have received an accurate count or a lower bound). My inference from this is that pre v7, the count is intended to be reliable and exact, but I can't find this explicitly stated anywhere.
Context
We have an application which issues a terms search to elastic (querying some keywords fields). Our application has to retrieve every single hit for the specified search, so we do this by:
Opening a scroll for the search (noting the hits.total returned)
Requesting X hits from the scroll until we have retrieved exactly the number in hits.total
Recently we have experienced a cluster going into a state where the hits.total number does not line up with the number of hits we can retrieve. In these instances, restarting the nodes fixed the problem. What we'd like to know is if this is a valid thing to happen which we need to try and deal with in our application, or if it's an error (some bad caching or corrupted cluster) which we should can just detect and throw an error for. I found one open issue on github which looks it could be what we're experiencing, except it is relating to an ids query specifically.
Hi Will,
Is your client code checking for any failed shards in each of the scroll results?
The hits.total should be accurate and the scroll API should preserve a point-in-time view of the index for the duration of the scroll which is unaffected by any concurrent updating.
Thanks for the response - yes we return an error if there are failed shards at any point. For the moment we are just detecting when the numbers don't line up and raising an error since we need be sure we're returning all the hits.
Someone here has made a start at trying to replicate / diagnose this and has had some success with inducing the issue on a test cluster of 10 nodes. They loaded an index with replicas = 1 and then started executing lots of scroll requests in parallel. When a node is removed from the cluster, the issue with hits.total seems to begin happening and eventually goes away again. Will keep you updated if we find anything else.
A scroll will not fail-over to other replicas in the event of node loss. Part of the information in the scroll context relates to positions in the segments of a physical Lucene index. Replicas are intended to have the same set of documents but organise them independently into different physical Lucene segment files according to local merge operations - the files on-disk are not the same. This is why the current scroll offset in one replica is not translatable to that of another replica in the event of a failure.
That's interesting and makes sense given the behaviour of scroll. I think this explains half of what we were seeing, except the issue seemed to persist - e.g. we'd be opening a completely new scroll request and that one was hitting into the same issue. There's a possibility this is because before we added error handling for this issue, the scroll requests would keep spinning trying to find the remaining results and this prevented the cluster from properly repairing.
The other thing we noticed when this first happened was that we tried closing all open scrolls using DELETE /_search/scroll/_all and we got an acknowledgement back but could still see hundreds of open scrolls on the cluster afterwards.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.