I have a program that runs the above query to pull records via the scroll API. Recently I noticed that the count pulled by the program is sometimes lower than the count I get when I run the same query directly against the ES index. The behaviour is sporadic rather than consistent. I have seen GC running during runs in which records were lost, but records are not dropped every time GC runs. Is anyone aware of a particular scenario in which the scroll API drops records?
Is the index receiving new documents? Per the Scroll API docs:
Scroll API reflects the state of the index at the time of the initial search request. Subsequent indexing or document changes only affect later search and scroll requests.
How many results is your query returning? It is possible that you need to track total hits to ensure they are being counted accurately if you have more than 10,000 results.
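As a rough illustration of the point about total hits: in 7.x and later, `hits.total` in a search response is an object with a `value` and a `relation`, and a relation of `gte` means the count is only a lower bound (by default it is capped at 10,000 unless `track_total_hits` is set). Older versions return a plain integer. The helper name `read_total_hits` below is hypothetical, and the sample response is made up for the example.

```python
def read_total_hits(response):
    """Return (count, is_exact) from the hits.total of a search response."""
    total = response["hits"]["total"]
    if isinstance(total, dict):
        # 7.x+ shape: {"value": 10000, "relation": "eq" or "gte"}
        return total["value"], total["relation"] == "eq"
    # Older shape: a plain integer, which is always an exact count
    return total, True


# Made-up response for illustration: the count is a lower bound, not exact.
capped = {"hits": {"total": {"value": 10000, "relation": "gte"}, "hits": []}}
print(read_total_hits(capped))  # (10000, False)
```

If the second element is `False`, comparing this number against the count pulled by the scroll is not meaningful until the query is rerun with `track_total_hits` set to `true`.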
Hello @Carlos_D,
Yes, we are able to index new documents. The issue occurs both when the number of records (Hits.Hits) is below 10,000 and when it is above 10,000. In some cases the scroll even returns 0 results, which makes me think the issue is not related to the scroll API itself.
Note: I am unable to reproduce this behaviour consistently in the environment, so I suspect that a background process running at certain times is causing the issue. On that front, I checked details such as indexing time, merge time and fetch time, but nothing seems out of the ordinary.
If the index is not read-only, it is expected that a scroll and a separate query return different results, for the reason mentioned: the scroll keeps the view of the index as it was when the scrolling started.
Are you getting search failures, as in shard failures for your requests? That could explain the difference as well.
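One way to check for this from the client side, rather than from the server logs, is to inspect every search and scroll response. A search can come back successfully with partial results, in which case the `_shards` section reports a non-zero `failed` count and the top-level `timed_out` flag may be set. The sketch below assumes the response has already been parsed into a dict; the helper name `shard_problems` and the sample response are hypothetical.

```python
def shard_problems(response):
    """Collect human-readable reasons why a response may be incomplete."""
    problems = []
    if response.get("timed_out"):
        problems.append("search timed out; results may be partial")
    shards = response.get("_shards", {})
    failed = shards.get("failed", 0)
    if failed:
        problems.append(f"{failed} of {shards.get('total')} shards failed")
    # Each entry in "failures" names the shard, the index and a reason object.
    for failure in shards.get("failures", []):
        reason = failure.get("reason", {})
        problems.append(
            f"shard {failure.get('shard')} of {failure.get('index')}: "
            f"{reason.get('type')}"
        )
    return problems


# Made-up response for illustration: one of five shards failed.
partial = {
    "timed_out": False,
    "_shards": {
        "total": 5,
        "successful": 4,
        "failed": 1,
        "failures": [
            {"shard": 2, "index": "my-index", "reason": {"type": "some_exception"}}
        ],
    },
    "hits": {"hits": []},
}
for problem in shard_problems(partial):
    print(problem)
```

Logging the result for each page of the scroll would show whether the runs with missing records coincide with partial responses.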
hello @Carlos_D, thank you for your suggestions.
I do read and write on the same index. I agree that documents inserted during the read will not be available in the scroll, but in my case records that were inserted more than 1 hour ago are missing from the search results, and I have not changed the default refresh interval.
I have also checked the Elasticsearch logs and do not see any shard-related errors for the index that is giving me issues (though I do see some shard issues for other indices).
The version you are using is very old. I would recommend upgrading to at least 7.17 and switching to search_after with a point in time (PIT), as the docs recommend for consistency.
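For reference, a minimal sketch of the paging loop that search_after with a PIT implies. It assumes the PIT has already been opened (`POST /<index>/_pit?keep_alive=1m`) and that `search` is a hypothetical callable that sends a request body to `_search` and returns the parsed response. The function name `iter_pit_hits` and its parameters are illustrative and not part of any client library.

```python
def iter_pit_hits(search, pit_id, query, sort, page_size=1000, keep_alive="1m"):
    """Yield every hit for `query`, paging with search_after inside one PIT."""
    search_after = None
    while True:
        body = {
            "size": page_size,
            "query": query,
            "sort": sort,
            # The PIT id replaces the index name in the request.
            "pit": {"id": pit_id, "keep_alive": keep_alive},
        }
        if search_after is not None:
            body["search_after"] = search_after
        response = search(body)
        hits = response["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # The PIT id can change between requests; always use the latest one.
        pit_id = response.get("pit_id", pit_id)
        # The sort values of the last hit mark where the next page starts.
        search_after = hits[-1]["sort"]
```

Each page starts from the sort values of the last hit on the previous page, and the PIT keeps every page on the same view of the index. When the loop finishes, the PIT should be closed with `DELETE /_pit`, passing the id in the request body.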
@Christian_Dahlqvist, we are also facing the same issue of records getting dropped with the scroll API. We are able to reproduce the issue consistently when a sort on a date/float column is added to the scroll API request.
Thank you @Christian_Dahlqvist. We will try using the PIT. Any pointers on what might be causing this issue? It happens only when we sort on a date or number column.