Missing documents in scroll when there is GC on server

When using the scroll API to retrieve a large number of documents for batch processing, some documents are occasionally missing, usually a full page (exactly 10k). Looking at the timestamps of the affected queries, I notice they coincide with periods of high GC activity on the Elasticsearch server.

My assumption is that one of the shards fails to process the query in a timely manner due to this GC, and because of this the node handling the client connection replies only with documents from the other shards. Does this make sense? And if so, once I detect this failure on the client side in the _shards element, how can I recover from it?

Can you be more specific about whether the documents are simply missing or whether the response contained a failure? If the latter, can you provide a sample response?

Also, how long are those GCs?

Which Elasticsearch/JVM version are you on?

Sorry I wasn’t clear. So no, the response doesn’t contain any failure. More specifically, the Java client does not throw any exception when invoking ActionRequestBuilder.execute().actionGet(). However, I never checked the number of successful shards in the response, nor whether it timed out. I just saw that some hits are missing (hits.total from the first response is greater than the sum of the sizes of all hits.hits arrays).

The bug could of course be in the code that iterates over the results. But I couldn’t find any when proofreading it, and the fact that this issue is correlated with GC peaks on the server made me consider the following hypothesis: when a shard takes too long to reply, the client still gets a 200 response, with some hits missing and a non-zero _shards.failed. Is this possible?
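One way to test this hypothesis on the client side is to record, for every scroll page, the shard failure count and the timed-out flag alongside the number of hits, and compare the running total with hits.total at the end. The following is a minimal sketch of that bookkeeping only. ScrollPage and ScrollAudit are hypothetical helper classes, not part of the Elasticsearch client; in real code the four values would be copied from each search response (its total hits, hits array length, failed shard count and timed-out flag). The sample data in main is invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified view of one scroll page (not a real client class). */
final class ScrollPage {
    final long totalHits;    // hits.total as reported by this page
    final int hitsOnPage;    // length of the hits.hits array
    final int failedShards;  // _shards.failed
    final boolean timedOut;  // timed_out

    ScrollPage(long totalHits, int hitsOnPage, int failedShards, boolean timedOut) {
        this.totalHits = totalHits;
        this.hitsOnPage = hitsOnPage;
        this.failedShards = failedShards;
        this.timedOut = timedOut;
    }
}

final class ScrollAudit {
    /** Returns a description of every anomaly found; an empty list means none. */
    static List<String> audit(List<ScrollPage> pages) {
        List<String> problems = new ArrayList<>();
        if (pages.isEmpty()) {
            return problems;
        }
        long expected = pages.get(0).totalHits; // total announced by the first page
        long received = 0;
        for (int i = 0; i < pages.size(); i++) {
            ScrollPage page = pages.get(i);
            received += page.hitsOnPage;
            if (page.failedShards > 0) {
                problems.add("page " + i + ": " + page.failedShards + " failed shard(s)");
            }
            if (page.timedOut) {
                problems.add("page " + i + ": timed out");
            }
        }
        if (received != expected) {
            problems.add("received " + received + " of " + expected + " hits");
        }
        return problems;
    }

    public static void main(String[] args) {
        // Invented sample: the third page comes back empty with one failed shard.
        List<ScrollPage> pages = new ArrayList<>();
        pages.add(new ScrollPage(30000, 10000, 0, false));
        pages.add(new ScrollPage(30000, 10000, 0, false));
        pages.add(new ScrollPage(30000, 0, 1, false));
        for (String problem : audit(pages)) {
            System.out.println(problem);
        }
    }
}
```

If the missing pages line up with entries reporting failed shards or timeouts, that supports the GC hypothesis; if the hit count is short but no page reports a failure, the iteration code is the more likely suspect.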

According to our monitoring, 2 shards each spent 20s running GC in the 5 minutes around the last time this issue occurred. I don’t know whether that was one very long GC or twenty 1s ones (the same monitoring says there were 20 old-generation GCs in that timeframe).

Finally, we are sadly still on ES 1.7 (running on Java 1.8.0_131). The upgrade keeps being postponed.

Based on this information, that sounds like a good assumption.

Newer Elasticsearch versions offer the point-in-time feature, which allows searches to be repeated: Point in time API | Elasticsearch Guide [7.16] | Elastic

You could try and repeat the search with the same scroll search ID if there are shard failures (I do remember some problems with this in 1.7 but my memory is kinda blurred), so this may be worth a try.
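The retry idea could be wrapped roughly as follows. This is a sketch under assumptions: ScrollRetry, PageResult and fetchWithRetry are hypothetical names, and fetchPage stands in for whatever code reissues the scroll request with the same scroll ID. Whether a repeated scroll call on 1.7 actually returns the hits of the failed shards is not verified here, so the result should still be cross-checked against hits.total.

```java
import java.util.function.Supplier;

final class ScrollRetry {
    /** Hypothetical minimal result of one scroll call. */
    static final class PageResult {
        final int failedShards; // _shards.failed of the response
        final int hits;         // number of hits on the page

        PageResult(int failedShards, int hits) {
            this.failedShards = failedShards;
            this.hits = hits;
        }
    }

    /**
     * Calls fetchPage up to maxAttempts times and returns the first result
     * without failed shards. Waits backoffMillis * attempt between tries.
     * Throws IllegalStateException if every attempt reports failed shards.
     */
    static PageResult fetchWithRetry(Supplier<PageResult> fetchPage,
                                     int maxAttempts,
                                     long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        PageResult last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            last = fetchPage.get();
            if (last.failedShards == 0) {
                return last;
            }
            if (attempt < maxAttempts) {
                try {
                    // Linear backoff to give the GC-bound node time to recover.
                    Thread.sleep(backoffMillis * attempt);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", e);
                }
            }
        }
        throw new IllegalStateException("still " + last.failedShards
                + " failed shard(s) after " + maxAttempts + " attempts");
    }
}
```

Failing loudly after the last attempt is deliberate: a batch job that aborts is easier to rerun than one that silently processes an incomplete result set.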

Thanks a lot! Seems I now have one more point to promote the upgrade :slight_smile:
