Missing documents in scroll when there is GC on server

EtienneMiret · January 21, 2022, 3:18pm

When using the scroll API to retrieve a large number of documents for batch processing, some documents are occasionally missing, usually a full page (exactly 10k). Looking at the timestamps of the failing queries, I notice they match periods of high GC activity on the Elasticsearch server.

My assumption is that one of the shards fails to process the query in a timely manner because of this GC, and that the node handling the client connection therefore replies with only the documents from the other shards. Does this make sense? And if so, after seeing this failure client side in the _shards element, how can I recover from it?

spinscale · January 25, 2022, 12:59pm

Can you be more specific about whether the documents are simply missing or whether the response contained a failure? If the latter, can you provide a sample response?

Also, how long are those GCs?

Which Elasticsearch/JVM version are you on?

EtienneMiret · January 25, 2022, 2:43pm

Sorry I wasn’t clear. So no, the response doesn’t contain any failure. More specifically, the Java client does not throw any exception when invoking ActionRequestBuilder.execute().actionGet(). However, I never checked the number of successful shards in the response, nor whether it timed out. I just saw that some hits are missing (hits.total from the first response is greater than the sum of the sizes of all hits.hits arrays).
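The count comparison described above can be sketched as follows. This is only an illustration in Python over already-parsed JSON responses, not the Java client code from the original batch job; the function name `count_missing_hits` is hypothetical, and it assumes `hits.total` is a plain integer, as in 1.x responses.

```python
from typing import Any, Dict, Iterable


def count_missing_hits(pages: Iterable[Dict[str, Any]]) -> int:
    """Return hits.total from the first page minus the hits actually received.

    `pages` is every scroll response of one scroll, in order, as parsed JSON.
    Assumption: hits.total is an integer (1.x style); in 7.x it is an object
    with a "value" field and would need unwrapping first.
    """
    expected = None
    received = 0
    for page in pages:
        hits = page["hits"]
        if expected is None:
            expected = hits["total"]  # only the first response is trusted
        received += len(hits["hits"])
    return 0 if expected is None else expected - received
```

A non-zero result only says that hits were lost somewhere during the scroll; it does not say on which page or why.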

The bug could of course be in the code that iterates over the results. But I couldn’t find any when proofreading it, and the fact that this issue is correlated with GC peaks on the server made me consider the following hypothesis: when a shard takes too long to reply, the client still gets a 200 response, with some hits missing and a non-zero _shards.failed. Is this possible?
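If that hypothesis holds, each page would have to be inspected explicitly, since no exception is raised. A minimal sketch of such a check, again over the parsed JSON of a single response; `page_problems` is a hypothetical helper name, and the sketch assumes the response carries the `_shards.failed` and `timed_out` fields mentioned in this thread.

```python
from typing import Any, Dict, List


def page_problems(page: Dict[str, Any]) -> List[str]:
    """List the reasons why one scroll response may be incomplete.

    An empty list means neither a shard failure nor a timeout was reported.
    """
    problems = []
    shards = page.get("_shards", {})
    failed = shards.get("failed", 0)
    if failed > 0:
        problems.append(
            "%d of %s shards failed" % (failed, shards.get("total", "?"))
        )
    if page.get("timed_out", False):
        problems.append("search timed out")
    return problems
```

Logging the returned list next to the page number would at least confirm or rule out the hypothesis the next time documents go missing.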

According to our monitoring, 2 shards each spent 20s running GC in the 5 minutes around the last time this issue occurred. I don’t know whether that’s one very long GC or twenty 1s ones (the same monitoring says there were 20 old-generation GCs in that timeframe).

Finally, we are sadly still on ES 1.7 (running on Java 1.8.0_131). The upgrade keeps being postponed.

spinscale · January 25, 2022, 4:14pm

Based on the information this sounds like a good assumption.

With newer Elasticsearch versions, there is the Point-in-Time feature, which allows searches to be repeated (see "Point in time API" in the Elasticsearch Guide [7.16]).

You could try and repeat the search with the same scroll search ID if there are shard failures (I do remember some problems with this in 1.7 but my memory is kinda blurred), so this may be worth a try.
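The retry suggested above could look roughly like this. It is a sketch under several assumptions: `fetch_page` is a hypothetical callable that sends one scroll request for a given scroll ID and returns the parsed JSON response, `ScrollIncomplete` and `scroll_hits` are made-up names, and it is not confirmed that 1.7 returns the missing hits when the same scroll ID is sent again, so the result should still be verified against hits.total.

```python
from typing import Any, Callable, Dict, Iterator


class ScrollIncomplete(Exception):
    """Raised when a page still reports shard failures after all retries."""


def scroll_hits(
    scroll_id: str,
    fetch_page: Callable[[str], Dict[str, Any]],  # hypothetical transport
    max_retries: int = 3,
) -> Iterator[Dict[str, Any]]:
    """Yield hits page by page, repeating a page that reports shard failures.

    `scroll_id` is the ID returned by the initial search request.
    """
    while True:
        page = fetch_page(scroll_id)
        retries = 0
        while page["_shards"]["failed"] > 0:
            if retries == max_retries:
                raise ScrollIncomplete(
                    "shard failures persist for scroll ID %r" % scroll_id
                )
            retries += 1
            # Same scroll ID again; hits of the failed attempt are discarded.
            page = fetch_page(scroll_id)
        hits = page["hits"]["hits"]
        if not hits:
            return  # an empty page marks the end of the scroll
        for hit in hits:
            yield hit
        scroll_id = page["_scroll_id"]
```

Raising instead of silently continuing lets the batch job restart the whole scroll when the retry does not help.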

EtienneMiret · January 25, 2022, 4:52pm

Thanks a lot! Seems I now have one more argument to push for the upgrade.

system · February 22, 2022, 4:53pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
