When using the scroll API to retrieve a large number of documents for batch processing, some documents are sometimes missing, usually a full page (exactly 10k). Looking at the timestamps of the failing queries, I noticed they coincide with periods of high GC activity on the Elasticsearch server.
My assumption is that one of the shards fails to process the query in time because of this GC, so the node handling the client connection replies with only the documents from the other shards. Does this make sense? And if so, after seeing this failure client side in the _shards element, how can I recover from it?
Sorry I wasn’t clear. So no, the response doesn’t contain any failure. More specifically, the Java client does not throw any exception when invoking ActionRequestBuilder.execute().actionGet(). However, I never checked the number of successful shards in the response, nor whether it timed out. I just saw that some hits are missing (hits.total from the first response is greater than the sum of the sizes of all hits.hits arrays).
The bug could of course be in the code that iterates over the results. But I couldn’t find one when proofreading it, and the fact that this issue correlates with GC peaks on the server made me consider the following hypothesis: when a shard takes too long to reply, the client still gets a 200 response, with some hits missing and a non-zero _shards.failed. Is this possible?
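To test that hypothesis, each scroll page could be checked before its hits are consumed. Below is a minimal, self-contained sketch in plain Java. ScrollPage and ScrollPageCheck are hypothetical names invented for this example, not part of the Elasticsearch client. As far as I recall, the 1.x SearchResponse exposes the same information through getTotalShards(), getSuccessfulShards(), getFailedShards() and isTimedOut(), but please verify that against the 1.7 Javadoc.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, minimal view of one scroll response (not an Elasticsearch class). */
final class ScrollPage {
    final int totalShards;      // _shards.total
    final int successfulShards; // _shards.successful
    final int failedShards;     // _shards.failed
    final boolean timedOut;     // timed_out

    ScrollPage(int totalShards, int successfulShards, int failedShards, boolean timedOut) {
        this.totalShards = totalShards;
        this.successfulShards = successfulShards;
        this.failedShards = failedShards;
        this.timedOut = timedOut;
    }
}

/** Hypothetical helper that flags pages which may be missing hits. */
final class ScrollPageCheck {
    /** Returns a description of each problem found; an empty list means the page looks complete. */
    static List<String> problems(ScrollPage page) {
        List<String> out = new ArrayList<>();
        if (page.successfulShards < page.totalShards) {
            out.add("only " + page.successfulShards + " of " + page.totalShards
                    + " shards answered (" + page.failedShards + " reported as failed)");
        }
        if (page.timedOut) {
            out.add("search timed out, results may be partial");
        }
        return out;
    }

    /** Hits still unaccounted for once the scroll is exhausted (hits.total minus hits actually read). */
    static long missingHits(long expectedTotal, long hitsSeen) {
        return expectedTotal - hitsSeen;
    }
}
```

Logging the output of problems(...) for every page, together with the page's timestamp, would show whether the missing 10k hits line up with a partial response or whether the iteration code is at fault.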
According to our monitoring, 2 shards each spent 20s running GC in the 5 minutes around the last time this issue occurred. I don’t know whether that was one very long GC or twenty 1s ones (the same monitoring says there were 20 old-generation GCs in that timeframe).
Finally, we are sadly still on ES 1.7 (running on Java 1.8.0_131). The upgrade keeps being postponed.
If there are shard failures, you could try repeating the search with the same scroll ID. I do remember some problems with this in 1.7, but my memory is a bit blurred, so it may be worth a try.
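A rough sketch of that retry idea, in plain Java so it can be tried without a cluster. PageResult, ScrollRetry and the fetch function are hypothetical stand-ins for whatever wraps the real scroll call. Whether ES 1.7 actually returns the missed shard's hits when the same scroll ID is reused is an assumption that this thread does not confirm, so the document counts should still be verified afterwards.

```java
import java.util.function.Function;

/** Hypothetical result of one scroll request (not an Elasticsearch class). */
final class PageResult {
    final String scrollId;  // scroll ID returned with this page
    final int failedShards; // _shards.failed
    final int hits;         // number of entries in hits.hits

    PageResult(String scrollId, int failedShards, int hits) {
        this.scrollId = scrollId;
        this.failedShards = failedShards;
        this.hits = hits;
    }
}

final class ScrollRetry {
    /**
     * Calls fetch with the same scroll ID until a page comes back with no failed shards,
     * up to maxAttempts calls. Throws if every attempt reports shard failures, so that a
     * batch job stops loudly instead of silently skipping documents.
     */
    static PageResult fetchWithRetry(Function<String, PageResult> fetch,
                                     String scrollId, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            PageResult page = fetch.apply(scrollId);
            if (page.failedShards == 0) {
                return page;
            }
        }
        throw new IllegalStateException(
                "scroll page still had shard failures after " + maxAttempts + " attempts");
    }
}
```

In real code the fetch function would wrap the client's scroll request for the given ID. A short pause between attempts may also help if the failures come from GC pauses, but that is left out to keep the sketch small.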