Occasionally shards failing during scroll API (Scroll request has only succeeded on 270 (+0 skipped) shards out of 280)

Thijsvdp · June 21, 2024, 9:38am

Hi,

I am seeing some odd errors on our Elasticsearch cluster when using the scroll API, which I need for a data pipeline I created. Everything worked fine until recently, when I started encountering errors like the following:

Scroll request has only succeeded on 270 (+0 skipped) shards out of 280.

My index is green, and all shards are green. So I am not entirely sure what could be causing this. I have a special setup that may be of interest for debugging this problem:

Resources

10 Nodes: 5 nodes belong to group 1 (g1), and 5 nodes belong to group 2 (g2)
For each node:
- 256GB RAM (32GB Heap)
- 64 vCPU

Development Index

Indexed on g1
10.7 TB (280 shards, primary only)
1.6b documents

Production Index

Indexed on g2
21.5 TB (280 primary shards, each with 1 replica)
1.6b documents

It is good to note that the data on the development index and the production index should be the same. We sometimes switch the development and production indices for maintenance purposes.

Recently the production index started to fail with the above scroll error (Scroll request has only succeeded on 270 (+0 skipped) shards out of 280.). The development index happily continues without a problem. Again, everything seems green and we do not have any issues with normal queries. It seems to be only with scroll.
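One way to see which shards are failing, and why, is to inspect the `_shards` section that Elasticsearch includes in search and scroll responses. The sketch below is only an illustration: `summarize_shard_failures` is a hypothetical helper name, and the comparison of `successful + skipped` against `total` is an assumption about how the numbers in the error message are derived.

```python
from typing import Any, Dict, List, Tuple


def summarize_shard_failures(response: Dict[str, Any]) -> Tuple[bool, List[str]]:
    """Return (all_shards_ok, human-readable failure lines) for one response.

    `response` is the body of a search or scroll response, which carries a
    `_shards` object with `total`, `successful`, `skipped` and `failed`
    counters plus an optional `failures` list.
    """
    shards = response["_shards"]
    answered = shards["successful"] + shards.get("skipped", 0)
    all_ok = answered >= shards["total"]

    lines = []
    for failure in shards.get("failures", []):
        reason = failure.get("reason", {})
        lines.append(
            f"{failure.get('index')}[{failure.get('shard')}] on node "
            f"{failure.get('node')}: {reason.get('type')}: {reason.get('reason')}"
        )
    return all_ok, lines
```

Logging these lines for each page of a scroll would show whether the same shards or nodes fail every time, which a green cluster health status does not reveal.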

I have investigated the segments on both of the indices and found the following:

There are about 1.5 times as many segments on our production index (counting only on primaries).
Development index has 9738 segments
Production index has 15049 segments
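A count like this, restricted to primaries, can be reproduced from the index segments API. The following is a minimal sketch that assumes the usual shape of that response (`indices` → index name → `shards` → shard number → list of shard copies); `count_primary_segments` and `segment_ratio` are hypothetical helper names, not part of any client library.

```python
from typing import Any, Dict


def count_primary_segments(segments_response: Dict[str, Any], index: str) -> int:
    """Count segments on primary shard copies only.

    `segments_response` is the body returned by the index segments API,
    for example from `client.indices.segments(index=index)`.
    """
    shards = segments_response["indices"][index]["shards"]
    total = 0
    for copies in shards.values():          # one list of copies per shard number
        for copy in copies:                 # a primary and any replicas
            if copy["routing"]["primary"]:
                total += len(copy["segments"])
    return total


def segment_ratio(prod_segments: int, dev_segments: int) -> float:
    """Ratio of production to development segment counts."""
    return prod_segments / dev_segments
```

Calling `segment_ratio(15049, 9738)` with the two counts quoted above gives the factor between the indices directly.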

I would expect both indices to have roughly the same number of segments, as Elasticsearch should automatically merge segments at some point. I created a histogram of the segment sizes, and they look fairly similarly distributed in both indices:

Development index
[Image: segments_dev histogram]

Production index
[Image: segments_prod histogram]

Any suggestions or ideas of what is going on?

Christian_Dahlqvist · June 21, 2024, 9:53am

Which version of Elasticsearch are you using?

Thijsvdp · June 21, 2024, 9:55am

@Christian_Dahlqvist I am using the latest version. Just upgraded a few days ago to 8.14.0.

Christian_Dahlqvist · June 21, 2024, 10:00am

Why are you using the scroll API instead of search_after with a point in time (PIT), as recommended in the documentation?

Thijsvdp · June 21, 2024, 10:12am

I am using the .scan() method in the elasticsearch-py client. I did not realize that using scroll was actually advised against. I will implement it with PIT and check whether that helps. I also thought that not using scroll was a matter of efficiency. Is there any particular reason not to use scroll?
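For reference, a PIT-based replacement for a scan-style loop could look roughly like the sketch below. It assumes an elasticsearch-py 8.x client passed in as `client`; `scan_with_pit`, the page size, and the keep-alive value are illustrative choices, not the code actually used in this thread.

```python
from typing import Any, Dict, Iterator, Optional


def scan_with_pit(
    client: Any,
    index: str,
    query: Optional[Dict[str, Any]] = None,
    page_size: int = 1000,
    keep_alive: str = "5m",
) -> Iterator[Dict[str, Any]]:
    """Yield every hit for `query` using a point in time plus search_after."""
    # Open a point in time so all pages see the same snapshot of the index.
    pit_id = client.open_point_in_time(index=index, keep_alive=keep_alive)["id"]
    search_after = None
    try:
        while True:
            kwargs: Dict[str, Any] = dict(
                size=page_size,
                query=query or {"match_all": {}},
                pit={"id": pit_id, "keep_alive": keep_alive},
                # _shard_doc is the tiebreaker sort available with PIT searches.
                sort=[{"_shard_doc": "asc"}],
            )
            if search_after is not None:
                kwargs["search_after"] = search_after
            resp = client.search(**kwargs)
            pit_id = resp["pit_id"]          # the PIT id may change between pages
            hits = resp["hits"]["hits"]
            if not hits:
                break
            yield from hits
            # Resume after the sort values of the last hit on this page.
            search_after = hits[-1]["sort"]
    finally:
        # Release the point in time even if the consumer stops early.
        client.close_point_in_time(id=pit_id)
```

Usage would be along the lines of `for hit in scan_with_pit(es, "my-index"): ...`, where `es` is an `Elasticsearch` client instance and `"my-index"` is a placeholder index name.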

Thijsvdp · June 21, 2024, 12:33pm

Thanks a lot. It did solve the issues!
