Occasionally shards failing during scroll API (Scroll request has only succeeded on 270 (+0 skipped) shards out of 280)

Hi,

I am seeing odd errors on our Elasticsearch cluster when using the scroll API, which I need for a data pipeline I built. Everything worked fine until recently, when I started encountering errors like the following:

Scroll request has only succeeded on 270 (+0 skipped) shards out of 280.

My index is green, and all shards are green. So I am not entirely sure what could be causing this. I have a special setup that may be of interest for debugging this problem:

Resources

  • 10 Nodes: 5 nodes belong to group 1 (g1), and 5 nodes belong to group 2 (g2)
  • For each node:
    • 256GB RAM (32GB Heap)
    • 64 vCPU

Development Index

  • Indexed on g1
  • 10.7 TB (280 shards, primary only)
  • 1.6b documents

Production Index

  • Indexed on g2
  • 21.5 TB (280 primary shards, each with 1 replica)
  • 1.6b documents

It is good to note that the data on the development index and the production index should be the same. We sometimes switch the development and production indices for maintenance purposes.

Recently the production index started to fail with the above scroll error (Scroll request has only succeeded on 270 (+0 skipped) shards out of 280.). The development index happily continues without a problem. Again, everything seems green and we do not have any issues with normal queries. It seems to be only with scroll.
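A rough sketch of how the failing shards could be inspected: search and scroll responses carry a `_shards` section, and when shards fail it normally includes a `failures` list with the index, shard number and a `reason` object. The helper name `summarize_shard_failures` is made up for illustration, and it only reads a response body that has already been fetched.

```python
from collections import Counter
from typing import Any, Dict, Mapping


def summarize_shard_failures(response: Mapping[str, Any]) -> Dict[str, Any]:
    """Summarize the `_shards` section of a search or scroll response body."""
    shards = response.get("_shards", {})
    failures = shards.get("failures", [])
    # Group failures by the exception type reported under each failure's "reason".
    by_type = Counter(
        failure.get("reason", {}).get("type", "unknown") for failure in failures
    )
    return {
        "total": shards.get("total", 0),
        "successful": shards.get("successful", 0),
        "skipped": shards.get("skipped", 0),
        "failed": shards.get("failed", 0),
        # (index, shard) pairs, in the order the response listed them.
        "failed_shards": [(f.get("index"), f.get("shard")) for f in failures],
        "failure_types": dict(by_type),
    }
```

Calling this on each page of a scroll and logging the result whenever `failed` is non-zero should show which shards drop out and which exception type they report.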

I have investigated the segments on both of the indices and found the following:

  • There are about 1.5 times as many segments on our production index (counting primaries only).
  • Development index has 9738 segments
  • Production index has 15049 segments
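For anyone who wants to reproduce counts like these, here is a minimal sketch. It assumes the rows come from the cat segments API in JSON form, where each row has an `index` field and a `prirep` field (`p` for primary, `r` for replica). The function name `count_primary_segments` and the index names in the comment are hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_primary_segments(rows: Iterable[Mapping[str, str]]) -> Counter:
    """Count segments per index, using only rows from primary shards."""
    counts: Counter = Counter()
    for row in rows:
        if row.get("prirep") == "p":
            counts[row["index"]] += 1
    return counts


# With elasticsearch-py the rows could be fetched roughly like this
# (index names are placeholders):
#   rows = client.cat.segments(index="dev-index,prod-index", format="json")
#   print(count_primary_segments(rows))

# Ratio implied by the two counts quoted above (production / development).
ratio = 15049 / 9738
print(f"production has {ratio:.2f}x the segments of development")
```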

I would expect both indices to have roughly the same number of segments, since Elasticsearch should merge segments automatically at some point. I created a histogram of the segment sizes, and they look fairly similarly distributed in both indices:

Development index: [image: segments_dev, histogram of segment sizes]

Production index: [image: segments_prod, histogram of segment sizes]

Any suggestions or ideas of what is going on?

Which version of Elasticsearch are you using?

@Christian_Dahlqvist I am using the latest version. Just upgraded a few days ago to 8.14.0.

Why are you using the scroll API instead of search_after with a point in time (PIT), as recommended in the documentation?

I am using the .scan() method in the elasticsearch-py client. I did not realize that using scroll is advised against. I will implement it with PIT and check if that helps. I had assumed that avoiding scroll was only a matter of efficiency. Is there any particular reason not to use scroll?
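For reference, a minimal sketch of what paging with PIT and search_after could look like. It assumes `client` is an elasticsearch-py 8.x `Elasticsearch` instance and uses its `open_point_in_time`, `search` and `close_point_in_time` methods. The function name `iter_hits_with_pit`, the page size and the keep-alive value are illustrative choices, not taken from this thread.

```python
from typing import Any, Dict, Iterator, Optional


def iter_hits_with_pit(
    client: Any,
    index: str,
    query: Optional[Dict[str, Any]] = None,
    page_size: int = 1000,
    keep_alive: str = "5m",
) -> Iterator[Dict[str, Any]]:
    """Yield every hit for `query`, paging with a point in time and search_after."""
    pit_id = client.open_point_in_time(index=index, keep_alive=keep_alive)["id"]
    search_after = None
    try:
        while True:
            kwargs: Dict[str, Any] = {
                "size": page_size,
                "query": query or {"match_all": {}},
                # The index is implied by the PIT, so it is not passed again here.
                "pit": {"id": pit_id, "keep_alive": keep_alive},
                # _shard_doc is the tiebreaker sort available for PIT searches.
                "sort": [{"_shard_doc": "asc"}],
            }
            if search_after is not None:
                kwargs["search_after"] = search_after
            resp = client.search(**kwargs)
            # The PIT id can change between requests, so always use the latest one.
            pit_id = resp["pit_id"]
            if resp["_shards"]["failed"]:
                raise RuntimeError(f"{resp['_shards']['failed']} shard(s) failed")
            hits = resp["hits"]["hits"]
            if not hits:
                break
            yield from hits
            # Continue after the sort values of the last hit on this page.
            search_after = hits[-1]["sort"]
    finally:
        client.close_point_in_time(id=pit_id)
```

Used as `for hit in iter_hits_with_pit(client, "my-index"): ...`, where `my-index` is a placeholder. Raising on any failed shard is a deliberate choice in this sketch so that partial pages are not silently accepted.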

Thanks a lot. It did solve the issues!