Hi,
I am seeing some strange errors on our Elasticsearch cluster when using the scroll API, which I need for a data pipeline I created. Everything worked fine until recently, but now I keep encountering errors like the following:
Scroll request has only succeeded on 270 (+0 skipped) shards out of 280.
My index is green, and all shards are green. So I am not entirely sure what could be causing this. I have a special setup that may be of interest for debugging this problem:
Resources
- 10 Nodes: 5 nodes belong to group 1 (g1), and 5 nodes belong to group 2 (g2)
- For each node:
  - 256 GB RAM (32 GB heap)
  - 64 vCPU
Development Index
- Indexed on g1
- 10.7 TB (280 shards, primary only)
- 1.6b documents
Production Index
- Indexed on g2
- 21.5 TB (280 primary shards plus 1 replica of each)
- 1.6b documents
Note that the data in the development index and the production index should be the same. We sometimes switch the development and production indices for maintenance purposes.
Recently the production index started to fail with the scroll error above (Scroll request has only succeeded on 270 (+0 skipped) shards out of 280.). The development index happily continues without a problem. Again, everything is green and normal queries work without any issues. The problem seems to occur only with scroll.
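To narrow down which shards fail and why, the per-shard failures reported in the response can be grouped by node and reason. The following is only a minimal sketch: it assumes the raw scroll response is available as a parsed JSON dict with the usual `_shards` section, and the function name `summarize_shard_failures`, the node name, and the exception type in the example are made up for illustration.

```python
from collections import Counter


def summarize_shard_failures(response):
    """Group the shard failures of a parsed search/scroll response.

    Returns (total_shards, successful_shards, failures), where failures
    is a Counter keyed by (node, reason type).
    """
    shards = response.get("_shards", {})
    failures = Counter()
    for failure in shards.get("failures", []):
        reason = failure.get("reason", {})
        failures[(failure.get("node"), reason.get("type"))] += 1
    return shards.get("total", 0), shards.get("successful", 0), failures


# Hypothetical response fragment, invented for illustration only.
example_response = {
    "_shards": {
        "total": 280,
        "successful": 270,
        "skipped": 0,
        "failed": 10,
        "failures": [
            {"shard": 3, "index": "my-index", "node": "node-a",
             "reason": {"type": "example_exception", "reason": "..."}},
            {"shard": 7, "index": "my-index", "node": "node-a",
             "reason": {"type": "example_exception", "reason": "..."}},
        ],
    }
}

total, successful, failures = summarize_shard_failures(example_response)
for (node, reason_type), count in failures.items():
    print(f"{node}: {reason_type} x{count}")
```

If all failures cluster on one node or share one reason type, that would at least point to where to look next.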
I have investigated the segments on both of the indices and found the following:
- The production index has about 1.5 times as many segments as the development index (counting primaries only):
  - Development index: 9738 segments
  - Production index: 15049 segments
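A count like this can be reproduced from the output of the cat segments API. The sketch below is only an illustration: it assumes the rows come from `GET _cat/segments/<index>?format=json`, where each row carries a `prirep` field (`p` for primary, `r` for replica), and the function name and sample rows are hypothetical.

```python
from collections import Counter


def count_primary_segments(rows):
    """Return (total, per_shard) segment counts for primary shards only."""
    per_shard = Counter()
    for row in rows:
        # Skip replica copies so both indices are compared on primaries.
        if row.get("prirep") == "p":
            per_shard[row["shard"]] += 1
    return sum(per_shard.values()), per_shard


# Hypothetical rows, invented for illustration only.
sample_rows = [
    {"index": "my-index", "shard": "0", "prirep": "p", "segment": "_0"},
    {"index": "my-index", "shard": "0", "prirep": "p", "segment": "_1"},
    {"index": "my-index", "shard": "0", "prirep": "r", "segment": "_0"},
    {"index": "my-index", "shard": "1", "prirep": "p", "segment": "_0"},
]

total, per_shard = count_primary_segments(sample_rows)
print(total, dict(per_shard))
```

The per-shard breakdown can also show whether the extra segments are spread evenly or concentrated on a few shards.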
I would expect both indices to have roughly the same number of segments, since Elasticsearch should merge segments automatically at some point. I created a histogram of the segment sizes for each index, and the sizes look fairly similarly distributed in both:
[Histogram of segment sizes: development index]
[Histogram of segment sizes: production index]
Any suggestions or ideas of what is going on?