Upper limits on replication and recovery heap usage

Hello,
I'm running into a strange situation with elasticsearch. I would like to estimate the Xmx needed for running elasticsearch data nodes while accounting for the heap needed during recovery.

All my documents are small (in KBs), and the writers which index the data into elasticsearch use small bulk requests (of a maximum 5000 documents).

However, when an index goes yellow for some time (due to one node going down and then coming back online), the recovery process sends large requests and starts tripping circuit breakers. The request sizes grow and the index never recovers to green.

I would like to budget for these large requests coming in from the recovery process which seem to be reserving several hundreds of MBs from the circuit breaker. Is there an upper limit to these? In the worst case would an entire shard be sent as a single bulk request across to the other node during replication?

Thank you.

failed to perform indices:data/write/bulk[s] on replica [xx][0], node[wy6bebBaQOC85iEi5vnJrA], [R], s[STARTED], a[id=ZxIAa9ZPRgiKxvT42TUCNg]" ,
"stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [elasticsearch-data-1][100.64.89.127:9300][indices:data/write/bulk[s][r]]",

CircuitBreakingException: [parent] Data too large, data for [<transport_request>] .. new bytes reserved: [454588504/433.5mb].

This isn't recovery traffic, it's a normal bulk indexing request. Recovery actions have names that start with internal:index/shard/recovery/

Recovery actions are limited in size to 512kB and there's no more than 5 of them in flight at once, so that's a max heap usage of under 3MB per recovery. There will be some amount overhead on top of that but it's certainly not several hundred MB.

1 Like

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.