Whenever I restart my cluster completely, do a rolling restart, or a node simply disconnects from the cluster due to a network hiccup, a lengthy recovery process starts that can take multiple hours to complete. All shards on the (restarting/recovering) node become unassigned and, once the node is back online, are reassigned or 'initialized' a few at a time. The cluster state is yellow during this time because enough replicas remain to keep the service up.
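For context, the recovery progress can be followed with the cluster health and cat recovery APIs; below is a minimal sketch using the low-level Java REST client (the localhost:9200 host is a placeholder for one of our nodes):

```java
import java.util.Collections;

import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class WatchRecovery {
    public static void main(String[] args) throws Exception {
        // "localhost:9200" is a placeholder for one of the data nodes.
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Overall status (yellow/green) plus the number of initializing and unassigned shards.
            Response health = client.performRequest("GET", "/_cluster/health");
            System.out.println(EntityUtils.toString(health.getEntity()));

            // Per-shard recovery progress, limited to recoveries that are still running.
            Response recovery = client.performRequest("GET", "/_cat/recovery",
                    Collections.singletonMap("active_only", "true"));
            System.out.println(EntityUtils.toString(recovery.getEntity()));
        }
    }
}
```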
However, during this time, creating new indices or updating the mappings of existing indices often hits a 30s timeout with the following message:
Caused by: java.io.IOException: listener timeout after waiting for [30000] ms
Our application updates the mappings of all indices on startup, meaning that if we ever need to restart both our ES cluster and our application, we have an outage of multiple hours!
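As far as I can tell, the 30000 ms is the default maxRetryTimeout (and socket timeout) of the low-level Java REST client, so one workaround would be to raise it on the client and to pass an explicit master_timeout/timeout on the put-mapping request itself. A minimal sketch of what I mean; the index name "customer-index", type "doc", field "new_field" and the 120s value are made up:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;

public class PutMappingWithLongerTimeouts {
    public static void main(String[] args) throws Exception {
        RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200, "http"))
                // Raise the listener ("max retry") timeout above its 30s default.
                .setMaxRetryTimeoutMillis(120_000)
                // Raise the socket timeout as well so the connection does not give up first.
                .setRequestConfigCallback(rc -> rc.setConnectTimeout(5_000).setSocketTimeout(120_000));

        try (RestClient client = builder.build()) {
            // Placeholder mapping update for a placeholder index/type.
            String mapping = "{ \"properties\": { \"new_field\": { \"type\": \"keyword\" } } }";
            NStringEntity body = new NStringEntity(mapping, ContentType.APPLICATION_JSON);

            Map<String, String> params = new HashMap<>();
            // master_timeout: how long to wait for the master to process the mapping change.
            params.put("master_timeout", "120s");
            // timeout: how long to wait for acknowledgement from the nodes.
            params.put("timeout", "120s");

            Response response = client.performRequest("PUT", "/customer-index/_mapping/doc", params, body);
            System.out.println(response.getStatusLine());
        }
    }
}
```

That would only paper over the problem, though, since the mapping updates would still be slow while the cluster is recovering.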
We have the following cluster setup in one of our production environments:
- Elasticsearch 6.2.4
- 250 indices * 20 shards * 3 replicas = ~15k shards
- 3 dedicated master eligible nodes
- 8 data nodes w/ 10 CPU cores and 36GB memory each, of which 12GB is heap
- Spread over 4 hypervisors with local SSD storage
- 150M documents @ ~2 TB, including all replicas
- Index size varies between empty and 100GB
My cluster state size (compressed) is about 1MB, according to _cluster/state/master_node/: "compressed_size_in_bytes": 846247.
Indices are static and per-customer, and index size varies greatly between customers. The index mappings are fairly complex, with 100-1000 fields per index including many nested fields. Dynamic index and mapping creation is disabled; each index has its own unique mapping maintained by our application.
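To make that concrete, the indices are created along these lines (a sketch, not our actual code; the index name "customer-acme" and the fields are made up, and the real mappings are far larger):

```java
import java.util.Collections;

import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class CreateStaticCustomerIndex {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // "dynamic": "strict" rejects documents containing fields that are not in the mapping,
            // so every field has to be added through an explicit put-mapping call by our application.
            // Replica count is omitted in this sketch.
            String body = "{"
                    + "  \"settings\": { \"number_of_shards\": 20 },"
                    + "  \"mappings\": {"
                    + "    \"doc\": {"
                    + "      \"dynamic\": \"strict\","
                    + "      \"properties\": {"
                    + "        \"name\":   { \"type\": \"keyword\" },"
                    + "        \"orders\": {"
                    + "          \"type\": \"nested\","
                    + "          \"properties\": { \"sku\": { \"type\": \"keyword\" } }"
                    + "        }"
                    + "      }"
                    + "    }"
                    + "  }"
                    + "}";
            Response response = client.performRequest(
                    "PUT", "/customer-acme",
                    Collections.<String, String>emptyMap(),
                    new NStringEntity(body, ContentType.APPLICATION_JSON));
            System.out.println(response.getStatusLine());
        }
    }
}
```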
I am not sure where these timeouts are coming from. When I monitor the load, everything seems well within normal limits and no single machine is overloaded. This is with nearly all settings at their default values.
I was able to speed up recovery slightly by raising indices.recovery.max_bytes_per_sec to 200mb (up from 40mb) and cluster.routing.allocation.node_concurrent_recoveries to 8 (up from 2), which does put the entire cluster under very high load. This, however, does not seem to affect the timeouts on mapping updates.
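For reference, both settings can be applied dynamically as transient cluster settings, roughly like this (same low-level Java REST client sketch as above; the host is a placeholder):

```java
import java.util.Collections;

import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class TuneRecoverySettings {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Transient settings are lost on a full cluster restart; "persistent" would survive it.
            String body = "{"
                    + "  \"transient\": {"
                    + "    \"indices.recovery.max_bytes_per_sec\": \"200mb\","
                    + "    \"cluster.routing.allocation.node_concurrent_recoveries\": 8"
                    + "  }"
                    + "}";
            Response response = client.performRequest(
                    "PUT", "/_cluster/settings",
                    Collections.<String, String>emptyMap(),
                    new NStringEntity(body, ContentType.APPLICATION_JSON));
            System.out.println(response.getStatusLine());
        }
    }
}
```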
What can I do to make sure that my cluster remains functional in its yellow state? Are there ways to reduce the time it needs to recover? Why does it take so long to 'initialize' shards on a node that already has them and was only disconnected or restarting for a few moments? Why won't the cluster process my mapping updates instead of timing out? And what is actually keeping the cluster so busy during this time, when the load on the individual nodes doesn't show anything shocking?