We have a six-node ElasticSearch cluster with 853 indexes, 5,269 primary shards and 9bn documents in around 6.3TB of data.
After a node restart, the index recovery process takes around 15-30 minutes (depending on indexing load). However, what we are finding now is that for around 5-10 minutes during the rebuild process, ElasticSearch essentially becomes unusable.
For example, this is the Kibana data for one of our LogStash indexes for the most recent period. Note that during the recovery period itself, Kibana was unusable.
Any index-based API request either times out or returns a 503 error. Tailing the ElasticSearch log files on the hosts does not show anything (i.e. no new log entries during the problematic period). Parsing /_recovery, I can see that 261 shards were recovered during the most recent period.
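For reference, this is roughly how I'm pulling numbers out of /_recovery (a minimal sketch rather than my exact script; it assumes a node reachable on localhost:9200 and the Python `requests` library):

```python
# Sketch of parsing /_recovery; assumes a node at localhost:9200 and the
# `requests` library. Counts shards reported by the API and how many are
# still in a non-DONE stage.
import requests

resp = requests.get("http://localhost:9200/_recovery", timeout=30)
resp.raise_for_status()

total = still_recovering = 0
for index_name, info in resp.json().items():
    for shard in info.get("shards", []):
        total += 1
        if shard.get("stage") != "DONE":
            still_recovering += 1

print("shards in /_recovery: %d (still recovering: %d)" % (total, still_recovering))
```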
I have experimented with the indices.recovery settings, either to try to speed up the recovery or to slow it down and reduce system load, but the only thing that seems to achieve is enlarging or shrinking the "hole" created by ElasticSearch being unresponsive.
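The changes I have been experimenting with are applied as transient cluster settings; a minimal sketch of that kind of change is below (the specific keys and values are examples rather than exactly what I used, and it again assumes localhost:9200 and the Python `requests` library):

```python
# Sketch of a transient recovery-tuning change; the values are illustrative,
# not a recommendation. Assumes localhost:9200 and the `requests` library.
import json
import requests

settings = {
    "transient": {
        # per-node recovery bandwidth cap (raise to speed up, lower to throttle)
        "indices.recovery.max_bytes_per_sec": "40mb",
        # how many shard recoveries a single node handles at once
        "cluster.routing.allocation.node_concurrent_recoveries": 2,
    }
}

resp = requests.put(
    "http://localhost:9200/_cluster/settings",
    data=json.dumps(settings),
    headers={"Content-Type": "application/json"},
)
print(resp.json())
```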
System stats show nothing interesting. No CPU load, no excessive IOPS, no excessive network utilisation, nothing out of the ordinary at all.
I can't tell you if this is something related to 2.3 specifically or not. We did do a rolling upgrade from 2.1.1 to 2.3.4 two weeks ago; however, we were not monitoring for this issue prior to the upgrade (so it may or may not have been happening, and tracking down a 10-minute period with no documents at a random time in the past is problematic).
In the next few weeks we intend to add a new indexing workload to ElasticSearch which will insert around 1.5bn-2bn documents per week (two orders of magnitude more than we are indexing currently), which will only exacerbate this problem.
Is anyone aware of this kind of behaviour from ElasticSearch? Is there anything I should specifically look at tuning, or some more logs I can dig out?