ES 2.3.4 unresponsive during index recovery

We have a six-node ElasticSearch cluster with 853 indexes, 5269 primary shards and around 9bn documents in roughly 6.3TB of data.

After a node restart, the index recovery process takes around 15-30 minutes (depending on indexing load). However, for around 5-10 minutes during that rebuild, ElasticSearch essentially becomes unusable.

For example, this is the Kibana data for one of our LogStash indexes over the most recent occurrence. Note that Kibana itself was unusable during the affected period.

Any index-based API request either times out or returns a 503 error. Tailing the ElasticSearch log files on the hosts shows nothing (i.e. no new log entries during the problematic period). Parsing /_recovery, I can see that 261 shards were recovered during the most recent period.
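For reference, this is roughly how I'm checking recovery activity - a minimal Python sketch using the requests library, assuming a node reachable on localhost:9200:

```python
# Count active shard recoveries via the /_recovery API.
# Assumes the `requests` library and a node on localhost:9200.
import requests

resp = requests.get("http://localhost:9200/_recovery",
                    params={"active_only": "true"})
recovery = resp.json()  # {index_name: {"shards": [...]}, ...}

active = sum(len(info["shards"]) for info in recovery.values())
print("indices with active recoveries:", len(recovery))
print("shards currently recovering:   ", active)
```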

I have experimented with the indices.recovery settings, either to speed the recovery up or to slow it down and reduce system load, but the only thing that seems to achieve is enlarging or shrinking the "hole" created by ElasticSearch being unresponsive.
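To give an idea, the kind of transient settings change I've been trying looks roughly like this - the values are purely illustrative, not a recommendation, and again a node on localhost:9200 is assumed:

```python
# Adjust indices.recovery throttling settings cluster-wide (transiently).
# The values below are examples only.
import requests

settings = {
    "transient": {
        # Per-node recovery bandwidth cap (default is 40mb in 2.x).
        "indices.recovery.max_bytes_per_sec": "80mb",
        # Number of network streams used per recovering shard.
        "indices.recovery.concurrent_streams": 4,
    }
}

resp = requests.put("http://localhost:9200/_cluster/settings", json=settings)
print(resp.json())
```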

System stats show nothing interesting. No CPU load, no excessive IOPS, no excessive network utilisation, nothing out of the ordinary at all.

I can't tell you whether this is something specific to 2.3 or not. We did a rolling upgrade from 2.1.1 to 2.3.4 two weeks ago, but we were not monitoring for this issue before the upgrade, so it may or may not have been happening already - and tracking down a 10-minute gap in documents at some random point in the past is problematic.

In the next few weeks we intend to point a new indexing workload at ElasticSearch that will insert around 1.5bn-2bn documents per week (two orders of magnitude more than we are indexing currently), which will only exacerbate the problem.

Is anyone aware of this kind of behaviour from ElasticSearch? Is there anything I should specifically look at tuning, or some more logs I can dig out?

What are the specs of your nodes?

That shard count seems way too high - you should look at reducing it.

Thanks for your response. The six servers are identical, and are as follows:

CPU: 2x Xeon E5-2660 v2 (2.2GHz, 10 cores, 20 threads)
Memory: 192GB DDR3 (1600MHz)
Data drives: 3x RAID-0 arrays, each consisting of 4x 4TB Western Digital Black (total of 48TB per server)
OS drives: RAID-1 array consisting of 2x 300GB SAS disks
Network: 2x 10Gbps interfaces

As for the primary shards, the reason the count is so high is that we have one primary shard per host, and multiple daily indexes. We have two LogStash indexes broken out by day, and soon we're going to add a third daily index that will dwarf the existing two in size - around 200-300 million documents per day.
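For illustration, that layout is pinned with an index template along these lines (the template and index names here are made up, and a node on localhost:9200 is assumed):

```python
# Hypothetical index template giving each daily LogStash index one primary
# shard per host (six nodes => number_of_shards: 6). Names are illustrative.
import requests

template = {
    "template": "logstash-*",       # matches the daily LogStash indexes
    "settings": {
        "number_of_shards": 6,      # one primary shard per node
    },
}

resp = requests.put("http://localhost:9200/_template/logstash_daily",
                    json=template)
print(resp.json())
```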

Data is going back around 8 months (we lost a bunch of data going from 1.x to 2.x in January which we did not re-index, but we don't want to purge this data again if we can help it).

Suggestions on how we should reduce our shard count are welcome, but ultimately all we would be doing is putting the same amount of data into fewer shards, which increases the size of each individual shard.

Can you offer some insight into how reducing the shard count might help with this issue? I will certainly attempt it, though - perhaps by rolling up some of our older daily indexes into weekly indexes.
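In case it's useful to anyone following along, the roll-up I have in mind would look something like this - the index names are hypothetical, and it relies on the reindex API that shipped in 2.3:

```python
# Roll several hypothetical daily LogStash indexes into one weekly index
# using the _reindex API (available from ES 2.3). Assumes `requests` and a
# node on localhost:9200.
import requests

daily_indexes = ["logstash-2016.08.01", "logstash-2016.08.02"]  # ...etc.
weekly_index = "logstash-2016-w31"

for source in daily_indexes:
    body = {
        "source": {"index": source},
        "dest": {"index": weekly_index},
    }
    resp = requests.post("http://localhost:9200/_reindex", json=body)
    print(source, "->", weekly_index, resp.json().get("total"))
    # Once the documents are verified in the weekly index, the daily index
    # can be deleted, reducing the shard count.
```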

Shards take a certain amount of resources just to exist and be managed (and thus to be recovered), so reducing the shard count helps keep resource use under control.
Obviously, the larger the shard, the longer it takes to recover, but having lots of small shards wastes resources.

Yep!