Cluster crash after exceeding cluster.max_shards_per_node

With the introduction of the cluster.max_shards_per_node setting, we found on upgrading 6.8->7.x that we had to prune a lot of old indices to bring our shard count down. At the weekend, we had a cluster outage (QA, thankfully) as new indices tipped the count over 1000/n again:

[2019-07-28T20:00:00,134][ERROR][c.f.s.a.s.InternalESSink ] [esXXX1] Unable to index audit log { ... } due to
java.lang.IllegalArgumentException: Validation Failed: 1: this action would add [2] total shards, but this cluster currently has [6000]/[6000] maximum shards open;

(EDT timezone - 20:00:00 being 00:00:00Z when date-based index rollovers would be creating a bunch of new shards)

Over the course of several hours into the next day, all 6 data nodes actually crashed, even though only WARN level and lower were logged - the last things in all the data node logs were either this maximum shards warning or familiar gc INFO lines. By the time we noticed and investigated, no java service processes were running. Subsequently the master nodes were trying to field requests and logging that there were no ingest nodes available. As the intent of the limit is to "prevent operations which may unintentionally destabilize the cluster" this doesn't seem like a good failure mode.

Increasing the limit (for now - until we determine how to reduce our active shard usage) let us restart, but recovery was slow, and stalled with 1000-odd unallocated shards. We had to issue /_cluster/reroute?retry_failed=true to get the failed reallocations to retry. I found that in some blog post for a different problem, not sure if it's best practice; couldn't find it documented as a typical recovery step.

Any thoughts on how to make this more robust, or identify why the data nodes crashed instead of continuing in the expected degraded but more easily recoverable state?

Also: would it be reasonable to add the per-node shard count to the /_nodes/stats output for easier consumption by external monitoring/alerting? For now I've added a collectd curl_json collector against /_cat/allocation?s=node&format=json, but that has some annoying limitations (it can't parse out the node names, so we just get array indices).
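As a rough sketch of the workaround described above, the array returned by /_cat/allocation?format=json can be reshaped into a mapping keyed by node name before it reaches the monitoring system. The sample rows and node names are made up for illustration, and `shards_per_node` is a hypothetical helper, not part of any Elasticsearch client.

```python
import json


def shards_per_node(cat_allocation_json: str) -> dict:
    """Map node name -> shard count from `_cat/allocation?format=json` output."""
    rows = json.loads(cat_allocation_json)
    counts = {}
    for row in rows:
        # The cat APIs return numeric columns as strings, so convert explicitly.
        counts[row["node"]] = int(row["shards"])
    return counts


# Made-up sample rows; a real response carries more columns (disk usage, host, ip).
SAMPLE = """[
  {"shards": "1001", "node": "es-data-1"},
  {"shards": "999", "node": "es-data-2"},
  {"shards": "12", "node": "UNASSIGNED"}
]"""

if __name__ == "__main__":
    for node, count in sorted(shards_per_node(SAMPLE).items()):
        print(node, count)
```

Keying by node name sidesteps the array-index problem, since each metric can then be labelled with the node it belongs to.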

Elasticsearch logs an error and refuses to create any more indices when you already have an unreasonably large number of shards, but that won't cause any processes to exit. Therefore I think that the fact that the last few messages before stopping were about cluster.max_shards_per_node is a distraction, and Elasticsearch didn't actually manage to log any messages about why it was shutting down. The usual reason for this is the OOM killer, and the usual reason for invoking that is setting the maximum heap size to be greater than 50% of the total memory on the machine. Even 50% is pretty close to the wire TBH, although it's a bit more resilient in 7.2.0; you don't say which version exactly you're using but it's always good to share that.

It is unusual indeed to need to use POST /_cluster/reroute?retry_failed to retry allocations after a restart. If a shard is unexpectedly unallocated you would be better off using the allocation explain API to work out why.

*Sigh* There it is.

Jul 29 00:04:44 esxxx4 kernel: Killed process 32263 (java) total-vm:28861588kB, anon-rss:9446536kB, file-rss:0kB, shmem-rss:0kB

10GB host memory and -Xmx6g. We did have that carefully tuned to the host size, knowing full well the heap size requirements; looks like somewhere along the line it's been updated incorrectly when more memory was added to the host. (The production servers happily are still tuned down below 50%.)
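To make the sizing problem concrete, here is a minimal check of the heap-to-host-memory ratio against the 50% guideline mentioned earlier in the thread. The function names are hypothetical and the numbers are the ones quoted in this post (10GB host, -Xmx6g).

```python
def heap_fraction(heap_gb: float, host_memory_gb: float) -> float:
    """Fraction of host memory claimed by the JVM heap."""
    return heap_gb / host_memory_gb


def heap_within_guideline(heap_gb: float, host_memory_gb: float,
                          max_fraction: float = 0.5) -> bool:
    """True if the heap is at or below the guideline fraction of host memory."""
    return heap_fraction(heap_gb, host_memory_gb) <= max_fraction


if __name__ == "__main__":
    # QA hosts as described: 10GB of memory with a 6GB heap.
    print(heap_within_guideline(6, 10))
    # The same host with the heap tuned back down to half of memory.
    print(heap_within_guideline(5, 10))
```

A check like this could run in configuration management whenever host memory changes, which is the step that appears to have been missed here.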

I did initially look at the allocation explain API to try and diagnose the problem. I don't have the full output saved, but the per-node allocation decisions given were spread over these two forms, which didn't give much more to go on:

the shard cannot be allocated to the same node on which a copy of the shard already exists [[sg6-auditlog-2019.02.20][1]  ...
shard has exceeded the maximum number of retries [5] on failed allocation attempts 

("why did it fail?" "because it failed too many times") Is there a way to get more useful detail? What might I search for in node logs?

Notwithstanding the above:

  1. This is the first OOM case we've seen since before any memory sizing change (1+ year), and we've had a lot more shards in that time than we currently have; the situation where the newly defined shard limit kicks in would seem to be creating a memory consumption issue. Our host memory monitoring in Grafana shows an unusually sharp increase in used memory from 00:00Z until node crash, on each node. I'd say that's still not "preventing destabilization of the cluster".
  2. Any thoughts on the previous comment about adding node shard counts to the stats API?

This is Elasticsearch 7.1.1 on CentOS 7.4.

The allocation explain API should also be reporting the last error, including the time at which it occurred so that you can correlate it with other messages in your logs.
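As an illustration of where to look, the sketch below pulls the last-failure fields out of an allocation explain response that has already been parsed into a dict. The `unassigned_info` field names are the ones I believe the API uses; the sample values (timestamp, details text, `primary` flag) are invented placeholders rather than output from this cluster, and `last_allocation_failure` is a hypothetical helper.

```python
def last_allocation_failure(explain_response: dict) -> dict:
    """Extract the last-failure summary from an allocation explain response."""
    info = explain_response.get("unassigned_info", {})
    return {
        "reason": info.get("reason"),
        "at": info.get("at"),  # timestamp to correlate with node logs
        "failed_attempts": info.get("failed_allocation_attempts"),
        "details": info.get("details"),
    }


# Invented sample; only the index name and shard number come from the thread.
SAMPLE_RESPONSE = {
    "index": "sg6-auditlog-2019.02.20",
    "shard": 1,
    "primary": False,
    "current_state": "unassigned",
    "unassigned_info": {
        "reason": "ALLOCATION_FAILED",
        "at": "2019-07-29T00:10:00.000Z",
        "failed_allocation_attempts": 5,
        "details": "placeholder for the real failure message",
    },
}

if __name__ == "__main__":
    for key, value in last_allocation_failure(SAMPLE_RESPONSE).items():
        print(f"{key}: {value}")
```

The `at` value is the one to search for in the node logs around the same time.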

The fundamental issue here was that the heap size is set too high. Elasticsearch has other mechanisms for preventing overload too, but these are largely dependent on the assumption that the heap size is set correctly. Also I recommend upgrading to 7.2.1 since the default memory configuration in 7.2.x is now less likely to trigger the OOM killer than it was in earlier versions.

1000 shards per node is a very, very conservative limit. With a 5GB heap you really should be aiming for more like 100 shards per node, given the rule of thumb of 20 shards per GB of heap.
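The arithmetic behind that figure, as a small sketch. The 20-shards-per-GB value is the rule of thumb quoted above, not a hard limit, and the function name is made up for illustration.

```python
def recommended_max_shards(heap_gb: float, shards_per_gb: int = 20) -> int:
    """Rule-of-thumb shard target per node for a given heap size."""
    return int(heap_gb * shards_per_gb)


if __name__ == "__main__":
    # 5GB heap, i.e. half of the 10GB hosts discussed in this thread.
    print(recommended_max_shards(5))
```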

On adding per-node shard counts to the stats API: not really, no. It sounds feasible, but you can also derive it from stats that are already exposed. I think it wouldn't have helped here, because the 1000-shards-per-node limit isn't enforced on a node-by-node basis: it's a cluster-wide limit that's simply calculated as number-of-data-nodes*1000.
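A simplified model of that cluster-wide check, using the numbers from the error message at the top of the thread (6 data nodes, 6000 open shards, 2 new shards). This is a sketch of the arithmetic only, not Elasticsearch's actual validation code, and the function names are hypothetical.

```python
def cluster_shard_limit(data_nodes: int, max_shards_per_node: int = 1000) -> int:
    """Cluster-wide cap: number of data nodes times the per-node setting."""
    return data_nodes * max_shards_per_node


def would_exceed_limit(open_shards: int, new_shards: int, data_nodes: int,
                       max_shards_per_node: int = 1000) -> bool:
    """True if adding new_shards would push the cluster past the cap."""
    limit = cluster_shard_limit(data_nodes, max_shards_per_node)
    return open_shards + new_shards > limit


if __name__ == "__main__":
    print(cluster_shard_limit(6))
    print(would_exceed_limit(open_shards=6000, new_shards=2, data_nodes=6))
```

Because only the total is compared, an uneven spread (say 1500 shards on one node and 500 on another) passes the same check as an even one, which is why a per-node count would not have flagged this earlier.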