From what I can see, a similar thing happened across all nodes:
An aggregation from Kibana tried to load fielddata for a field, which would have pushed fielddata just over the allowed limit (9.2 GB against an 8.9 GB limit), so the fielddata circuit breaker tripped.
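For anyone following along, that limit is the fielddata breaker, controlled by the dynamic `indices.breaker.fielddata.limit` cluster setting. A minimal sketch of checking and raising it, assuming a node reachable on localhost:9200 (adjust host/auth for your setup):

```bash
# Check the configured fielddata breaker limit (include_defaults is needed
# to see the value when it hasn't been overridden).
curl -s 'localhost:9200/_cluster/settings?include_defaults=true&filter_path=*.indices.breaker.fielddata.limit&pretty'

# Raise it dynamically, as a percentage of JVM heap (example value only).
curl -s -X PUT 'localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"indices.breaker.fielddata.limit": "60%"}}'
```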
Within a minute, all nodes entered a stop-the-world round of GC. While GC was happening, inter-node communication broke down.
At least three of the nodes ran out of JVM heap space.
As inter-node comms broke down, the data nodes lost track of the elected master, and at that point they started rejecting REST requests.
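If anyone wants to confirm this kind of cascade on their own cluster, per-node heap usage and old-generation GC activity are visible in the nodes stats API. A rough sketch (the host is just a placeholder):

```bash
# Per-node heap usage and old-gen GC counts/times; long or frequent old
# collections line up with the stop-the-world pauses described above.
curl -s 'localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc.collectors.old&pretty'
```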
All of this happened before the above scenarios. When I finally got the cluster back together, this is what it looked like.
I started up a failover cluster because the live cluster remains in a hobbled state (it won't assign shards from a snapshot restore to any node, even though they all have capacity).
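For the unassigned-shard side of this, the allocation explain API spells out why a given shard isn't being placed. A sketch, with localhost:9200 again as a placeholder:

```bash
# Explain why the first unassigned shard isn't being allocated.
curl -s 'localhost:9200/_cluster/allocation/explain?pretty'

# Or list every shard with its state and unassigned reason code.
curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason'
```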
When restoring a snapshot to the failover cluster I got the same sort of stats as above (see the _cat/nodes sketch after this list):
- Low heap.percent
- ram.percent at 99%
- Low used RAM as reported by netdata
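For context, heap.percent and ram.percent are _cat/nodes columns; pulling the raw heap and RAM values alongside the percentages makes it easier to compare them. A sketch (host is a placeholder):

```bash
# Heap and RAM columns side by side; ram.percent is derived from
# ram.current / ram.max, not from the JVM heap figures.
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.current,heap.percent,heap.max,ram.current,ram.percent,ram.max'
```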
So my question remains: What is ram.percent? It's not an indication of system RAM. It's not JVM heap usage. So what resource is it, and how can I increase it if I need to?