I had a 5-node cluster go TU (tits up) over the weekend.
AWS instance type: m4.2xlarge
vm.swappiness is set correctly (according to advice given here: https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html)
ES_MAX_HEAP is set to 50% of the available RAM (so around 16GB on this instance type).
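For reference, this is the sort of thing I mean (the exact mechanism depends on ES version; a jvm.options-style fragment is shown here, and the values are assumed, not copied from my config):

```
-Xms16g
-Xmx16g
```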
Take a look at the node:
Then compare with the Used RAM gauge on my netdata dashboard, taken at the same time:
So system RAM usage is low and JVM heap usage is low, but ram.percent sits at 99!
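The figures above came from the cat nodes API; something like this (the exact column list is an assumption, but heap.percent and ram.percent are real _cat/nodes columns):

```shell
# Command fragment -- needs a live cluster, host/port assumed
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,node.role,master'
```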
Thanks for any help offered!
What exactly happened? What is in the logs?
From what I can see a similar thing happened across all nodes:
- An agg from Kibana attempted to load fielddata for a field, pushing fielddata just over the allowed limit (9.2GB requested against an 8.9GB limit), so a circuit breaker kicked in.
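For anyone hitting the same breaker: the limit in play looks like the fielddata circuit breaker, which can be inspected via node stats and adjusted dynamically. A sketch, assuming a transient change is acceptable (the 60% value is an assumption, and raising the limit trades breaker protection for OOM risk):

```shell
# Command fragment -- needs a live cluster, host/port assumed

# Check current breaker limits and trip counts
curl -s 'localhost:9200/_nodes/stats/breaker?pretty'

# Raise the fielddata breaker limit (transient, reverts on restart)
curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": { "indices.breaker.fielddata.limit": "60%" }
}'
```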
- Within a minute, all nodes entered a stop-the-world round of GC. While GC is happening, inter-node communication breaks down.
- At least 3 of the nodes ran out of JVM heap space.
- As inter-node comms broke down, data nodes lost track of the elected master. At this point they started suppressing REST requests.
All of this happened before the stats above were captured. When I finally got the cluster back together, this is what it looked like.
I started up a failover cluster, because the live cluster remains in a hobbled state (it won't assign shards from a snapshot restore to any node, despite the fact that they all have capacity).
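To dig into why the live cluster won't assign the restored shards, the cluster allocation explain API should report which allocation deciders are blocking (it exists in 5.x and later; my version may differ). With no body it explains the first unassigned shard it finds:

```shell
# Command fragment -- needs a live cluster, host/port assumed
curl -s -XGET 'localhost:9200/_cluster/allocation/explain?pretty'
```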
When restoring a snapshot to the failover cluster I got the same sort of stats as above:
- low heap.percent
- 99% ram.percent
- Low used RAM via netdata
So my question remains: what is ram.percent? It's not an indication of system RAM usage. It's not JVM heap usage. So what resource is it, and how can I increase it if I need to?
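One theory I'd like to check: a mostly-idle box can still read as ~99% "used" if the filesystem page cache is counted as used memory. Here's the back-of-the-envelope arithmetic, with assumed numbers (MB), not actual values from my nodes:

```shell
# Hypothetical meminfo-style numbers for a ~32GB box (all assumed)
total=32768
free=300
cached=28000

# Percent "used" if the page cache counts as used memory
incl_cache=$(( (total - free) * 100 / total ))

# Percent "used" if the page cache counts as reclaimable
excl_cache=$(( (total - free - cached) * 100 / total ))

echo "used incl. cache: ${incl_cache}%"
echo "used excl. cache: ${excl_cache}%"
```

With these assumed numbers the first figure comes out at 99% and the second at 13%, which would match the gap between ram.percent and my netdata gauge.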
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.