Hello! First post, so apologies if I'm leaving something important out.
We are running into a problem roughly weekly where heap utilization floats up to near 100% after a while and eventually takes down Elasticsearch, requiring a kill -9 to end the process.
Cluster looks like this:
- 6 servers, 128 GB of memory each
- 3 nodes are combined master/data nodes
- 1 node is a combined web/data node
- ES is allocated a bit under 32 GB of heap (31.something GB) to stay below the compressed oops threshold (see the sanity-check sketch after this list)
- 6 shards per index (one per server) plus 1 replica
- paging (swap) disabled
- Indexing 1-10 TB of data per day via bulk requests
- no plugins (I saw the post about a memory leak in the SSL plugin, not using that)
- Nothing else is running on the machines.
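For reference, this is roughly how I double-check the heap sizing and memory locking on each node. It's just a rough Python sketch using `requests`; the localhost URL is a placeholder, and the field paths are what I expect from the node info API (they may differ slightly across versions):

```python
# Sanity-check sketch: confirm per-node heap max and memory locking.
# Assumes the `requests` library and a node reachable on localhost:9200.
import requests

ES_URL = "http://localhost:9200"

info = requests.get(ES_URL + "/_nodes/jvm,process").json()

for node_id, node in info["nodes"].items():
    heap_max_gb = node["jvm"]["mem"]["heap_max_in_bytes"] / 1024 ** 3
    mlockall = node.get("process", {}).get("mlockall")
    print("{name}: heap_max={heap:.1f} GB, mlockall={mlock}".format(
        name=node["name"], heap=heap_max_gb, mlock=mlockall))
    # Compressed oops require the max heap to stay below ~32 GB;
    # at or above that the JVM silently falls back to uncompressed pointers.
    if heap_max_gb >= 32:
        print("  WARNING: heap >= 32 GB, compressed oops likely disabled")
```

That check comes back clean on all nodes as far as I can tell.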
What we're seeing is that the heap gradually climbs to near 100%, at which point the process stops responding and the node has to be restarted. I understand that heap usage will generally grow over time as caches are kept around, but I can't imagine it should max itself out like this.
Before a node craps out, we see logs like this:
[WARN ][monitor.jvm ] [gc][old][80118][4448] duration [27.6s], collections [1]/[27.8s], total [27.6s]/[11.1m], memory [30.7gb]->[30.6gb]/[31.9gb], all_pools {[young] [146.2mb]->[153.6mb]/[153.6mb]}{[survivor] [50.8mb]->[847.3kb]/[51.1mb]}{[old] [30.5gb]->[30.4gb]/[31.7gb]}
The old-generation GC spends nearly 30 seconds to reclaim... about 100 MB. This is while there is no activity on the cluster at all, no searches or indexing.
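To put numbers on how little each old collection reclaims, I'm planning to watch the node stats roughly like this (another quick Python sketch; the URL and polling interval are placeholders, and the young/old collector names are how they show up in the node stats API for us):

```python
# Rough monitoring sketch: per node, report how many old-gen GCs ran
# in each interval, how long they took, and how the heap % moved.
import time
import requests

ES_URL = "http://localhost:9200"
POLL_SECONDS = 60

previous = {}  # node name -> (old gc count, old gc time ms, heap used %)

while True:
    stats = requests.get(ES_URL + "/_nodes/stats/jvm").json()
    for node in stats["nodes"].values():
        name = node["name"]
        jvm = node["jvm"]
        old = jvm["gc"]["collectors"]["old"]
        current = (old["collection_count"],
                   old["collection_time_in_millis"],
                   jvm["mem"]["heap_used_percent"])
        if name in previous:
            d_count = current[0] - previous[name][0]
            d_time = current[1] - previous[name][1]
            d_heap = current[2] - previous[name][2]
            if d_count:
                print("%s: %d old GCs took %d ms, heap moved %+d%% (now %d%%)"
                      % (name, d_count, d_time, d_heap, current[2]))
        previous[name] = current
    time.sleep(POLL_SECONDS)
```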
Heap usage captured in Grafana:
The graph above shows the standing heap usage for a while; then a few nodes come back online after being restarted, at much lower heap. Then I restart the rest of the nodes, heap usage drops, and it rises a little as the cluster rebalances.
I'm really unsure how to proceed and fix this - I'm about to start attaching JVM tooling to inspect the heap.
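Before (or alongside) the heap dumps, I'm also going to compare the obvious heap consumers from the node stats, since things like fielddata and segment memory stay live on the heap and won't be reclaimed by a GC. Another rough Python sketch, with the URL as a placeholder and the field names as I expect them from the node stats API:

```python
# Heap-consumer breakdown sketch: fielddata and segment memory per node,
# the usual suspects before reaching for a full heap dump.
import requests

ES_URL = "http://localhost:9200"

stats = requests.get(ES_URL + "/_nodes/stats/indices").json()

for node in stats["nodes"].values():
    indices = node["indices"]
    fielddata_mb = indices.get("fielddata", {}).get("memory_size_in_bytes", 0) / 1024 ** 2
    segments_mb = indices.get("segments", {}).get("memory_in_bytes", 0) / 1024 ** 2
    print("%s: fielddata=%.0f MB, segment memory=%.0f MB"
          % (node["name"], fielddata_mb, segments_mb))
```

Any help is greatly appreciated.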