Hi,
I have a problem with coordinator node memory usage on one of our Elasticsearch 5.6.3 clusters. We have 3 coordinator nodes sitting in front of 6 data nodes. Normally I size our coordinators with 8GB RAM and a 5-6GB heap...
However, in this case I've been seeing OOM errors, and even after raising the heap to 13GB and then 26GB, heap usage sits at over 90% most of the time. Can someone help me understand what is causing this and how to fix it?
Workload is as follows:
client search requests from Kibana (Kibana can't get a response from the coordinators)
bulk indexing from multiple fluentd indexers running in 4 Kubernetes clusters
direct searches to the API (probably low volume)
We're ingesting about 600 million to 1 billion log lines per day with indices in the 250GB to 350GB range. The data nodes are under load and I'm planning to add more. But the coordinator node behaviour has me confused - they're normally pretty quiet. Any help would be greatly appreciated...
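For what it's worth, a back-of-envelope estimate of how much bulk payload the coordinators might be holding at once suggested the heap sizing was plausible to blow through. All the numbers below (indexer counts, worker threads, chunk sizes, backlog factor) are assumptions I plugged in, not measured values:

```python
# Rough back-of-envelope for coordinator heap held by in-flight bulks.
# Every input number here is an assumption, not a measured value.

def inflight_bulk_bytes(indexers, workers_per_indexer, bulk_mb, queued_factor):
    """Estimate bytes of bulk payload a coordinator may hold at once.

    indexers            -- number of fluentd indexer processes sending bulks
    workers_per_indexer -- concurrent flush threads per indexer
    bulk_mb             -- chunk size each flush sends, in MB
    queued_factor       -- bulks backed up per worker while data nodes lag
    """
    return indexers * workers_per_indexer * bulk_mb * queued_factor * 1024 * 1024

# Example: 4 clusters x 8 indexers, 4 flush workers each, 16 MB chunks,
# 3 bulks backed up per worker while the data nodes fall behind:
estimate = inflight_bulk_bytes(4 * 8, 4, 16, 3)
print(f"{estimate / 1024 ** 3:.1f} GiB potentially held on the coordinator")
```

With those (hypothetical) inputs that's 6 GiB of raw payload on a single coordinator before any parsing or per-request overhead, so a slow data tier can plausibly eat most of a 5-6GB heap.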
Hi @warkolm ,
I'd like to return to this topic if we could. I have installed X-Pack and am still seeing coordinator nodes dying periodically by running out of heap. I'm attaching screenshots to provide more context:
Our workload is logging to daily indices, and the flow rate you see in the screenshots is about normal. The ingest nodes currently act as coordinator nodes only. They've also been allocated larger heaps, but that hasn't resolved the problem. It's not clear exactly what is exhausting the heap or what steps we should take to remedy it...
@warkolm Note that I see bulk rejections in the log prior to heap exhaustion. Bulk queue size is 400 on the data nodes. Does X-Pack graph bulk queue size and rejections? I don't see that...
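In the meantime I've been pulling queue depth and rejection counts out of `GET _nodes/stats/thread_pool` myself. A sketch of what I'm doing, against a trimmed-down, made-up sample of the 5.6 response shape (the real response has many more pools and fields, and the node names/numbers below are invented):

```python
import json

# Hypothetical, trimmed-down shape of GET _nodes/stats/thread_pool on 5.6;
# node names and counts are invented for illustration.
SAMPLE = json.loads("""
{
  "nodes": {
    "abc123": {
      "name": "data-1",
      "thread_pool": {
        "bulk": {"queue": 396, "rejected": 1287, "active": 32}
      }
    },
    "def456": {
      "name": "data-2",
      "thread_pool": {
        "bulk": {"queue": 12, "rejected": 0, "active": 30}
      }
    }
  }
}
""")

def bulk_pressure(stats):
    """Return {node_name: (queue, rejected)} for the bulk thread pool."""
    out = {}
    for node in stats["nodes"].values():
        pool = node["thread_pool"]["bulk"]
        out[node["name"]] = (pool["queue"], pool["rejected"])
    return out

for name, (queue, rejected) in sorted(bulk_pressure(SAMPLE).items()):
    print(f"{name}: queue={queue} rejected={rejected}")
```

Polling this periodically shows which data nodes are running with a near-full bulk queue (close to the 400 limit) versus actively rejecting.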
@warkolm Am I right in thinking that coordinator nodes hold inbound bulks on heap before relaying them to the data nodes? So, as the data nodes fall behind, the coordinator heap will fill?
If I enable an ingest pipeline on these nodes, will that also mean the ingest nodes maintain their own bulk queue? It appears to me that while the data nodes are rejecting bulks, the coordinators are not, and that may be the cause of the issue here?
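If that theory is right, I'd expect the `in_flight_requests` circuit breaker on the coordinators to show the buffered bytes (in 5.x its limit, `network.breaker.inflight_requests.limit`, defaults to 100% of heap, so it can let the heap fill almost completely before tripping). A sketch of how I'd read it from `GET _nodes/stats/breaker`, using an invented sample response (node name and byte counts are made up):

```python
import json

# Hypothetical, trimmed shape of GET _nodes/stats/breaker on 5.6.
# The in_flight_requests breaker tracks bytes of inbound network
# requests (bulks included) currently held on a node; byte counts
# below are invented (26 GiB limit, 22.5 GiB in use).
SAMPLE = json.loads("""
{
  "nodes": {
    "xyz789": {
      "name": "coord-1",
      "breakers": {
        "in_flight_requests": {
          "limit_size_in_bytes": 27917287424,
          "estimated_size_in_bytes": 24159191040,
          "tripped": 0
        }
      }
    }
  }
}
""")

def inflight_usage(stats):
    """Return {node_name: fraction of the in-flight breaker limit in use}."""
    out = {}
    for node in stats["nodes"].values():
        b = node["breakers"]["in_flight_requests"]
        out[node["name"]] = b["estimated_size_in_bytes"] / b["limit_size_in_bytes"]
    return out

for name, frac in inflight_usage(SAMPLE).items():
    print(f"{name}: in-flight requests at {frac:.0%} of the breaker limit")
```

A high `estimated_size_in_bytes` on the coordinators while the data nodes reject would support the theory that the bulks are piling up on the coordinator heap.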