Regarding Coordinator Node Heap Usage

dawiro · October 11, 2018, 3:02pm

Hi,
I have a problem with coordinator node memory usage on one of our Elasticsearch 5.6.3 clusters. We have 3 coordinator nodes sitting in front of 6 data nodes. Normally, I size our coordinators with 8GB RAM and a 5-6GB heap...
However, in this case I've been seeing OOM errors and have raised the heap size to 13GB and then 26GB, and heap usage is sitting at over 90% most of the time. Can someone help me understand what is causing this and how to fix it?

Workload is as follows:

client search requests from kibana (kibana can't get a response from the coordinators)
bulk indexing from multiple fluentd indexers running in 4 x kubernetes clusters
direct searches to the api (probably low)

We're ingesting about 600 million to 1 billion log lines per day with indices in the 250GB to 350GB range. The data nodes are under load and I'm planning to add more. But the coordinator node behaviour has me confused - they're normally pretty quiet. Any help would be greatly appreciated...

Regards,
D

warkolm · October 11, 2018, 8:36pm

You should install X-Pack and enable Monitoring to see what is happening.
Upgrading to 6.4.2 would also be useful.

dawiro · October 11, 2018, 8:50pm

How does upgrading to 6.x from 5.x compare, relatively speaking, to upgrading to 5.x from 2.x?

warkolm · October 11, 2018, 8:51pm

It's much simpler. Just make sure you are on the latest 5.X release and you can do a rolling upgrade.

dawiro · October 11, 2018, 8:53pm

Is it paid or free x-pack that will give the insight you're referring to?

warkolm · October 11, 2018, 8:54pm

It's part of the Basic license, which is free.

dawiro · October 12, 2018, 7:49am

Why does this page state that a full cluster restart is required?

https://www.elastic.co/guide/en/elasticsearch/reference/6.0/breaking-changes.html

dawiro · October 17, 2018, 7:21am

@warkolm, did you miss my comment above? Just looking for clarification if you could...

warkolm · October 17, 2018, 7:32am

https://www.elastic.co/guide/en/elasticsearch/reference/6.0/rolling-upgrades.html is the one you want. It mentions you can do a rolling upgrade from 5.6.X to 6.X.

dawiro · October 18, 2018, 12:49pm

Ok thx. How should the x-pack plugin be treated when upgrading to 6.4.2? Uninstall before, or after?

nik9000 · October 18, 2018, 2:08pm

Uninstall before.

dawiro · October 26, 2018, 7:35am

Hi @warkolm ,
I'd like to return to this topic if we could. I have installed x-pack and am still seeing coordinator nodes dying periodically by running out of heap. Am attaching screenshots to provide more context:

Our workload is logging to daily indices and the flow rate you see in the screenshots is about normal. The ingest nodes, atm, work as coordinator nodes only. They've also been allocated larger heaps but that hasn't resolved the problem. It's not clear exactly what is exhausting the heap or what steps we should take to remedy it...

Regards,
D

warkolm · October 27, 2018, 12:35am

Can you just confirm what times the OOM occurs in the graphs?

dawiro · October 29, 2018, 11:39am

Here's the last line in the log from the evening prior to the day I posted:

java.lang.OutOfMemoryError: Java heap space
[2018-10-25T21:23:57,321][INFO ][o.e.m.j.JvmGcMonitorService] [ingest-001] [gc][old][138330][415] duration [18.1s], collections [3]/[9.8s], total [18.1s]/[14.3m], memory [13.6gb]->[13.6gb]/[13.6gb], all_pools {[young] [133.1mb]->[133.1mb]/[133.1mb]}{[survivor] [16.2mb]->[16.6mb]/[16.6mb]}{[old] [13.5gb]->[13.5gb]/[13.5gb]}
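
For anyone reading the log line above: each `{[pool] [before]->[after]/[max]}` group shows a heap pool's size before and after the collection, against its maximum. A minimal sketch for pulling those numbers out follows. The regex and helper names are my own, written against this one line, so treat them as an illustration rather than a general log parser.

```python
import re

# One "{[pool] [before]->[after]/[max]}" group from a JvmGcMonitorService line.
# The pattern is an assumption based on the single line quoted in this thread.
SIZE = r"\[([\d.]+)(mb|gb|kb|b)\]"
POOL_RE = re.compile(r"\{\[(\w+)\] " + SIZE + r"->" + SIZE + r"/" + SIZE + r"\}")

UNIT_BYTES = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def to_bytes(value, unit):
    """Convert a size such as ('13.5', 'gb') to bytes."""
    return float(value) * UNIT_BYTES[unit]


def pool_usage_after_gc(line):
    """Return {pool name: fraction of the pool's max still used after GC}."""
    usage = {}
    for name, _before, _before_unit, after, after_unit, maximum, max_unit in POOL_RE.findall(line):
        usage[name] = to_bytes(after, after_unit) / to_bytes(maximum, max_unit)
    return usage


# The GC line from the post above, minus the timestamp and logger prefix.
line = (
    "[gc][old][138330][415] duration [18.1s], collections [3]/[9.8s], "
    "total [18.1s]/[14.3m], memory [13.6gb]->[13.6gb]/[13.6gb], "
    "all_pools {[young] [133.1mb]->[133.1mb]/[133.1mb]}"
    "{[survivor] [16.2mb]->[16.6mb]/[16.6mb]}"
    "{[old] [13.5gb]->[13.5gb]/[13.5gb]}"
)

for pool, fraction in pool_usage_after_gc(line).items():
    print(f"{pool}: {fraction:.0%} of max after GC")
```

In this line every pool is at its maximum after the collection, and the old generation did not shrink at all (13.5gb before and after), which is consistent with the `OutOfMemoryError` that precedes it.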

dawiro · October 29, 2018, 11:43am

@warkolm Note that I see bulk rejections in the log prior to heap exhaustion. Bulk queue size is 400 on the data nodes. Does x-pack graph bulk queue size and rejections? I don't see that...
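
Independently of what the monitoring UI graphs, the node stats API (`GET _nodes/stats/thread_pool`) reports `queue` and `rejected` counters per thread pool on every node. Below is a rough sketch that summarises the `bulk` pool from an already-parsed response. The sample dictionary is made up for illustration (node IDs and all numbers are hypothetical), and the pool name is a parameter because, as far as I know, later 6.x releases rename `bulk` to `write`.

```python
def bulk_pool_summary(nodes_stats, pool="bulk"):
    """Return (node name, queue, rejected) tuples, most rejections first.

    `nodes_stats` is the parsed JSON body of GET _nodes/stats/thread_pool.
    """
    rows = []
    for node in nodes_stats.get("nodes", {}).values():
        stats = node.get("thread_pool", {}).get(pool)
        if stats is None:
            continue  # this node does not report the requested pool
        rows.append((node.get("name", "?"), stats.get("queue", 0), stats.get("rejected", 0)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical response fragment; IDs and values are invented for the example.
sample = {
    "nodes": {
        "id-a": {"name": "ingest-001", "thread_pool": {"bulk": {"queue": 0, "rejected": 0}}},
        "id-b": {"name": "data-001", "thread_pool": {"bulk": {"queue": 400, "rejected": 1250}}},
        "id-c": {"name": "data-002", "thread_pool": {"bulk": {"queue": 12, "rejected": 3}}},
    }
}

for name, queue, rejected in bulk_pool_summary(sample):
    print(f"{name}: queue={queue} rejected={rejected}")
```

Polling this periodically and comparing successive `rejected` values shows which nodes are pushing back and when, which can then be lined up against the heap graphs.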

warkolm · October 29, 2018, 8:02pm

Oh, you're using a lot of Graph?

dawiro · October 30, 2018, 8:31am

@warkolm Am I right in thinking that coordinator nodes hold inbound bulks on heap before relaying them to the data nodes? So, as data nodes fall behind the heap will fill?

If I enable an ingest pipeline on these nodes, will that also mean the ingest nodes maintain their own bulk queue? It appears to me that while the data nodes are rejecting, the coordinators are not, and that may be the cause of the issue here?

warkolm · October 30, 2018, 8:48am

Yes.

I don't know enough about how that works to comment with any authority, sorry.

dawiro · October 30, 2018, 10:55am

@warkolm Could you see if you can get an answer on that? What is the recommendation here? Should we be sending bulks directly to the data nodes?

warkolm · October 30, 2018, 8:56pm

We'd suggest that, yes. The other main recommendation would be to not send them to master-only nodes.
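
To act on that suggestion, the indexers need a list of data node addresses. One possible way to build it is from the nodes info API (`GET _nodes/http`), which, as I understand it, lists each node's `roles` and HTTP `publish_address`. This is only a sketch: the sample response is hypothetical (names, IDs and addresses are invented), and the function works on an already-parsed response rather than calling the cluster.

```python
def data_node_http_addresses(nodes_info):
    """Return sorted HTTP publish addresses of nodes that hold the data role.

    `nodes_info` is the parsed JSON body of GET _nodes/http.
    Master-only and coordinating-only nodes are skipped because they
    do not list "data" among their roles.
    """
    addresses = []
    for node in nodes_info.get("nodes", {}).values():
        if "data" not in node.get("roles", []):
            continue
        address = node.get("http", {}).get("publish_address")
        if address:
            addresses.append(address)
    return sorted(addresses)


# Hypothetical response fragment; every name and address is invented.
sample_info = {
    "nodes": {
        "id-a": {"name": "master-001", "roles": ["master"],
                 "http": {"publish_address": "10.0.0.1:9200"}},
        "id-b": {"name": "ingest-001", "roles": ["ingest"],
                 "http": {"publish_address": "10.0.0.2:9200"}},
        "id-c": {"name": "data-001", "roles": ["data"],
                 "http": {"publish_address": "10.0.0.11:9200"}},
        "id-d": {"name": "data-002", "roles": ["data"],
                 "http": {"publish_address": "10.0.0.12:9200"}},
    }
}

print(data_node_http_addresses(sample_info))
```

The resulting list could be fed to whatever host setting the indexers use, so that bulk requests go to data nodes and never to master-only nodes.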
