Regarding Coordinator Node Heap Usage

Hi,
I have a problem with coordinator node memory usage on one of our Elasticsearch 5.6.3 clusters. We have 3 x coordinator nodes sitting in front of 6 data nodes. Normally I size our coordinators with 8GB RAM and a 5-6GB heap...
However, in this case I've been seeing OOM errors, so I've raised the heap to 13GB and then to 26GB, and heap usage still sits at over 90% most of the time. Can someone help me understand what is causing this and how to fix it?
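(If it's useful for comparison, the per-node heap limit and current usage can be checked with something like this; the host is a placeholder:)

    curl 'localhost:9200/_cat/nodes?v&h=name,node.role,heap.max,heap.percent'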

Workload is as follows:

  • client search requests from Kibana (Kibana can't get a response from the coordinators)
  • bulk indexing from multiple fluentd indexers running in 4 x Kubernetes clusters
  • direct searches to the API (probably low volume)

We're ingesting about 600 million to 1 billion log lines per day, with indices in the 250GB to 350GB range. The data nodes are under load and I'm planning to add more, but the coordinator node behaviour has me confused - they're normally pretty quiet. Any help would be greatly appreciated...

Regards,
D

You should install X-Pack and enable Monitoring to see what is happening.
Upgrading to 6.4.2 is also useful.
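On a 5.6.x node that's roughly the following (a sketch assuming the archive layout; the plugin has to go onto every node, and each node needs a restart afterwards):

    bin/elasticsearch-plugin install x-pack
    # and on the Kibana side:
    bin/kibana-plugin install x-pack

Monitoring collection should be on by default once X-Pack is installed, and the Monitoring UI will show per-node heap, GC and indexing rates.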

How does upgrading to 6.x from 5.x compare, relatively speaking, to upgrading to 5.x from 2.x?

It's much simpler; just make sure you are on the latest 5.X release and you can do a rolling upgrade :slight_smile:

Is it paid or free x-pack that will give the insight you're referring to?

It's part of the Basic license, which is free.

Why does this page state that a full cluster restart is required?

https://www.elastic.co/guide/en/elasticsearch/reference/6.0/breaking-changes.html

@warkolm, did you miss my comment above? Just looking for clarification if you could...

https://www.elastic.co/guide/en/elasticsearch/reference/6.0/rolling-upgrades.html is the one you want; it mentions you can do a rolling upgrade from 5.6.X to 6.X.
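The per-node flow from that page is roughly the following (a sketch; hosts are placeholders):

    # 1. pause shard reallocation while the node is down
    curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
    { "persistent": { "cluster.routing.allocation.enable": "none" } }'
    # 2. optional synced flush to speed up shard recovery
    curl -XPOST 'localhost:9200/_flush/synced'
    # 3. stop the node, upgrade the package and plugins, start it and wait for it to rejoin
    # 4. re-enable allocation and wait for the cluster to go green before the next node
    curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
    { "persistent": { "cluster.routing.allocation.enable": "all" } }'
    curl 'localhost:9200/_cat/health?v'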

Ok thx. How should the x-pack plugin be treated when upgrading to 6.4.2? Uninstall before, or after?

Uninstall before.
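i.e. on each node, before swapping the Elasticsearch package over:

    bin/elasticsearch-plugin remove x-pack

From 6.3 onwards X-Pack ships bundled with Elasticsearch, so there's nothing to reinstall on 6.4.2 afterwards.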

Hi @warkolm ,
I'd like to return to this topic if we could. I have installed x-pack and am still seeing coordinator nodes dying periodically by running out of heap. I'm attaching screenshots to provide more context:


[monitoring screenshots attached]

Our workload is logging to daily indices, and the flow rate you see in the screenshots is about normal. The ingest nodes currently act as coordinator nodes only. They've also been allocated larger heaps, but that hasn't resolved the problem. It's not clear exactly what is exhausting the heap or what steps we should take to remedy it...
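(In case it helps narrow things down, per-node heap and circuit-breaker stats for one of the affected nodes can be pulled like this; ingest-001 is one of our nodes, the host is a placeholder:)

    curl 'localhost:9200/_nodes/ingest-001/stats/jvm,breaker?pretty'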

Regards,
D

Can you confirm what times the OOMs occur in the graphs?

Here's the last line in the log from the evening prior to the day I posted:

java.lang.OutOfMemoryError: Java heap space
[2018-10-25T21:23:57,321][INFO ][o.e.m.j.JvmGcMonitorService] [ingest-001] [gc][old][138330][415] duration [18.1s], collections [3]/[9.8s], total [18.1s]/[14.3m], memory [13.6gb]->[13.6gb]/[13.6gb], all_pools {[young] [133.1mb]->[133.1mb]/[133.1mb]}{[survivor] [16.2mb]->[16.6mb]/[16.6mb]}{[old] [13.5gb]->[13.5gb]/[13.5gb]}

@warkolm Note that I see bulk rejections in the log prior to the heap exhaustion. The bulk queue size is 400 on the data nodes. Does x-pack graph bulk queue size and rejections? I don't see that...
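(The rejection counters can at least be pulled by hand in the meantime; the host is a placeholder, and the pool is named bulk on 5.x/6.x:)

    curl 'localhost:9200/_cat/thread_pool/bulk?v&h=node_name,name,active,queue,queue_size,rejected'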

Oh, you're using a lot of Graph?

@warkolm Am I right in thinking that coordinator nodes hold inbound bulk requests on heap before relaying them to the data nodes? So, as the data nodes fall behind, the coordinator heap will fill?

If I enable an ingest pipeline on these nodes, will that also mean the ingest nodes maintain their own bulk queue? It appears to me that while the data nodes are rejecting, the coordinators are not, and that may be the cause of the issue here?

Yes.

I don't know enough about how that works to comment with any authority, sorry :frowning:

@warkolm Could you see if you can get an answer on that? What is the recommendation here? Should we be sending bulks direct to the data nodes?

We'd suggest that, yes. The other main recommendation would be to not send them to master-only nodes.
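In practice that means pointing the bulk clients (the fluentd outputs in this case) at the data node addresses instead of the coordinators. A hand-rolled bulk against a data node would look something like this (the hostname and payload file are placeholders):

    curl -XPOST 'http://data-node-1:9200/_bulk' \
      -H 'Content-Type: application/x-ndjson' \
      --data-binary @bulk-payload.ndjson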