Elasticsearch going down on coordinator node because of JVM heap space issue

Hi,
We have an 11-node ES cluster in the AWS cloud, version ES 6.6.2, which is performing intensive bulk ingestion with very few search queries. PFB the node configuration:

Node   Type                  Instance type   JVM heap configured
1      Master, Data, Ingest  i3.2xlarge      30g
2      Master, Data, Ingest  i3.2xlarge      30g
3      Master, Data, Ingest  i3.2xlarge      30g
4      Data, Ingest          i3.2xlarge      30g
5      Data, Ingest          i3.2xlarge      30g
6      Data, Ingest          i3.2xlarge      30g
7      Data, Ingest          i3.2xlarge      30g
8      Data, Ingest          i3.2xlarge      30g
9      Data, Ingest          i3.2xlarge      30g
10     Data, Ingest          i3.2xlarge      30g
11     Coordinator           m5.large        5g

The Elasticsearch service on the coordinator node alone is failing frequently with a Java heap space out-of-memory error, as shown in the logs below:

[2019-12-23T23:56:01,923][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [ip-10-13-39-98.ec2.internal] fatal error in thread [Thread-5], exiting
java.lang.OutOfMemoryError: Java heap space
        at io.netty.buffer.UnpooledHeapByteBuf.allocateArray(UnpooledHeapByteBuf.java:88) ~[?:?]
        at io.netty.buffer.UnpooledByteBufAllocator$InstrumentedUnpooledHeapByteBuf.allocateArray(UnpooledByteBufAllocator.java:164) ~[?:?]
        at io.netty.buffer.UnpooledHeapByteBuf.<init>(UnpooledHeapByteBuf.java:61) ~[?:?]
        at io.netty.buffer.UnpooledByteBufAllocator$InstrumentedUnpooledHeapByteBuf.<init>(UnpooledByteBufAllocator.java:159) ~[?:?]
        at io.netty.buffer.UnpooledByteBufAllocator.newHeapBuffer(UnpooledByteBufAllocator.java:82) ~[?:?]
        at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:166) ~[?:?]
        at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:157) ~[?:?]
        at io.netty.buffer.Unpooled.buffer(Unpooled.java:116) ~[?:?]
        at io.netty.buffer.Unpooled.copiedBuffer(Unpooled.java:409) ~[?:?]
        at org.elasticsearch.http.netty4.Netty4HttpRequestHandler.channelRead0(Netty4HttpRequestHandler.java:73) ~[?:?]
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
        at org.elasticsearch.http.netty4.pipelining.HttpPipeliningHandler.channelRead(HttpPipeliningHandler.java:68) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) ~[?:?]
        at io.netty.handler.codec.MessageToMessageCodec.channelRead(MessageToMessageCodec.java:111) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) ~[?:?]

PFB the heap dump analysis:

[heap dump analysis screenshot: "Problem_suspect_2"]

Could you help us identify the root cause of this issue and how it can be mitigated? We cannot compromise on the ingestion rate. Initially the JVM heap configured for the coordinator node was 4g; we increased it to 5g, but that has not solved the problem.

Please let me know if you need more details

Maybe you are sending too many and too big bulk requests to the coordinating node? If so, increase the JVM heap to something bigger, like 8gb or 16gb. Maybe you need to ask AWS support? The AWS service is not managed by us.
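
For reference, on a self-managed node the heap is set in config/jvm.options; the values below are only an example of what a bump to 8gb would look like (keep -Xms and -Xmx equal):

-Xms8g
-Xmx8g

Note that an m5.large only has 8 GB of RAM, and the usual guidance is to leave about half of the machine's memory to the OS and file system cache, so a heap that size would also mean moving the coordinating node to a larger instance.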

We do have this official offer: https://www.elastic.co/cloud and https://aws.amazon.com/marketplace/pp/B01N6YCISK. It also runs on AWS and is always up to date with all the latest improvements.

Cloud by Elastic is one way to have access to all features, all managed by us. Think about what is already there, like Security, Monitoring, Reporting, SQL, Canvas, APM, Logs UI, Infra UI, SIEM, Maps UI, and what is coming next :slight_smile: ...

Hi David,
We are not sending any bulk requests to the coordinator node. We are using the Python ES client with sniffing of the 10 nodes (Data, Ingest and Master) for bulk ingestion. The coordinator node is not configured anywhere for receiving bulk requests. The ingest, data and master nodes have 30g configured as heap space, and everything is going well on those nodes.

What are you using the Coordinator node for?

We are using it for hosting Kibana for monitoring the entire cluster.
We are also triggering snapshots and maybe some deletes through this coordinator node.

If you have sniffing enabled I believe the client will find the coordinating node and send data there. Do you see any connections from the Python clients to this node?
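
For context, a typical elasticsearch-py connection with sniffing enabled looks roughly like this (host names and settings here are illustrative, not taken from your configuration):

from elasticsearch import Elasticsearch

es = Elasticsearch(
    ["data-node-1:9200", "data-node-2:9200"],  # seed hosts passed by the application
    sniff_on_start=True,                       # fetch the cluster's node list at startup
    sniff_on_connection_fail=True,             # re-sniff if a connection drops
    sniffer_timeout=60                         # and refresh the node list periodically
)

With options like these the client rebuilds its connection pool from whatever the nodes info API returns, which as far as I know includes coordinating-only nodes (the default filter only skips dedicated master nodes), so bulk requests can end up on that node even though it was never listed as a seed host.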

Is there any query by which I can check and confirm whether there are any connections established to this node?

Use a tool like netstat to see where your clients are connecting.
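
For example, something like this on the coordinating node (exact flags vary by OS) would list established TCP connections to the HTTP port, so you can see which client machines are talking to it:

netstat -tn | grep :9200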

Hi @Stephy_Jacob

for monitoring the entire cluster

You don't by any chance send your monitoring data to that node (that would be a bulk request as well, and potentially a very sizeable one)?

Other than that, maybe just check the tasks API for that coordinator node and see what you find. That should show pending bulk ingestions afaik. Feel free to paste the output for the node here for us to take a look; maybe it's something other than bulk ingestion that we can spot from the tasks after all.

As a side note, this kind of OOM failure caused by large bulk HTTP requests is much improved in the upcoming ES 7.6, where we removed the allocation that is failing here (see https://github.com/elastic/elasticsearch/pull/44564 if interested).

Hi @Armin_Braun,
We checked the tasks API on the coordinator node's IP and found that there are some bulk writes, even though we have not configured this IP (the coordinator node) for sniffing for bulk ingestion.

We see both direct bulk write requests and rerouted write requests there. Could you please explain why bulk write requests and rerouted write requests are hitting this node when it is configured only as a coordinator node with no shards on it?

PS: We don't have a load balancer in place. We have configured only the data nodes for sniffing to get an ES connection, and that connection is used to ingest with the bulk API.

PFB a snippet of the tasks API output. The full tasks API response contains many tasks similar to those in the snippet below.

GET _tasks?nodes=ip-10-xx-xx-xxx.ec2.internal&human&actions=indices:*&detailed

"gl4L2YENT8GE1kLbD9EF5Q:861742" : {
"node" : "gl4L2YENT8GE1kLbD9EF5Q",
"id" : 861742,
"type" : "transport",
"action" : "indices:data/write/bulk",
"description" : "requests[1000], indices[tgs_xxxxxx_xxxx_c2-5-000086]",
"start_time" : "2019-12-26T05:57:57.185Z",
"start_time_in_millis" : 1577339877185,
"running_time" : "2.1s",
"running_time_in_nanos" : 2145269865,
"cancellable" : false,
"parent_task_id" : "gl4L2YENT8GE1kLbD9EF5Q:566962",
"headers" : { }
},
"gl4L2YENT8GE1kLbD9EF5Q:861737" : {
"node" : "gl4L2YENT8GE1kLbD9EF5Q",
"id" : 861737,
"type" : "transport",
"action" : "indices:data/write/bulk[s]",
"status" : {
"phase" : "rerouted"
},
"description" : "requests[333], index[tgs_xxxxxx_xxxx_c2-5-000092]",
"start_time" : "2019-12-26T05:57:57.120Z",
"start_time_in_millis" : 1577339877120,
"running_time" : "2.2s",
"running_time_in_nanos" : 2209866093,
"cancellable" : false,
"parent_task_id" : "gl4L2YENT8GE1kLbD9EF5Q:861736",
"headers" : { }
},

There are no actual writes executing on the coordinator, I think. All that's happening is that the coordinator is hit with write requests that are then rerouted. That seems to be a problem with the clients rather than with ES itself.

I'm afraid that's all the information I can extract from the above.

Can you try disabling sniffing, since you know the exact nodes you want to connect to? If the coordinating-only node is only meant to serve a local Kibana instance, you can also make it bind to localhost.
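
A rough sketch of the first option, assuming the standard elasticsearch-py client and illustrative host names (adjust to your real node list):

from elasticsearch import Elasticsearch

# Connect only to the data/ingest nodes you already know; with sniffing off,
# the coordinating-only node is never added to the connection pool.
es = Elasticsearch(
    ["data-node-1:9200", "data-node-2:9200"],  # list all ten data/ingest nodes here
    sniff_on_start=False,
    sniff_on_connection_fail=False
)

For the second option, setting http.host: 127.0.0.1 in the coordinating node's elasticsearch.yml should restrict its HTTP (REST) interface to the local machine, so only the co-located Kibana can reach it, while the transport layer stays reachable by the rest of the cluster.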

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.