Node goes down with a fatal error in the network layer, a thread error, and a Java heap space error

I have a cluster with 5 master nodes, 12 coordinator nodes, and 60 data nodes. I am currently doing heavy indexing into this ES cluster: around 15 billion documents spread through the day. We have 3 indices undergoing heavy indexing, with four rollovers per day for each index. Each index has 100 primary shards and the replica count is set to 1. The nodes run on physical servers with 200 GB of RAM, each node has around 32 GB of heap, and translog durability is set to async.
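As a rough sanity check of the shard churn implied by that setup, here is a back-of-the-envelope sketch using only the figures quoted above (3 indices, 4 rollovers per day, 100 primaries, 1 replica, 60 data nodes):

```python
# Back-of-the-envelope shard math for the cluster described above.
# All figures come from the post; nothing here queries a live cluster.
indices = 3            # indices under heavy write load
rollovers_per_day = 4  # rollovers per index per day
primaries = 100        # primary shards per index
replicas = 1           # replica copies per primary
data_nodes = 60

shards_per_day = indices * rollovers_per_day * primaries * (1 + replicas)
shards_per_node_per_day = shards_per_day // data_nodes

print(shards_per_day)           # 2400 new shard copies created per day
print(shards_per_node_per_day)  # 40 per data node per day
```

If each rollover index only holds a fraction of a day's data, each of those primaries ends up quite small, which is one reason the advice in this thread is to reduce the shard count.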

I am getting the following error and the node goes down.

[2019-07-03T21:28:59,331][ERROR][o.e.t.n.Netty4Utils      ] fatal error on the network layer
	at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:140)

[2019-07-03T21:28:59,336][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [data_8] fatal error in thread [Thread-34032], exiting
java.lang.OutOfMemoryError: Java heap space

[2019-07-03T21:28:59,397][WARN ][o.e.t.n.Netty4Transport  ] [data_8] exception caught on transport layer [[id: 0x07b04e0b, L:/56.241.23.137:9303 - R:/56.241.23.147:35014]], closing connection
org.elasticsearch.ElasticsearchException: java.lang.OutOfMemoryError: Java heap space

Please suggest. This is a very trivial issue we are facing.

I am having the same issue.
Whenever a heavy query is fired on my cluster, a timeout exception occurs and sometimes a node goes down.

Hoping to hear from you soon.

There have been several posts about a very similar cluster over the last few days and suggestions have been given. Have any of these made any difference? If the problem is trivial, why has it not been resolved?

Hi @Christian_Dahlqvist

I have tried those suggestions, but they haven't made any difference for me.

Are you working with @Sourabh?

No, @dadoonet.
But going through his errors and similar posts, I found that we are doing the same thing at a slightly smaller scale, with a slightly smaller cluster, compared to @Sourabh. And I am hitting the same issue.

Then it's better to open your own question and give the details of your cluster, such as the output of:

GET /
GET /_cat/nodes?v
GET /_cat/health?v
GET /_cat/indices?v

If some outputs are too big, please share them on gist.github.com and link them here.

I did suggest decreasing the number of primary shards, but you still report having 100. Did you add any nodes to increase the amount of heap available to the cluster? I believe you had 60 data nodes when I suggested adding more.

Exactly which suggestions did you try and what was the effect?

I have decreased the number of primary shards to 60, but I am still getting the following error:

[2019-07-11T07:54:17,148][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [data_15] fatal error in thread [Thread-142209], exiting
java.lang.OutOfMemoryError: Java heap space
at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:656) ~[?:?]
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:237) ~[?:?]
at io.netty.buffer.PoolArena.allocate(PoolArena.java:221) ~[?:?]
at io.netty.buffer.PoolArena.allocate(PoolArena.java:141) ~[?:?]
at io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:272) ~[?:?]
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:160) ~[?:?]
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:151) ~[?:?]
at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:133) ~[?:?]
at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:73) ~[?:?]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:642) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:527) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:481) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:441) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
Caused by: java.lang.OutOfMemoryError: Java heap space
at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:656) ~[?:?]
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:237) ~[?:?]
at io.netty.buffer.PoolArena.allocate(PoolArena.java:221) ~[?:?]
at io.netty.buffer.PoolArena.allocate(PoolArena.java:141) ~[?:?]
at io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:272) ~[?:?]
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:160) ~[?:?]
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:151) ~[?:?]
at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:133) ~[?:?]
at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:73) ~[?:?]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) ~[?:?]
... 6 more
SymbolTable statistics:
Number of buckets : 20011 = 160088 bytes, avg 8.000
Number of entries : 148658 = 3567792 bytes, avg 24.000
Number of literals : 148658 = 9189616 bytes, avg 61.817
Total footprint : = 12917496 bytes
Average bucket size : 7.429
Variance of bucket size : 7.497
Std. dev. of bucket size: 2.738
Maximum bucket size : 20
StringTable statistics:
Number of buckets : 500009 = 4000072 bytes, avg 8.000
Number of entries : 23025 = 552600 bytes, avg 24.000
Number of literals : 23025 = 3138896 bytes, avg 136.326
Total footprint : = 7691568 bytes
Average bucket size : 0.046
Variance of bucket size : 0.046
Std. dev. of bucket size: 0.215
Maximum bucket size : 3

Did you increase the number of data nodes and thus the total amount of heap available to the cluster?

Due to resource restrictions I can't increase the number of data nodes, but I have reduced the number of shards to 60. The amount of heap available to each data node is around 32 GB.

A typical data node has 64 GB of RAM, of which ~30 GB is allocated to the heap. If your hosts have 200 GB of RAM, you should be able to run 3 data nodes per host. This will naturally reduce the size of the OS page cache.
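To make the arithmetic behind that suggestion explicit, here is a sketch assuming ~30 GB of heap per node and the common rule of thumb that heaps should use at most about half of a host's RAM, leaving the rest for the OS page cache (the 50% figure is an assumption, not from this thread):

```python
host_ram_gb = 200       # RAM per physical host, as stated in the thread
heap_per_node_gb = 30   # recommended heap, kept below the compressed-oops cutoff

# Rule of thumb: heaps should consume at most ~50% of the host's RAM,
# leaving the remainder for the OS page cache and other processes.
max_nodes = int((host_ram_gb * 0.5) // heap_per_node_gb)
ram_used_by_heaps = max_nodes * heap_per_node_gb

print(max_nodes)                        # 3 data nodes per host
print(host_ram_gb - ram_used_by_heaps)  # 110 GB left for page cache etc.
```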

The servers that have 200 GB of RAM have low storage capacity. If I increase the number of data nodes, there will be a lot of reallocation of shards across data nodes.

How much RAM do the nodes with large storage capacity have? If you could describe the hardware profile of the different node types, it would be easier to provide guidance.

Don't set a 32 GB heap; 64-bit pointers will kick in. Stick to <30 GB.

What do 64-bit pointers mean?

And every day one or more nodes go down with a heap space error, and in the logs I see old GC runs.

And in the logs today I saw a new "IndexWriter is closed" exception:

org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed

When I run the free -g command on the servers where the nodes are running, sometimes there is no free memory available; all of the free memory is in buffers/cache, and I have to free the cache using the command.

It means that object references on the heap are 8 bytes instead of 4, so they consume more memory, cache, and memory bandwidth.
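A sketch of why the usual advice is to stay a bit below 32 GB: with compressed ordinary object pointers ("compressed oops"), HotSpot stores 4-byte references that index 8-byte-aligned objects, so they can address at most 2³² × 8 bytes = 32 GiB of heap. At or above that size, references fall back to full 8-byte pointers. (These are standard HotSpot defaults, not figures from this thread's cluster.)

```python
# Why compressed oops stop working above ~32 GiB of heap.
ref_bits = 32   # width of a compressed reference, in bits
alignment = 8   # default HotSpot object alignment, in bytes

# A 32-bit reference scaled by the 8-byte alignment covers:
max_compressed_heap = (2 ** ref_bits) * alignment   # in bytes
print(max_compressed_heap // 2**30)                  # 32 GiB

# Above that limit references double to 8 bytes, so a 32 GB heap can
# hold *fewer* objects than a ~30 GB heap with compressed oops enabled.
```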

Hi @Christian_Dahlqvist

I have 14 servers with 500 GB of RAM, each running one data node; the remaining servers have around 200 GB of RAM, with 2 data nodes running on each of them.