Node goes down with a fatal error in the network layer, a thread error, and a Java heap space error

I have a cluster with 5 master nodes, 12 coordinator nodes, and 60 data nodes. I am currently doing heavy indexing into this ES cluster: around 15 billion documents spread through the day. We have 3 indices undergoing heavy indexing, with four rollovers per day for each index. Each index has 100 primary shards and the replica count is set to 1. The nodes run on physical servers with 200 GB of RAM, each node has around 32 GB of heap, and translog durability is set to async.
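As a rough sanity check of the shard churn implied by that setup, here is a back-of-the-envelope sketch using only the figures quoted above (3 indices, 4 rollovers per day, 100 primaries, 1 replica, 60 data nodes):

```python
# Back-of-the-envelope shard math for the cluster described above.
# All figures come from the post; nothing here queries a live cluster.
indices = 3            # indices under heavy write load
rollovers_per_day = 4  # rollovers per index per day
primaries = 100        # primary shards per index
replicas = 1           # replica copies per primary
data_nodes = 60

shards_per_day = indices * rollovers_per_day * primaries * (1 + replicas)
shards_per_node_per_day = shards_per_day // data_nodes

print(shards_per_day)           # 2400 new shard copies created per day
print(shards_per_node_per_day)  # 40 per data node per day
```

If each rollover index only holds a fraction of a day's data, each of those primaries ends up quite small, which is one reason the advice in this thread is to reduce the shard count.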

I am getting the following error and the node goes down.

[2019-07-03T21:28:59,331][ERROR][o.e.t.n.Netty4Utils      ] fatal error on the network layer
	at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:140)

[2019-07-03T21:28:59,336][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [data_8] fatal error in thread [Thread-34032], exiting
java.lang.OutOfMemoryError: Java heap space

[2019-07-03T21:28:59,397][WARN ][o.e.t.n.Netty4Transport  ] [data_8] exception caught on transport layer [[id: 0x07b04e0b, L:/56.241.23.137:9303 - R:/56.241.23.147:35014]], closing connection
org.elasticsearch.ElasticsearchException: java.lang.OutOfMemoryError: Java heap space

Please suggest. This is a very trivial issue we are facing.

I am having the same issue.
Whenever a heavy query is fired on my cluster, a timeout exception occurs and sometimes a node goes down.

Hoping to hear from you soon.

There have been several posts about a very similar cluster over the last few days and suggestions have been given. Have any of these made any difference? If the problem is trivial, why has it not been resolved?

Hi @Christian_Dahlqvist

I have tried those suggestions, but they haven't made any difference for me.

Are you working with @Sourabh?

No, @dadoonet.
But going through his errors and similar posts, I found that we are doing the same thing at a slightly smaller scale, with a slightly smaller cluster, compared to @Sourabh. And I am hitting the same issue.

Then it's better to open your own question and give the details of your cluster, such as the output of:

GET /
GET /_cat/nodes?v
GET /_cat/health?v
GET /_cat/indices?v

If some outputs are too big, please share them on gist.github.com and link them here.

I did suggest decreasing the number of primary shards, but you still report having 100. Did you add any nodes to increase the amount of heap available to the cluster? I believe you had 60 data nodes when I suggested adding more.

Exactly which suggestions did you try and what was the effect?

I have decreased the number of primary shards to 60, but I am still getting the following error:

[2019-07-11T07:54:17,148][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [data_15] fatal error in thread [Thread-142209], exiting
java.lang.OutOfMemoryError: Java heap space
at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:656) ~[?:?]
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:237) ~[?:?]
at io.netty.buffer.PoolArena.allocate(PoolArena.java:221) ~[?:?]
at io.netty.buffer.PoolArena.allocate(PoolArena.java:141) ~[?:?]
at io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:272) ~[?:?]
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:160) ~[?:?]
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:151) ~[?:?]
at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:133) ~[?:?]
at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:73) ~[?:?]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:642) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:527) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:481) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:441) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
Caused by: java.lang.OutOfMemoryError: Java heap space
at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:656) ~[?:?]
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:237) ~[?:?]
at io.netty.buffer.PoolArena.allocate(PoolArena.java:221) ~[?:?]
at io.netty.buffer.PoolArena.allocate(PoolArena.java:141) ~[?:?]
at io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:272) ~[?:?]
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:160) ~[?:?]
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:151) ~[?:?]
at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:133) ~[?:?]
at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:73) ~[?:?]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) ~[?:?]
... 6 more
SymbolTable statistics:
Number of buckets : 20011 = 160088 bytes, avg 8.000
Number of entries : 148658 = 3567792 bytes, avg 24.000
Number of literals : 148658 = 9189616 bytes, avg 61.817
Total footprint : = 12917496 bytes
Average bucket size : 7.429
Variance of bucket size : 7.497
Std. dev. of bucket size: 2.738
Maximum bucket size : 20
StringTable statistics:
Number of buckets : 500009 = 4000072 bytes, avg 8.000
Number of entries : 23025 = 552600 bytes, avg 24.000
Number of literals : 23025 = 3138896 bytes, avg 136.326
Total footprint : = 7691568 bytes
Average bucket size : 0.046
Variance of bucket size : 0.046
Std. dev. of bucket size: 0.215
Maximum bucket size : 3

Did you increase the number of data nodes and thus the total amount of heap available to the cluster?

Due to resource restrictions I can't increase the number of data nodes, but I have reduced the number of shards to 60. The amount of heap available to each data node is around 32 GB.

A typical data node has 64 GB of RAM, of which ~30 GB is allocated to the heap. If your hosts have 200 GB of RAM, you should be able to run 3 data nodes per host. This will naturally reduce the size of the OS page cache.
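To make the arithmetic behind that suggestion explicit, here is a sketch assuming ~30 GB of heap per node and the common rule of thumb that heaps should use at most about half of a host's RAM, leaving the rest for the OS page cache (the 50% figure is an assumption, not from this thread):

```python
host_ram_gb = 200       # RAM per physical host, as stated in the thread
heap_per_node_gb = 30   # recommended heap, kept below the compressed-oops cutoff

# Rule of thumb: heaps should consume at most ~50% of the host's RAM,
# leaving the remainder for the OS page cache and other processes.
max_nodes = int((host_ram_gb * 0.5) // heap_per_node_gb)
ram_used_by_heaps = max_nodes * heap_per_node_gb

print(max_nodes)                        # 3 data nodes per host
print(host_ram_gb - ram_used_by_heaps)  # 110 GB left for page cache etc.
```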

The servers that have 200 GB of RAM have low storage capacity. If I increase the number of data nodes, there will be a lot of reallocation of shards across data nodes.

How much RAM do the nodes with large storage capacity have? If you could describe the hardware profile of the different node types, it would be easier to provide guidance.

Don't set a 32 GB heap; 64-bit pointers will kick in. Stick to <30 GB.

What do 64-bit pointers mean?

And every day one or more nodes go down with a heap space error, and in the logs I see old GC runs.

And in the logs today I saw a new "IndexWriter is closed" exception:

org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed

When I run the free -g command on the servers where the nodes are running, sometimes there is no free memory available; all of the free memory is in buffers/cache, and I have to free the cache using the command.

It means that object references on the heap are 8 bytes instead of 4, so they consume more memory, cache, and memory bandwidth.
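A sketch of why the usual advice is to stay a bit below 32 GB: with compressed ordinary object pointers ("compressed oops"), HotSpot stores 4-byte references that index 8-byte-aligned objects, so they can address at most 2³² × 8 bytes = 32 GiB of heap. At or above that size, references fall back to full 8-byte pointers. (These are standard HotSpot defaults, not figures from this thread's cluster.)

```python
# Why compressed oops stop working above ~32 GiB of heap.
ref_bits = 32   # width of a compressed reference, in bits
alignment = 8   # default HotSpot object alignment, in bytes

# A 32-bit reference scaled by the 8-byte alignment covers:
max_compressed_heap = (2 ** ref_bits) * alignment   # in bytes
print(max_compressed_heap // 2**30)                  # 32 GiB

# Above that limit references double to 8 bytes, so a 32 GB heap can
# hold *fewer* objects than a ~30 GB heap with compressed oops enabled.
```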

Hi @Christian_Dahlqvist

I have 14 servers with 500 GB of RAM, each running one data node; the remaining servers have around 200 GB of RAM, with 2 data nodes running on each of them.