I have a cluster with 5 master nodes, 12 coordinator nodes and 60 data nodes. We are currently doing heavy indexing into this ES cluster, around 15 billion documents spread through the day. We have 3 indices undergoing heavy indexing, with four rollovers per day for each index. Each index has 100 shards and the replica count is set to 1. The nodes run on physical servers with 200GB of RAM, each node has around 32GB of heap, and translog durability is set to async.
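For reference, the per-index settings are roughly equivalent to this (the index name here is hypothetical):

# sketch of the index settings described above; the index name is made up
curl -XPUT 'localhost:9200/heavy-index-000001' -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index.number_of_shards": 100,
    "index.number_of_replicas": 1,
    "index.translog.durability": "async"
  }
}'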
I am getting the following error and the node goes down.
[2019-07-03T21:28:59,331][ERROR][o.e.t.n.Netty4Utils ] fatal error on the network layer
at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:140)
[2019-07-03T21:28:59,336][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [data_8] fatal error in thread [Thread-34032], exiting
java.lang.OutOfMemoryError: Java heap space
[2019-07-03T21:28:59,397][WARN ][o.e.t.n.Netty4Transport ] [data_8] exception caught on transport layer [[id: 0x07b04e0b, L:/56.241.23.137:9303 - R:/56.241.23.147:35014]], closing connection
org.elasticsearch.ElasticsearchException: java.lang.OutOfMemoryError: Java heap space
Please suggest. This is a very trivial issue we are facing.
There have been several posts about a very similar cluster over the last few days and suggestions have been given. Have any of these made any difference? If the problem is trivial, why has it not been resolved?
No @dadoonet.
But going through his errors and similar posts, I found that we are doing the same thing, just at a slightly smaller scale and with a slightly smaller cluster compared to @Sourabh. And I am hitting the same issue.
I did suggest decreasing the number of primary shards, but you still report having 100. Did you add any nodes to increase the available amount of heap? I thought you had 60 data nodes before I suggested adding more.
Exactly which suggestions did you try and what was the effect?
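For reference, the shard count of an existing index cannot be changed in place; the usual way to apply a lower count is through the index template that the next rollover index will match, roughly like this (template and pattern names are hypothetical):

# sketch: make the next rollover index be created with 60 primaries
# (legacy template API; template name and pattern are made up)
curl -XPUT 'localhost:9200/_template/heavy-index-template' -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["heavy-index-*"],
  "settings": {
    "index.number_of_shards": 60,
    "index.number_of_replicas": 1
  }
}'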
I have decreased the number of primary shards to 60, but I am still getting the following error.
[2019-07-11T07:54:17,148][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [data_15] fatal error in thread [Thread-142209], exiting
java.lang.OutOfMemoryError: Java heap space
at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:656) ~[?:?]
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:237) ~[?:?]
at io.netty.buffer.PoolArena.allocate(PoolArena.java:221) ~[?:?]
at io.netty.buffer.PoolArena.allocate(PoolArena.java:141) ~[?:?]
at io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:272) ~[?:?]
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:160) ~[?:?]
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:151) ~[?:?]
at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:133) ~[?:?]
at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:73) ~[?:?]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:642) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:527) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:481) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:441) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
Caused by: java.lang.OutOfMemoryError: Java heap space
at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:656) ~[?:?]
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:237) ~[?:?]
at io.netty.buffer.PoolArena.allocate(PoolArena.java:221) ~[?:?]
at io.netty.buffer.PoolArena.allocate(PoolArena.java:141) ~[?:?]
at io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:272) ~[?:?]
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:160) ~[?:?]
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:151) ~[?:?]
at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:133) ~[?:?]
at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:73) ~[?:?]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) ~[?:?]
... 6 more
SymbolTable statistics:
Number of buckets : 20011 = 160088 bytes, avg 8.000
Number of entries : 148658 = 3567792 bytes, avg 24.000
Number of literals : 148658 = 9189616 bytes, avg 61.817
Total footprint : = 12917496 bytes
Average bucket size : 7.429
Variance of bucket size : 7.497
Std. dev. of bucket size: 2.738
Maximum bucket size : 20
StringTable statistics:
Number of buckets : 500009 = 4000072 bytes, avg 8.000
Number of entries : 23025 = 552600 bytes, avg 24.000
Number of literals : 23025 = 3138896 bytes, avg 136.326
Total footprint : = 7691568 bytes
Average bucket size : 0.046
Variance of bucket size : 0.046
Std. dev. of bucket size: 0.215
Maximum bucket size : 3
Due to resource restrictions I can't increase the number of data nodes, but I have reduced the number of shards to 60. The amount of heap available to each data node is around 32GB.
A typical data node has 64GB RAM out of which ~30GB is allocated for heap. If your hosts have 200GB RAM you should be able to run 3 data nodes per host. This will naturally reduce the size of the OS page cache.
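Note also that 32GB of heap is past the point where the JVM loses compressed object pointers, so each node may be getting less effective heap than a ~30GB setting would give. A sketch of the relevant jvm.options lines (the exact file path depends on the install):

# config/jvm.options on each data node: keep heap below the
# compressed-oops threshold (~31GB on most JVMs)
-Xms30g
-Xmx30g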
The servers that have 200GB of RAM have low storage capacity; if I increase the number of data nodes there will be a lot of reallocation of shards across the data nodes.
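For what it's worth, rebalancing churn can be throttled while nodes are added. A minimal sketch using the cluster settings API (the values are illustrative, not tuned for this cluster):

# limit concurrent shard rebalances and cap recovery bandwidth
curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 2,
    "indices.recovery.max_bytes_per_sec": "40mb"
  }
}'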
How much RAM do the nodes with large storage capacity have? If you could highlight the hardware profile of the different node types it would be easier to provide guidance.
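One quick way to pull that profile straight from the cluster:

# per-node RAM, heap ceiling and roles
curl 'localhost:9200/_cat/nodes?v&h=name,ram.max,heap.max,node.role'
# per-node shard counts and disk totals
curl 'localhost:9200/_cat/allocation?v'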
And in the logs today I saw a new IndexWriter-is-closed exception:
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
When I run the free -g command on the servers where the nodes are running, sometimes there is no free memory available; all of it is sitting in buffers/cache. I have to free the cache manually.
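(The exact command isn't named above; presumably it is the usual page-cache drop, along these lines. Note that buff/cache memory is reclaimable, so the kernel normally frees it on its own when applications need it:)

free -g
# drop the kernel page cache; usually unnecessary, since the kernel
# reclaims buff/cache automatically under memory pressure
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches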
I have 14 servers with 500GB of RAM each running one data node, and the remaining servers have around 200GB of RAM with 2 data nodes running on each (presumably 23 such servers, giving 14 + 46 = 60 data nodes).