I have an Elasticsearch cluster that had some nodes crash with "java.lang.OutOfMemoryError: unable to create native thread". I found other posts here with the same error, but nothing useful yet.
The cluster has 28 data nodes (ES v7.1.1), 4 master nodes, and 4 client-only nodes spread across four physical CentOS 7 hosts (7 data, 1 master, and 1 client node on each). Each host has 512 GB RAM and 28 cores, and each data node has its own 800 GB RAID set. JVM heap is set to 31 GB, 4 GB, and 8 GB for data, master, and client nodes respectively, and the thread pool limits haven't been changed from the defaults.
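For reference, those heap sizes are set per node in jvm.options; this is just an illustrative fragment of the stock mechanism, showing the data-node values described above:

```
# jvm.options fragment for a data node (illustrative; per the sizes above,
# masters would use -Xms4g/-Xmx4g and clients -Xms8g/-Xmx8g)
-Xms31g
-Xmx31g
```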
Data-wise, there are 2,122 indices, 25,369 shards, roughly 45 billion docs, and 15 TB of data in total. About 500 million docs come in per day, with roughly the same number deleted, so total size is fairly stable.
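Spread evenly, that shard count works out to around 900 shards per data node; a quick sanity check:

```shell
# Total shards divided across the 28 data nodes (integer division):
echo $(( 25369 / 28 ))   # prints 906
```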
9 of the 28 data nodes, on two of the hosts, crashed within about 2 seconds of each other with the same error. It seems really odd that, if they were hitting resource limits "naturally", they would hit those limits that close together on two different physical machines.
I was able to start the nodes back up, but I need to find a root cause before it happens again. The same data set ran on a 6-VM cluster (3 master, 3 client, 12 data nodes) for months with no problems, so the new setup should, if anything, be more comfortable.
Below are the error, the process limits, and the thread pool settings.
[2020-01-25T03:02:11,541][WARN ][i.n.c.AbstractChannelHandlerContext] [xxxxxxxels107-1] An exception 'java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
at java.lang.Thread.start0(Native Method) ~[?:?]
at java.lang.Thread.start(Thread.java:804) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1354) ~[?:?]
at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:98) ~[elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.transport.TcpTransport.handleRequest(TcpTransport.java:1036) ~[elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:922) ~[elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:753) ~[elasticsearch-7.1.1.jar:7.1.1]
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:53) ~[transport-netty4-client-7.1.1.jar:7.1.1]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) [netty-codec-4.1.32.Final.jar:4.1.32.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:656) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:556) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) [netty-common-4.1.32.Final.jar:4.1.32.Final]
at java.lang.Thread.run(Thread.java:835) [?:?]
Process limits for one of the data nodes:
cat /proc/53839/limits
Limit                     Soft Limit   Hard Limit   Units
Max cpu time              unlimited    unlimited    seconds
Max file size             unlimited    unlimited    bytes
Max data size             unlimited    unlimited    bytes
Max stack size            8388608      unlimited    bytes
Max core file size        0            unlimited    bytes
Max resident set          unlimited    unlimited    bytes
Max processes             4096         4096         processes
Max open files            65535        65535        files
Max locked memory         65536        65536        bytes
Max address space         unlimited    unlimited    bytes
Max file locks            unlimited    unlimited    locks
Max pending signals       2061963      2061963      signals
Max msgqueue size         819200       819200       bytes
Max nice priority         0            0
Max realtime priority     0            0
Max realtime timeout      unlimited    unlimited    us
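One thing worth noting about the limits above: on Linux, the "Max processes" (nproc) limit is enforced per user against the total thread count across all of that user's processes, so if all seven data nodes on a host run as the same user, they share that 4096 budget. A minimal sketch for checking the counts (the PID here is the current shell's, as a runnable stand-in; substitute a data node's actual PID, e.g. 53839 from the output above):

```shell
# Stand-in PID so the sketch runs anywhere; use a node's real PID instead.
ES_PID=$$

# Threads in this one process:
grep '^Threads:' /proc/${ES_PID}/status

# Threads across ALL processes owned by the same user -- this is the number
# the 4096 "Max processes" (nproc) soft limit is actually checked against:
ps -L -u "$(id -un)" -o lwp= | wc -l

# nproc soft limit for the current shell:
ulimit -u
```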
Thread pool settings:
"thread_pool" : {
  "watcher" : {
    "type" : "fixed",
    "size" : 56,
    "queue_size" : 1000
  },
  "force_merge" : {
    "type" : "fixed",
    "size" : 1,
    "queue_size" : -1
  },
  "security-token-key" : {
    "type" : "fixed",
    "size" : 1,
    "queue_size" : 1000
  },
  "fetch_shard_started" : {
    "type" : "scaling",
    "core" : 1,
    "max" : 112,
    "keep_alive" : "5m",
    "queue_size" : -1
  },
  "listener" : {
    "type" : "fixed",
    "size" : 10,
    "queue_size" : -1
  },
  "refresh" : {
    "type" : "scaling",
    "core" : 1,
    "max" : 10,
    "keep_alive" : "5m",
    "queue_size" : -1
  },
  "generic" : {
    "type" : "scaling",
    "core" : 4,
    "max" : 224,
    "keep_alive" : "30s",
    "queue_size" : -1
  },
  "rollup_indexing" : {
    "type" : "fixed",
    "size" : 4,
    "queue_size" : 4
  },
  "warmer" : {
    "type" : "scaling",
    "core" : 1,
    "max" : 5,
    "keep_alive" : "5m",
    "queue_size" : -1
  },
  "search" : {
    "type" : "fixed_auto_queue_size",
    "size" : 85,
    "queue_size" : 1000
  },
  "ccr" : {
    "type" : "fixed",
    "size" : 32,
    "queue_size" : 100
  },
  "flush" : {
    "type" : "scaling",
    "core" : 1,
    "max" : 5,
    "keep_alive" : "5m",
    "queue_size" : -1
  },
  "fetch_shard_store" : {
    "type" : "scaling",
    "core" : 1,
    "max" : 112,
    "keep_alive" : "5m",
    "queue_size" : -1
  },
  "management" : {
    "type" : "scaling",
    "core" : 1,
    "max" : 5,
    "keep_alive" : "5m",
    "queue_size" : -1
  },
  "get" : {
    "type" : "fixed",
    "size" : 56,
    "queue_size" : 1000
  },
  "analyze" : {
    "type" : "fixed",
    "size" : 1,
    "queue_size" : 16
  },
  "write" : {
    "type" : "fixed",
    "size" : 56,
    "queue_size" : 200
  },
  "snapshot" : {
    "type" : "scaling",
    "core" : 1,
    "max" : 5,
    "keep_alive" : "5m",
    "queue_size" : -1
  },
  "search_throttled" : {
    "type" : "fixed_auto_queue_size",
    "size" : 1,
    "queue_size" : 100
  }
}
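In case it helps compare these configured limits against live usage: the _cat/thread_pool and node stats APIs report per-pool activity and total JVM thread counts per node. A sketch (the endpoint is a placeholder; point it at any node's HTTP port):

```shell
# Placeholder endpoint -- substitute one of the actual nodes.
ES_URL="http://localhost:9200"

# Per-pool active/queued/rejected counts on every node:
curl -s "${ES_URL}/_cat/thread_pool?v&h=node_name,name,active,queue,rejected,largest" \
  || echo "no cluster reachable at ${ES_URL}"

# Total JVM thread count per node, to watch against the OS limits:
curl -s "${ES_URL}/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.threads" \
  || echo "no cluster reachable at ${ES_URL}"
```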