Nodes crashing with "unable to create native thread" error

I have an Elasticsearch cluster that had some nodes crash with "java.lang.OutOfMemoryError: unable to create native thread". I found some other posts here with the same error, but nothing useful yet.

The cluster runs Elasticsearch 7.1.1 with 28 data nodes, 4 master nodes, and 4 client-only nodes spread across four physical CentOS 7 hosts (7 data, 1 master, and 1 client node on each). Each host has 512GB RAM and 28 cores, and each data node has its own 800GB RAID set. JVM heap is set to 31GB, 4GB, and 8GB for data, master, and client nodes respectively, and the thread pool limits haven't been modified from their defaults.
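
For reference, the heap split above can be confirmed from what the nodes themselves report; this is the standard _cat/nodes call I use to double-check the 31g/4g/8g layout (adjust the host/port for your setup):

    # Per-node role and max heap
    curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,heap.max'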

Data-wise, there are 2,122 indices, 25,369 shards, ~45 billion docs, and 15 TB of data in total. About 500 million docs come in each day, with roughly the same number deleted, so the total size is fairly stable.

Nine of the 28 data nodes, spread across two of the hosts, crashed within about 2 seconds of each other with the same error. If the nodes were simply running into resource limits "naturally", it seems very odd that they would hit those limits nearly simultaneously on two different physical machines.

I was able to start the nodes back up, but I need to find the root cause before it happens again. I ran the same data set on a 6-VM cluster (3 master, 3 client, 12 data nodes) with no problems for months, so the current hardware should, if anything, have more headroom.

Below are the error, the process limits, and the thread pool settings.

[2020-01-25T03:02:11,541][WARN ][i.n.c.AbstractChannelHandlerContext] [xxxxxxxels107-1] An exception 'java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
        at java.lang.Thread.start0(Native Method) ~[?:?]
        at java.lang.Thread.start(Thread.java:804) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1354) ~[?:?]
        at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:98) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.transport.TcpTransport.handleRequest(TcpTransport.java:1036) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:922) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:753) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:53) ~[transport-netty4-client-7.1.1.jar:7.1.1]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) [netty-codec-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:656) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:556) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) [netty-common-4.1.32.Final.jar:4.1.32.Final]
        at java.lang.Thread.run(Thread.java:835) [?:?]

Process limits for one of the data nodes:
    cat /proc/53839/limits 
    Limit                     Soft Limit           Hard Limit           Units     
    Max cpu time              unlimited            unlimited            seconds   
    Max file size             unlimited            unlimited            bytes     
    Max data size             unlimited            unlimited            bytes     
    Max stack size            8388608              unlimited            bytes     
    Max core file size        0                    unlimited            bytes     
    Max resident set          unlimited            unlimited            bytes     
    Max processes             4096                 4096                 processes 
    Max open files            65535                65535                files     
    Max locked memory         65536                65536                bytes     
    Max address space         unlimited            unlimited            bytes     
    Max file locks            unlimited            unlimited            locks     
    Max pending signals       2061963              2061963              signals   
    Max msgqueue size         819200               819200               bytes     
    Max nice priority         0                    0                    
    Max realtime priority     0                    0                    
    Max realtime timeout      unlimited            unlimited            us        
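
The line that jumps out at me is "Max processes" = 4096. As I understand it, that limit (RLIMIT_NPROC) is counted per user rather than per process, so the threads of all nine Elasticsearch nodes on a host running as the same user count against it together. If that turns out to be the cause, a systemd override along these lines should raise it (a sketch only; the unit name below is just an example, since we use per-node units):

    # Raise the per-user process/thread limit for the Elasticsearch service
    # (unit name is an example; adjust for each per-node unit)
    mkdir -p /etc/systemd/system/elasticsearch.service.d
    printf '[Service]\nLimitNPROC=65536\n' > /etc/systemd/system/elasticsearch.service.d/override.conf
    systemctl daemon-reload
    systemctl restart elasticsearch.service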


Thread pool settings:
      "thread_pool" : {
    "watcher" : {
      "type" : "fixed",
      "size" : 56,
      "queue_size" : 1000
    },
    "force_merge" : {
      "type" : "fixed",
      "size" : 1,
      "queue_size" : -1
    },
    "security-token-key" : {
      "type" : "fixed",
      "size" : 1,
      "queue_size" : 1000
    },
    "fetch_shard_started" : {
      "type" : "scaling",
      "core" : 1,
      "max" : 112,
      "keep_alive" : "5m",
      "queue_size" : -1
    },
    "listener" : {
      "type" : "fixed",
      "size" : 10,
      "queue_size" : -1
    },
    "refresh" : {
      "type" : "scaling",
      "core" : 1,
      "max" : 10,
      "keep_alive" : "5m",
      "queue_size" : -1
    },
    "generic" : {
      "type" : "scaling",
      "core" : 4,
      "max" : 224,
      "keep_alive" : "30s",
      "queue_size" : -1
    },
    "rollup_indexing" : {
      "type" : "fixed",
      "size" : 4,
      "queue_size" : 4
    },
    "warmer" : {
      "type" : "scaling",
      "core" : 1,
      "max" : 5,
      "keep_alive" : "5m",
      "queue_size" : -1
    },
    "search" : {
      "type" : "fixed_auto_queue_size",
      "size" : 85,
      "queue_size" : 1000
    },
    "ccr" : {
      "type" : "fixed",
      "size" : 32,
      "queue_size" : 100
    },
    "flush" : {
      "type" : "scaling",
      "core" : 1,
      "max" : 5,
      "keep_alive" : "5m",
      "queue_size" : -1
    },
    "fetch_shard_store" : {
      "type" : "scaling",
      "core" : 1,
      "max" : 112,
      "keep_alive" : "5m",
      "queue_size" : -1
    },
    "management" : {
      "type" : "scaling",
      "core" : 1,
      "max" : 5,
      "keep_alive" : "5m",
      "queue_size" : -1
    },
    "get" : {
      "type" : "fixed",
      "size" : 56,
      "queue_size" : 1000
    },
    "analyze" : {
      "type" : "fixed",
      "size" : 1,
      "queue_size" : 16
    },
    "write" : {
      "type" : "fixed",
      "size" : 56,
      "queue_size" : 200
    },
    "snapshot" : {
      "type" : "scaling",
      "core" : 1,
      "max" : 5,
      "keep_alive" : "5m",
      "queue_size" : -1
    },
    "search_throttled" : {
      "type" : "fixed_auto_queue_size",
      "size" : 1,
      "queue_size" : 100
    }
      }
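
In case it helps, a couple of standard checks that should show how close the nodes get to that per-user process limit (nothing cluster-specific, just the nodes stats API and ps):

    # Current and peak thread count per node, as reported by the JVM
    curl -s 'localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.threads&pretty'

    # Total threads across all java processes on this host
    # (all nine ES nodes here run as the same user)
    ps -o nlwp= -C java | awk '{sum += $1} END {print sum}'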

So after some additional crashes, I've been able to correlate them with someone running queries that appear to error out with "too many buckets".

Is there something I can tweak so that the query fails and the cluster doesn't?
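
What I have in mind is the search.max_buckets cluster setting, which I believe is the soft limit behind the "too many buckets" error; something like this, with 5000 just as an example value below the 10,000 default:

    # Lower the aggregation bucket limit so offending queries fail earlier
    curl -s -X PUT 'localhost:9200/_cluster/settings' \
      -H 'Content-Type: application/json' \
      -d '{"persistent": {"search.max_buckets": 5000}}'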
