Elasticsearch cluster data nodes are being removed from cluster by timeout

Hi everyone,

I have an issue while deleting rather big 800Gb Elasticsearch index. Shortly after index deletion data nodes are getting disconnected, cluster loses all data nodes and switches into "red" state for ~10 minutes, after which cluster are getting reassembled and nodes reconnect back, shards getting reinitialized.

Have anyone encountered similar issues?

[2021-09-30T01:25:20,651][INFO ][o.e.m.j.JvmGcMonitorService] [esd03] [gc][3781315] overhead, spent [262ms] collecting in the last [1s]
[2021-09-30T02:37:29,098][INFO ][o.e.c.c.Coordinator      ] [esd03] master node [{esm01}{6KMwfRfVT5WpVOFOyZSgDQ}{9H72u7mXR4OYrd6ymCF70Q}{10.136.246.20}{10.136.246.20:9300}{m}{xpack.installed=true, transform.node=false}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{esm01}{6KMwfRfVT5WpVOFOyZSgDQ}{9H72u7mXR4OYrd6ymCF70Q}{10.136.246.20}{10.136.246.20:9300}{m}{xpack.installed=true, transform.node=false}] failed [3] consecutive checks
	at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:293) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1181) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1181) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:277) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:224) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:275) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:267) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:131) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:89) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:700) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:142) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:117) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:82) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:271) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]

Welcome to our community! :smiley:

What is the output from the _cluster/stats?pretty&human API?
What do the logs from one of the data nodes show?
Is ILM deleting the index?

Thanks Mark!

I'll collect the stats output as soon as I can reproduce this.

I'm attaching logs in the link below.

Index seems to be deleted properly, just data nodes become unresponsive and ejected from the cluster, leaving only master nodes running :slight_smile:

https://paste.c-net.org/KidderStripe

Index is being deleted manually, by curl

curl -X DELETE "http://${ENDPOINT}/${INDEX_NAME}?master_timeout=5m&timeout=5m"

Here is stats output (this is during normal operation, I'll catch it while deleting when I'll reproduce it)

{
  "_nodes" : {
    "total" : 23,
    "successful" : 23,
    "failed" : 0
  },
  "cluster_name" : "cluster",
  "cluster_uuid" : "vGdelUhbRwaz8qK1XE5bgQ",
  "timestamp" : 1633106182931,
  "status" : "green",
  "indices" : {
    "count" : 2,
    "shards" : {
      "total" : 122,
      "primaries" : 31,
      "replication" : 2.935483870967742,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 120,
          "avg" : 61.0
        },
        "primaries" : {
          "min" : 1,
          "max" : 30,
          "avg" : 15.5
        },
        "replication" : {
          "min" : 1.0,
          "max" : 3.0,
          "avg" : 2.0
        }
      }
    },
    "docs" : {
      "count" : 80127376,
      "deleted" : 35877714
    },
    "store" : {
      "size" : "706.7gb",
      "size_in_bytes" : 758838143435,
      "reserved" : "0b",
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size" : "405.4mb",
      "memory_size_in_bytes" : 425110020,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "279.5mb",
      "memory_size_in_bytes" : 293159533,
      "total_count" : 402389086,
      "hit_count" : 6318560,
      "miss_count" : 396070526,
      "cache_size" : 27897,
      "cache_count" : 330063,
      "evictions" : 302166
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 3662,
      "memory" : "179.2mb",
      "memory_in_bytes" : 187926716,
      "terms_memory" : "110.4mb",
      "terms_memory_in_bytes" : 115828576,
      "stored_fields_memory" : "2.1mb",
      "stored_fields_memory_in_bytes" : 2238080,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "13.9mb",
      "norms_memory_in_bytes" : 14630848,
      "points_memory" : "0b",
      "points_memory_in_bytes" : 0,
      "doc_values_memory" : "52.6mb",
      "doc_values_memory_in_bytes" : 55229212,
      "index_writer_memory" : "3.6gb",
      "index_writer_memory_in_bytes" : 3869803080,
      "version_map_memory" : "5mb",
      "version_map_memory_in_bytes" : 5332753,
      "fixed_bit_set" : "0b",
      "fixed_bit_set_memory_in_bytes" : 0,
      "max_unsafe_auto_id_timestamp" : -1,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "boolean",
          "count" : 12,
          "index_count" : 1
        },
        {
          "name" : "date",
          "count" : 63,
          "index_count" : 1
        },
        {
          "name" : "double",
          "count" : 3,
          "index_count" : 1
        },
        {
          "name" : "float",
          "count" : 2,
          "index_count" : 1
        },
        {
          "name" : "integer",
          "count" : 10,
          "index_count" : 1
        },
        {
          "name" : "keyword",
          "count" : 175,
          "index_count" : 2
        },
        {
          "name" : "long",
          "count" : 62,
          "index_count" : 1
        },
        {
          "name" : "object",
          "count" : 37,
          "index_count" : 1
        },
        {
          "name" : "text",
          "count" : 139,
          "index_count" : 2
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [
        {
          "name" : "custom",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [
        {
          "name" : "keyword",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_filters" : [
        {
          "name" : "lowercase",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_analyzers" : [
        {
          "name" : "standard",
          "count" : 2,
          "index_count" : 1
        }
      ]
    }
  },
  "nodes" : {
    "count" : {
      "total" : 23,
      "coordinating_only" : 3,
      "data" : 15,
      "data_cold" : 15,
      "data_content" : 15,
      "data_hot" : 15,
      "data_warm" : 15,
      "ingest" : 0,
      "master" : 5,
      "ml" : 0,
      "remote_cluster_client" : 0,
      "transform" : 15,
      "voting_only" : 0
    },
    "versions" : [
      "7.10.0"
    ],
    "os" : {
      "available_processors" : 560,
      "allocated_processors" : 560,
      "names" : [
        {
          "name" : "Linux",
          "count" : 23
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Red Hat Enterprise Linux",
          "count" : 23
        }
      ],
      "mem" : {
        "total" : "11.3tb",
        "total_in_bytes" : 12436978905088,
        "free" : "9tb",
        "free_in_bytes" : 9998055317504,
        "used" : "2.2tb",
        "used_in_bytes" : 2438923587584,
        "free_percent" : 80,
        "used_percent" : 20
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 341
      },
      "open_file_descriptors" : {
        "min" : 809,
        "max" : 1651,
        "avg" : 1142
      }
    },
    "jvm" : {
      "max_uptime" : "45.4d",
      "max_uptime_in_millis" : 3926854995,
      "versions" : [
        {
          "version" : "15.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "15.0.1+9",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 23
        }
      ],
      "mem" : {
        "heap_used" : "285.9gb",
        "heap_used_in_bytes" : 307041506480,
        "heap_max" : "690gb",
        "heap_max_in_bytes" : 740881858560
      },
      "threads" : 2476
    },
    "fs" : {
      "total" : "13.4tb",
      "total_in_bytes" : 14810389426176,
      "free" : "12.6tb",
      "free_in_bytes" : 13905305006080,
      "available" : "12.6tb",
      "available_in_bytes" : 13905305006080
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 23
      },
      "http_types" : {
        "security4" : 23
      }
    },
    "discovery_types" : {
      "zen" : 23
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "rpm",
        "count" : 23
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 2,
      "processor_stats" : {
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        }
      }
    }
  }
}

Where are you running your cluster? On-premises? Cloud Services?

I had an issue with timeout where cluster would randomly leave the cluster for a couple of seconds, enough time to make the status go to red.

[running applier [org.elasticsearch.indices.cluster.IndicesClusterStateService@33e8dc00]] took [738687ms]

That'd explain it, it's taking well over 12 minutes to apply a cluster state. It might be interesting to see what those nodes are doing in that time (using JFR or even just jstack) but my guess would be egregiously slow disks.

On prem physical hardware. It's red for about ~10 minutes until rebuilt.

I'll try to collect. Disks are pcie ssd in raid. Nodes are queried heavily all the time.

Drives are RAID 1 2x HP 2TB
2.5-inch SFF SAS Read-Intensive (RI)
12Gb/s Solid State Drive (SSD)
In HP Hot-Plug SmartDrive Carrier (SC)

Also I'm deleting the index soon after new index has been built, if it can affect anything.

Would you suggest to adjust any jvm settings or something else to speed up the deletion process? Or maybe enable more verbose logging? I run the same dataset on the cluster with similar hardware with 3x less nodes and deletion happens there in seconds without causing timeouts.

No, I wouldn't suggest changing anything (nor do I think there's anything to change in this case). Just get some more details about what Elasticsearch is doing for those 12 minutes please.

Hi David,

Which specific measurements/flags can be helpful?

Here is jstat from non-prod and prod env (similar hardware)

[TEST ENV]
sudo /usr/share/elasticsearch/jdk/bin/jstat -gc 9647
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT    CGC    CGCT     GCT
 0.0   49152.0  0.0   37339.1 19775488.0 11108352.0 11632640.0 2061896.0  114096.0 109325.9 14744.0 13652.1 105865 5247.674   0      0.000 2552    15.387 5263.060
[PROD ENV]sudo /usr/share/elasticsearch/jdk/bin/jstat -gc 2533
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT    CGC    CGCT     GCT
 0.0   540672.0  0.0   540672.0 19283968.0 15679488.0 11632640.0 2255570.0  115280.0 110346.6 14848.0 13605.3 324931 10933.996   0      0.000  810     3.073 10937.068

As I said:

While we are figuring this out, is it possible to set timeouts to 15 minutes + so nodes become unresponsive but don’t get kicked from the cluster?

in this case once deletion is done they should report back

and we save 10 minutes on reinitializing the shards

I.e. aside of checking jvm activity I plan to do is to temporary increase the cluster.publish.timeout to something like 15 minutes, to allow cluster stay up even if nodes aren't reporting for this period of time, can it create any issues?

Hi Mark,

Do you think closing index before the deletion can have any positive impact? Or other potential tweaks which can make index deletion slower but safe?

It's best not to start changing things without understanding the problem first. Once you've shared the info needed to diagnose what's wrong we can start to think about remedies.

I agree, waiting for it to repeat. JFR for some reason doesn't dump to file though, getting:

/usr/share/elasticsearch/jdk/bin/jcmd 9647 JFR.start name=elastic duration=60s filename=/root/flight.jfr
9647:
Could not start recording, not able to write to file /root/flight.jfr. /root/flight.jfr

which seems to be misleading, file is writable, I can also just JFR.start and recording is going on but can't access it.

Will use jstack if this won't work out.