Elasticsearch cluster data nodes are being removed from cluster by timeout

Vadym · September 30, 2021, 10:38pm

Hi everyone,

I have an issue while deleting rather big 800Gb Elasticsearch index. Shortly after index deletion data nodes are getting disconnected, cluster loses all data nodes and switches into "red" state for ~10 minutes, after which cluster are getting reassembled and nodes reconnect back, shards getting reinitialized.

Have anyone encountered similar issues?

[2021-09-30T01:25:20,651][INFO ][o.e.m.j.JvmGcMonitorService] [esd03] [gc][3781315] overhead, spent [262ms] collecting in the last [1s]
[2021-09-30T02:37:29,098][INFO ][o.e.c.c.Coordinator      ] [esd03] master node [{esm01}{6KMwfRfVT5WpVOFOyZSgDQ}{9H72u7mXR4OYrd6ymCF70Q}{10.136.246.20}{10.136.246.20:9300}{m}{xpack.installed=true, transform.node=false}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{esm01}{6KMwfRfVT5WpVOFOyZSgDQ}{9H72u7mXR4OYrd6ymCF70Q}{10.136.246.20}{10.136.246.20:9300}{m}{xpack.installed=true, transform.node=false}] failed [3] consecutive checks
	at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:293) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1181) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1181) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:277) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:224) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:275) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:267) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:131) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:89) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:700) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:142) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:117) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:82) ~[elasticsearch-7.10.0.jar:7.10.0]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:271) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]

warkolm · September 30, 2021, 10:40pm

Welcome to our community!

What is the output from the _cluster/stats?pretty&human API?
What do the logs from one of the data nodes show?
Is ILM deleting the index?

Vadym · October 1, 2021, 4:25pm

Thanks Mark!

I'll collect the stats output as soon as I can reproduce this.

I'm attaching logs in the link below.

Index seems to be deleted properly, just data nodes become unresponsive and ejected from the cluster, leaving only master nodes running

https://paste.c-net.org/KidderStripe

Vadym · October 1, 2021, 4:33pm

Index is being deleted manually, by curl

curl -X DELETE "http://${ENDPOINT}/${INDEX_NAME}?master_timeout=5m&timeout=5m"

Vadym · October 1, 2021, 4:38pm

Here is stats output (this is during normal operation, I'll catch it while deleting when I'll reproduce it)

{
  "_nodes" : {
    "total" : 23,
    "successful" : 23,
    "failed" : 0
  },
  "cluster_name" : "cluster",
  "cluster_uuid" : "vGdelUhbRwaz8qK1XE5bgQ",
  "timestamp" : 1633106182931,
  "status" : "green",
  "indices" : {
    "count" : 2,
    "shards" : {
      "total" : 122,
      "primaries" : 31,
      "replication" : 2.935483870967742,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 120,
          "avg" : 61.0
        },
        "primaries" : {
          "min" : 1,
          "max" : 30,
          "avg" : 15.5
        },
        "replication" : {
          "min" : 1.0,
          "max" : 3.0,
          "avg" : 2.0
        }
      }
    },
    "docs" : {
      "count" : 80127376,
      "deleted" : 35877714
    },
    "store" : {
      "size" : "706.7gb",
      "size_in_bytes" : 758838143435,
      "reserved" : "0b",
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size" : "405.4mb",
      "memory_size_in_bytes" : 425110020,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "279.5mb",
      "memory_size_in_bytes" : 293159533,
      "total_count" : 402389086,
      "hit_count" : 6318560,
      "miss_count" : 396070526,
      "cache_size" : 27897,
      "cache_count" : 330063,
      "evictions" : 302166
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 3662,
      "memory" : "179.2mb",
      "memory_in_bytes" : 187926716,
      "terms_memory" : "110.4mb",
      "terms_memory_in_bytes" : 115828576,
      "stored_fields_memory" : "2.1mb",
      "stored_fields_memory_in_bytes" : 2238080,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "13.9mb",
      "norms_memory_in_bytes" : 14630848,
      "points_memory" : "0b",
      "points_memory_in_bytes" : 0,
      "doc_values_memory" : "52.6mb",
      "doc_values_memory_in_bytes" : 55229212,
      "index_writer_memory" : "3.6gb",
      "index_writer_memory_in_bytes" : 3869803080,
      "version_map_memory" : "5mb",
      "version_map_memory_in_bytes" : 5332753,
      "fixed_bit_set" : "0b",
      "fixed_bit_set_memory_in_bytes" : 0,
      "max_unsafe_auto_id_timestamp" : -1,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "boolean",
          "count" : 12,
          "index_count" : 1
        },
        {
          "name" : "date",
          "count" : 63,
          "index_count" : 1
        },
        {
          "name" : "double",
          "count" : 3,
          "index_count" : 1
        },
        {
          "name" : "float",
          "count" : 2,
          "index_count" : 1
        },
        {
          "name" : "integer",
          "count" : 10,
          "index_count" : 1
        },
        {
          "name" : "keyword",
          "count" : 175,
          "index_count" : 2
        },
        {
          "name" : "long",
          "count" : 62,
          "index_count" : 1
        },
        {
          "name" : "object",
          "count" : 37,
          "index_count" : 1
        },
        {
          "name" : "text",
          "count" : 139,
          "index_count" : 2
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [
        {
          "name" : "custom",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [
        {
          "name" : "keyword",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_filters" : [
        {
          "name" : "lowercase",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_analyzers" : [
        {
          "name" : "standard",
          "count" : 2,
          "index_count" : 1
        }
      ]
    }
  },
  "nodes" : {
    "count" : {
      "total" : 23,
      "coordinating_only" : 3,
      "data" : 15,
      "data_cold" : 15,
      "data_content" : 15,
      "data_hot" : 15,
      "data_warm" : 15,
      "ingest" : 0,
      "master" : 5,
      "ml" : 0,
      "remote_cluster_client" : 0,
      "transform" : 15,
      "voting_only" : 0
    },
    "versions" : [
      "7.10.0"
    ],
    "os" : {
      "available_processors" : 560,
      "allocated_processors" : 560,
      "names" : [
        {
          "name" : "Linux",
          "count" : 23
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Red Hat Enterprise Linux",
          "count" : 23
        }
      ],
      "mem" : {
        "total" : "11.3tb",
        "total_in_bytes" : 12436978905088,
        "free" : "9tb",
        "free_in_bytes" : 9998055317504,
        "used" : "2.2tb",
        "used_in_bytes" : 2438923587584,
        "free_percent" : 80,
        "used_percent" : 20
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 341
      },
      "open_file_descriptors" : {
        "min" : 809,
        "max" : 1651,
        "avg" : 1142
      }
    },
    "jvm" : {
      "max_uptime" : "45.4d",
      "max_uptime_in_millis" : 3926854995,
      "versions" : [
        {
          "version" : "15.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "15.0.1+9",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 23
        }
      ],
      "mem" : {
        "heap_used" : "285.9gb",
        "heap_used_in_bytes" : 307041506480,
        "heap_max" : "690gb",
        "heap_max_in_bytes" : 740881858560
      },
      "threads" : 2476
    },
    "fs" : {
      "total" : "13.4tb",
      "total_in_bytes" : 14810389426176,
      "free" : "12.6tb",
      "free_in_bytes" : 13905305006080,
      "available" : "12.6tb",
      "available_in_bytes" : 13905305006080
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 23
      },
      "http_types" : {
        "security4" : 23
      }
    },
    "discovery_types" : {
      "zen" : 23
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "rpm",
        "count" : 23
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 2,
      "processor_stats" : {
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        }
      }
    }
  }
}

leandrojmp · October 1, 2021, 4:41pm

Where are you running your cluster? On-premises? Cloud Services?

I had an issue with timeout where cluster would randomly leave the cluster for a couple of seconds, enough time to make the status go to red.

DavidTurner · October 1, 2021, 4:49pm

[running applier [org.elasticsearch.indices.cluster.IndicesClusterStateService@33e8dc00]] took [738687ms]

That'd explain it, it's taking well over 12 minutes to apply a cluster state. It might be interesting to see what those nodes are doing in that time (using JFR or even just jstack) but my guess would be egregiously slow disks.

Vadym · October 1, 2021, 4:56pm

On prem physical hardware. It's red for about ~10 minutes until rebuilt.

Vadym · October 1, 2021, 4:59pm

I'll try to collect. Disks are pcie ssd in raid. Nodes are queried heavily all the time.

Drives are RAID 1 2x HP 2TB
2.5-inch SFF SAS Read-Intensive (RI)
12Gb/s Solid State Drive (SSD)
In HP Hot-Plug SmartDrive Carrier (SC)

Also I'm deleting the index soon after new index has been built, if it can affect anything.

Vadym · October 1, 2021, 8:39pm

Would you suggest to adjust any jvm settings or something else to speed up the deletion process? Or maybe enable more verbose logging? I run the same dataset on the cluster with similar hardware with 3x less nodes and deletion happens there in seconds without causing timeouts.

DavidTurner · October 1, 2021, 9:00pm

No, I wouldn't suggest changing anything (nor do I think there's anything to change in this case). Just get some more details about what Elasticsearch is doing for those 12 minutes please.

Vadym · October 1, 2021, 11:32pm

Hi David,

Which specific measurements/flags can be helpful?

Here is jstat from non-prod and prod env (similar hardware)

[TEST ENV]
sudo /usr/share/elasticsearch/jdk/bin/jstat -gc 9647
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT    CGC    CGCT     GCT
 0.0   49152.0  0.0   37339.1 19775488.0 11108352.0 11632640.0 2061896.0  114096.0 109325.9 14744.0 13652.1 105865 5247.674   0      0.000 2552    15.387 5263.060
[PROD ENV]sudo /usr/share/elasticsearch/jdk/bin/jstat -gc 2533
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT    CGC    CGCT     GCT
 0.0   540672.0  0.0   540672.0 19283968.0 15679488.0 11632640.0 2255570.0  115280.0 110346.6 14848.0 13605.3 324931 10933.996   0      0.000  810     3.073 10937.068

DavidTurner · October 2, 2021, 4:59am

As I said:

Vadym · October 12, 2021, 6:18pm

While we are figuring this out, is it possible to set timeouts to 15 minutes + so nodes become unresponsive but don’t get kicked from the cluster?

in this case once deletion is done they should report back

and we save 10 minutes on reinitializing the shards

Vadym · October 13, 2021, 6:41pm

I.e. aside of checking jvm activity I plan to do is to temporary increase the cluster.publish.timeout to something like 15 minutes, to allow cluster stay up even if nodes aren't reporting for this period of time, can it create any issues?

Vadym · October 14, 2021, 1:27am

Hi Mark,

Do you think closing index before the deletion can have any positive impact? Or other potential tweaks which can make index deletion slower but safe?

DavidTurner · October 14, 2021, 4:48am

It's best not to start changing things without understanding the problem first. Once you've shared the info needed to diagnose what's wrong we can start to think about remedies.

Vadym · October 14, 2021, 10:43pm

I agree, waiting for it to repeat. JFR for some reason doesn't dump to file though, getting:

/usr/share/elasticsearch/jdk/bin/jcmd 9647 JFR.start name=elastic duration=60s filename=/root/flight.jfr
9647:
Could not start recording, not able to write to file /root/flight.jfr. /root/flight.jfr

which seems to be misleading, file is writable, I can also just JFR.start and recording is going on but can't access it.

Will use jstack if this won't work out.

system · November 11, 2021, 10:44pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Sudden data loss! Elasticsearch	15	5433	July 6, 2017
Getting Timeout Exceptions with Elasticsearch Elasticsearch	12	10313	July 6, 2017
Data node constantly dropping out of the cluster Elasticsearch	19	7289	October 2, 2019
Timeouts while deleting Elasticsearch	27	9461	July 6, 2017
Clustering - removing index from second node in cluster leads to bad state / full sync? Elasticsearch	3	345	July 6, 2017

Elasticsearch cluster data nodes are being removed from cluster by timeout

Related topics