We have a cluster of 4 nodes: 2 nodes are master + data nodes and the other 2 are data-only nodes. The configuration had been working fine for 2 years. Today we had to restart the cluster, and since then we have been getting a master not discovered exception.

Please find attached the logs for Master 1 and Master 2.

[2023-08-04T20:42:56,086][WARN ][r.suppressed             ] [ES-Master-2] path: /_license, params: {human=false}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
        at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:297) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:345) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:263) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:660) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.16.2.jar:7.16.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
[2023-08-04T20:42:58,435][WARN ][r.suppressed             ] [ES-Master-2] path: /_license, params: {human=false}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
        at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:297) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:345) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:263) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:660) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.16.2.jar:7.16.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
[2023-08-04T20:42:58,468][WARN ][r.suppressed             ] [ES-Master-2] path: /_license, params: {human=false}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
        at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:297) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:345) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:263) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:660) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.16.2.jar:7.16.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
[2023-08-04T20:45:14,601][TRACE][o.e.d.PeerFinder         ] [ES-Master-1] not active
[2023-08-04T20:45:14,674][TRACE][o.e.d.PeerFinder         ] [ES-Master-1] not active
[2023-08-04T20:45:16,121][WARN ][r.suppressed             ] [ES-Master-1] path: /_license, params: {human=false}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
        at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:297) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:345) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:263) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:660) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.16.2.jar:7.16.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
[2023-08-04T20:45:19,912][WARN ][r.suppressed             ] [ES-Master-1] path: /_license, params: {human=false}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
        at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:297) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:345) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:263) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:660) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.16.2.jar:7.16.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
[2023-08-04T20:45:21,332][WARN ][r.suppressed             ] [ES-Master-1] path: /_license, params: {human=false}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
        at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:297) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:345) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:263) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:660) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.16.2.jar:7.16.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
[2023-08-04T20:45:21,341][WARN ][r.suppressed             ] [ES-Master-1] path: /_license, params: {human=false}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
        at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:297) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:345) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:263) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:660) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.16.2.jar:7.16.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
[2023-08-04T20:45:23,447][WARN ][r.suppressed             ] [ES-Master-1] path: /_license, params: {human=false}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
        at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:297) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:345) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:263) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:660) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.16.2.jar:7.16.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]

I am not sure what this issue could be.

Here is the master node config file:

# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: icon-es
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: ES-Master-1
node.master: true
node.data: true
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /mnt/els/ES-Master-1/data
#
# Path to log files:
#
path.logs: /mnt/els/ES-Master-1/logs
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: ES-Master-1
#
# Set a custom port for HTTP:
#
http.port: 9720
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
#discovery.seed_hosts: ["host1", "host2"]
discovery.seed_hosts: ["ES-Master-1", "ES-Master-2", "ES-Aggr-1", "ES-Aggr-2"]
#
# Bootstrap the cluster using an initial set of master-eligible nodes:
#
#cluster.initial_master_nodes: ["node-1", "node-2"]
cluster.initial_master_nodes: ["ES-Master-1", "ES-Master-2"]
cluster.routing.allocation.node_concurrent_incoming_recoveries: 200
cluster.routing.allocation.node_concurrent_recoveries: 200
cluster.routing.allocation.node_initial_primaries_recoveries: 200
#
# For more information, consult the discovery and cluster formation module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
logger.org.elasticsearch.cluster.coordination.ClusterBootstrapService: TRACE
logger.org.elasticsearch.discovery: TRACE
thread_pool.write.queue_size: 1000
transport.tcp.port: 9840
#SSL--------------------------------------------
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.keystore.type: PKCS12
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: /opt/apm/elasticsearch/config/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: /opt/apm/elasticsearch/config/elastic-certificates.p12
xpack.security.transport.ssl.truststore.type: PKCS12
xpack.security.http.ssl.enabled: true
######
#xpack.security.http.ssl.keystore.type: PKCS12
#xpack.security.http.ssl.keystore.path: /opt/apm/elasticsearch/config/elastic-certificates.p12
#xpack.security.http.ssl.truststore.path: /opt/apm/elasticsearch/config/elastic-certificates.p12
#xpack.security.http.ssl.truststore.type: PKCS12
#xpack.security.http.ssl.client_authentication: optional
###network.publish_host: ES-Master-1
#######
xpack.security.http.ssl.keystore.type: PKCS12
xpack.security.http.ssl.keystore.path: /opt/apm/elasticsearch/config/wildcard_cev_vic_edu_au.p12
xpack.security.http.ssl.truststore.path: /opt/apm/elasticsearch/config/wildcard_cev_vic_edu_au.p12
xpack.security.http.ssl.truststore.type: PKCS12
xpack.security.http.ssl.client_authentication: optional
#
discovery.zen.minimum_master_nodes: 2
path.repo: /mnt/els/els-snapshots

Here is the data node config file:

# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: icon-es
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: ES-Aggr-1
node.master: false
node.data: true
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /mnt/els/ES-Aggr-1/data
#
# Path to log files:
#
path.logs: /mnt/els/ES-Aggr-1/logs
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: ES-Aggr-1
#
# Set a custom port for HTTP:
#
http.port: 9720
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.seed_hosts: ["ES-Master-1", "ES-Master-2", "ES-Aggr-1", "ES-Aggr-2"]
#
# Bootstrap the cluster using an initial set of master-eligible nodes:
#
cluster.initial_master_nodes: ["ES-Master-1", "ES-Master-2"]
cluster.routing.allocation.node_concurrent_incoming_recoveries: 200
cluster.routing.allocation.node_concurrent_recoveries: 200
cluster.routing.allocation.node_initial_primaries_recoveries: 200
#
# For more information, consult the discovery and cluster formation module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
logger.org.elasticsearch.cluster.coordination.ClusterBootstrapService: TRACE
logger.org.elasticsearch.discovery: TRACE
thread_pool.write.queue_size: 1000
transport.tcp.port: 9840
#SSL----------------------------------------------------------------------------
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.keystore.type: PKCS12
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: /opt/apm/elasticsearch/config/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: /opt/apm/elasticsearch/config/elastic-certificates.p12
xpack.security.transport.ssl.truststore.type: PKCS12
xpack.security.http.ssl.enabled: true
#xpack.security.http.ssl.keystore.type: PKCS12
#xpack.security.http.ssl.keystore.path: /opt/apm/elasticsearch/config/elastic-certificates.p12
#xpack.security.http.ssl.truststore.path: /opt/apm/elasticsearch/config/elastic-certificates.p12
#xpack.security.http.ssl.truststore.type: PKCS12
#xpack.security.http.ssl.client_authentication: optional

xpack.security.http.ssl.keystore.type: PKCS12
xpack.security.http.ssl.keystore.path: /opt/apm/elasticsearch/config/wildcard_cev_vic_edu_au.p12
xpack.security.http.ssl.truststore.path: /opt/apm/elasticsearch/config/wildcard_cev_vic_edu_au.p12
xpack.security.http.ssl.truststore.type: PKCS12
xpack.security.http.ssl.client_authentication: optional
#xpack.security.http.ssl.certificate_authorities: [ "/opt/apm/elasticsearch/config/2021/DigiCertCA.crt" ]
path.repo: /mnt/els/els-snapshots

Hello,

You had a similar issue a couple of days ago where it was suggested to remove the data nodes from the discovery.seed_hosts setting and also to use the default values for the allocation settings; it was this issue.

But it seems that you didn't change anything.

Also, you are using non-default ports; if I'm not wrong, you need to specify them in discovery.seed_hosts.

Your discovery.seed_hosts needs to look like this on both of your masters; change it and try to restart them again.

discovery.seed_hosts: ["ES-Master-1:9840", "ES-Master-2:9840"]

@leandrojmp it worked fine for a few days with the same old config. We also have this setting on the data nodes; what should we put there?

The same config: discovery.seed_hosts needs to contain only the masters. Also, since your cluster is already formed, you need to remove cluster.initial_master_nodes as well; it is only needed for the first cluster bootstrap.
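Putting those suggestions together, the discovery section on every node (masters and data nodes alike) would look something like the sketch below. The hostnames and the 9840 transport port are taken from the configs posted above:

```yaml
# --------------------------------- Discovery ----------------------------------
# Seed hosts: master-eligible nodes only, with the custom transport port
discovery.seed_hosts: ["ES-Master-1:9840", "ES-Master-2:9840"]
# cluster.initial_master_nodes removed entirely: it is only used the very first
# time a brand-new cluster bootstraps, and must not be set on an existing cluster
#cluster.initial_master_nodes: ["ES-Master-1", "ES-Master-2"]
```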


I have added all the configuration changes you suggested, but I am still getting the same errors, so I'm not sure that was causing the issue. Could it be a split-brain issue?

Here are the logs after running with the settings you suggested.

[2023-08-05T15:02:00,159][WARN ][o.e.g.PersistedClusterStateService] [ES-Master-2] writing cluster state took [41417ms] which is above the warn threshold of [10s]; wrote full state with [263] indices
[2023-08-05T15:02:00,168][WARN ][o.e.c.c.ClusterFormationFailureHelper] [ES-Master-2] master not discovered or elected yet, an election requires a node with id [nfO_NdY6QdK5PWI1-Qj6Ew], have discovered possible quorum [{ES-Master-2}{nfO_NdY6QdK5PWI1-Qj6Ew}{gVtvzT2dRne3pwx57wB1YQ}{ES-Master-2}{10.182.5.183:9840}{cdfhilmrstw}]; discovery will continue using [10.182.4.184:9840] from hosts providers and [{ES-Master-1}{GW9UYH3tSR2LiMUTedmPuw}{oNU2lvt9SKSAfKo8OPsCkA}{ES-Master-1}{10.182.4.184:9840}{cdfhilmrstw}, {ES-Master-2}{nfO_NdY6QdK5PWI1-Qj6Ew}{gVtvzT2dRne3pwx57wB1YQ}{ES-Master-2}{10.182.5.183:9840}{cdfhilmrstw}] from last-known cluster state; node term 4615, last-accepted version 3760882 in term 4615
[2023-08-05T15:02:00,169][INFO ][o.e.c.s.ClusterApplierService] [ES-Master-2] master node changed {previous [], current [{ES-Master-2}{nfO_NdY6QdK5PWI1-Qj6Ew}{gVtvzT2dRne3pwx57wB1YQ}{ES-Master-2}{10.182.5.183:9840}{cdfhilmrstw}]}, term: 4615, version: 3760882, reason: Publication{term=4615, version=3760882}
[2023-08-05T15:02:00,169][INFO ][o.e.c.c.C.CoordinatorPublication] [ES-Master-2] after [41.3s] publication of cluster state version [3760882] is still waiting for {ES-Master-1}{GW9UYH3tSR2LiMUTedmPuw}{oNU2lvt9SKSAfKo8OPsCkA}{ES-Master-1}{10.182.4.184:9840}{cdfhilmrstw}{ml.machine_memory=16495927296, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true} [SENT_PUBLISH_REQUEST], {ES-Aggr-1}{beiqGC2KTeuwlx4ZOvkulQ}{yRgOXpKwRtmuqcf27evWbw}{ES-Aggr-1}{10.182.4.170:9840}{cdfhilrstw}{ml.machine_memory=16495927296, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true} [SENT_PUBLISH_REQUEST], {ES-Aggr-2}{sfViGaurRaecVCv2RfDTig}{dzPxvKZRQ0Cuewc5K1ES-w}{ES-Aggr-2}{10.182.5.175:9840}{cdfhilrstw}{ml.machine_memory=16495927296, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true} [SENT_PUBLISH_REQUEST]
[2023-08-05T15:02:00,781][WARN ][o.e.c.c.C.CoordinatorPublication] [ES-Master-2] after [41.9s] publication of cluster state version [3760882] is still waiting for {ES-Master-1}{GW9UYH3tSR2LiMUTedmPuw}{oNU2lvt9SKSAfKo8OPsCkA}{ES-Master-1}{10.182.4.184:9840}{cdfhilmrstw}{ml.machine_memory=16495927296, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true} [SENT_APPLY_COMMIT]
[2023-08-05T15:02:15,762][WARN ][o.e.c.InternalClusterInfoService] [ES-Master-2] failed to retrieve stats for node [nfO_NdY6QdK5PWI1-Qj6Ew]: [ES-Master-2][10.182.5.183:9840][cluster:monitor/nodes/stats[n]] request_id [707608] timed out after [15005ms]
[2023-08-05T15:02:15,764][WARN ][o.e.c.InternalClusterInfoService] [ES-Master-2] failed to retrieve shard stats from node [nfO_NdY6QdK5PWI1-Qj6Ew]: [ES-Master-2][10.182.5.183:9840][indices:monitor/stats[n]] request_id [707610] timed out after [15005ms]
[2023-08-05T15:02:32,544][WARN ][o.e.x.s.t.n.SecurityNetty4HttpServerTransport] [ES-Master-2] http client did not trust this server's certificate, closing connection Netty4HttpChannel{localAddress=/10.182.5.183:9720, remoteAddress=/10.182.0.12:59098}
[2023-08-05T15:02:32,544][WARN ][o.e.x.s.t.n.SecurityNetty4HttpServerTransport] [ES-Master-2] http client did not trust this server's certificate, closing connection Netty4HttpChannel{localAddress=/10.182.5.183:9720, remoteAddress=/10.182.0.12:59097}
[2023-08-05T15:02:34,913][WARN ][o.e.x.s.t.n.SecurityNetty4HttpServerTransport] [ES-Master-2] http client did not trust this server's certificate, closing connection Netty4HttpChannel{localAddress=/10.182.5.183:9720, remoteAddress=/10.182.0.12:59100}
[2023-08-05T15:02:35,372][WARN ][o.e.x.s.t.n.SecurityNetty4HttpServerTransport] [ES-Master-2] http client did not trust this server's certificate, closing connection Netty4HttpChannel{localAddress=/10.182.5.183:9720, remoteAddress=/10.182.0.12:59101}
[2023-08-05T15:02:44,016][WARN ][o.e.t.TransportService   ] [ES-Master-2] Received response for a request that has timed out, sent [12.9m/774782ms] ago, timed out [12.6m/759777ms] ago, action [indices:monitor/stats[n]], node [{ES-Master-2}{nfO_NdY6QdK5PWI1-Qj6Ew}{gVtvzT2dRne3pwx57wB1YQ}{ES-Master-2}{10.182.5.183:9840}{cdfhilmrstw}{ml.machine_memory=16495927296, xpack.installed=true, transform.node=true, ml.max_open_jobs=512, ml.max_jvm_size=2147483648}], id [703255]
[2023-08-05T15:02:44,018][WARN ][o.e.g.PersistedClusterStateService] [ES-Master-2] writing cluster state took [37935ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [262] unchanged indices
[2023-08-05T15:02:44,018][TRACE][o.e.d.PeerFinder         ] [ES-Master-2] activating with nodes:
   {ES-Master-1}{GW9UYH3tSR2LiMUTedmPuw}{oNU2lvt9SKSAfKo8OPsCkA}{ES-Master-1}{10.182.4.184:9840}{cdfhilmrstw}{ml.machine_memory=16495927296, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}
   {ES-Aggr-1}{beiqGC2KTeuwlx4ZOvkulQ}{yRgOXpKwRtmuqcf27evWbw}{ES-Aggr-1}{10.182.4.170:9840}{cdfhilrstw}{ml.machine_memory=16495927296, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}
   {ES-Aggr-2}{sfViGaurRaecVCv2RfDTig}{dzPxvKZRQ0Cuewc5K1ES-w}{ES-Aggr-2}{10.182.5.175:9840}{cdfhilrstw}{ml.machine_memory=16495927296, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}
   {ES-Master-2}{nfO_NdY6QdK5PWI1-Qj6Ew}{gVtvzT2dRne3pwx57wB1YQ}{ES-Master-2}{10.182.5.183:9840}{cdfhilmrstw}{ml.machine_memory=16495927296, xpack.installed=true, transform.node=true, ml.max_open_jobs=512, ml.max_jvm_size=2147483648}, local, master

[2023-08-05T15:02:44,018][TRACE][o.e.d.PeerFinder         ] [ES-Master-2] probing master nodes from cluster state: nodes:
   {ES-Master-1}{GW9UYH3tSR2LiMUTedmPuw}{oNU2lvt9SKSAfKo8OPsCkA}{ES-Master-1}{10.182.4.184:9840}{cdfhilmrstw}{ml.machine_memory=16495927296, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}
   {ES-Aggr-1}{beiqGC2KTeuwlx4ZOvkulQ}{yRgOXpKwRtmuqcf27evWbw}{ES-Aggr-1}{10.182.4.170:9840}{cdfhilrstw}{ml.machine_memory=16495927296, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}
   {ES-Aggr-2}{sfViGaurRaecVCv2RfDTig}{dzPxvKZRQ0Cuewc5K1ES-w}{ES-Aggr-2}{10.182.5.175:9840}{cdfhilrstw}{ml.machine_memory=16495927296, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}
   {ES-Master-2}{nfO_NdY6QdK5PWI1-Qj6Ew}{gVtvzT2dRne3pwx57wB1YQ}{ES-Master-2}{10.182.5.183:9840}{cdfhilmrstw}{ml.machine_memory=16495927296, xpack.installed=true, transform.node=true, ml.max_open_jobs=512, ml.max_jvm_size=2147483648}, local, master

[2023-08-05T15:02:44,018][TRACE][o.e.d.PeerFinder         ] [ES-Master-2] address [10.182.4.184:9840], node [null], requesting [false] attempting connection
[2023-08-05T15:02:44,018][TRACE][o.e.d.PeerFinder         ] [ES-Master-2] startProbe(10.182.5.183:9840) not probing local node
[2023-08-05T15:02:44,019][INFO ][o.e.c.s.ClusterApplierService] [ES-Master-2] master node changed {previous [{ES-Master-2}{nfO_NdY6QdK5PWI1-Qj6Ew}{gVtvzT2dRne3pwx57wB1YQ}{ES-Master-2}{10.182.5.183:9840}{cdfhilmrstw}], current []}, term: 4615, version: 3760883, reason: becoming candidate: Publication.onCompletion(false)
[2023-08-05T15:02:44,019][TRACE][o.e.d.HandshakingTransportAddressConnector] [ES-Master-2] [connectToRemoteMasterNode[10.182.4.184:9840]] opening probe connection
[2023-08-05T15:02:44,020][WARN ][o.e.c.s.MasterService    ] [ES-Master-2] failing [shard-started StartedShardEntry{shardId [[.ds-metricbeat-mgmttech1--2023.04.20-000001][1]], allocationId [QXc6g0vVREmyvBvfZzJ0zw], primary term [160], message [after peer recovery]}[StartedShardEntry{shardId [[.ds-metricbeat-mgmttech1--2023.04.20-000001][1]], allocationId [QXc6g0vVREmyvBvfZzJ0zw], primary term [160], message [after peer recovery]}]]: failed to commit cluster state version [3760884]
org.elasticsearch.cluster.coordination.FailedToCommitClusterStateException: publication failed
        at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$4.onFailure(Coordinator.java:1772) ~[elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListenerDirectly(ListenableFuture.java:115) ~[elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:55) ~[elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication.onCompletion(Coordinator.java:1679) ~[elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.coordination.Publication.onPossibleCompletion(Publication.java:114) ~[elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.coordination.Publication.cancel(Publication.java:78) ~[elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.coordination.Coordinator$CoordinatorPublication$2.run(Coordinator.java:1630) ~[elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) ~[elasticsearch-7.16.2.jar:7.16.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.elasticsearch.ElasticsearchException: publication cancelled before committing: timed out after 30s
        at org.elasticsearch.cluster.coordination.Publication.cancel(Publication.java:75) ~[elasticsearch-7.16.2.jar:7.16.2]
        ... 5 more
[2023-08-05T15:02:44,419][TRACE][o.e.d.HandshakingTransportAddressConnector] [ES-Master-2] [connectToRemoteMasterNode[10.182.4.184:9840]] opened probe connection

It's still not coming up.

The problem is here, in this log line: "writing cluster state took [41417ms] which is above the warn threshold of [10s]". Your storage is so slow that it appears to be broken. You need to work out why it's so slow.

Are you using some kind of networked storage?
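One quick way to sanity-check the data path (not something suggested in the thread, just a generic diagnostic sketch) is to time a synchronous write on the path.data mount, which roughly mirrors what Elasticsearch does when it persists cluster state:

```shell
# Point DATA_PATH at the real path.data mount (e.g. /mnt/els/ES-Master-1/data);
# /tmp here is only a placeholder default for illustration.
DATA_PATH="${DATA_PATH:-/tmp}"
# conv=fdatasync forces the data to disk before dd reports, so the printed
# throughput reflects real storage speed rather than the page cache. A healthy
# local disk finishes a 64 MB write in well under a second; anything in the
# tens-of-seconds range would explain the 41s cluster-state write in the logs.
dd if=/dev/zero of="$DATA_PATH/es-disk-test" bs=1M count=64 conv=fdatasync
rm -f "$DATA_PATH/es-disk-test"
```

If the numbers look bad only on the networked mount and fine on a local disk, that points at the storage layer rather than Elasticsearch itself.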

Note that the discovery.zen.minimum_master_nodes setting in your config has been deprecated in Elasticsearch 7.x and is ignored.

Also note that running with 2 master-eligible nodes does not give you any high availability, as both nodes need to be available in order to elect a master. You need 3 master-eligible nodes to be able to handle one node failing without ending up with a red cluster. Please see the official documentation for further details.
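As an illustration only (this exact change was not prescribed in the thread), one way to reach 3 master-eligible nodes with the existing hardware would be to make one of the data nodes master-eligible; the node name below is taken from the data node config posted earlier:

```yaml
# Hypothetical sketch: promote an existing data node to master-eligible
node.name: ES-Aggr-1
node.master: true
node.data: true
# All three master-eligible nodes would then be listed on every node
discovery.seed_hosts: ["ES-Master-1:9840", "ES-Master-2:9840", "ES-Aggr-1:9840"]
```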

These things are true, and worth addressing in due course, but just to clarify: fixing them won't help with the immediate problem. Please focus on the storage performance issue.


These are the latest logs from Master 1. I'm not sure if it's still related to performance only; the same storage was working fine for the last two years. @DavidTurner can you please elaborate on what I need to do?

[2023-08-05T18:55:36,805][WARN ][r.suppressed             ] [ES-Master-1] path: /_bulk, params: {}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];
        at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:179) ~[elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:635) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:481) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$2.onTimeout(TransportBulkAction.java:669) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:345) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:263) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:660) [elasticsearch-7.16.2.jar:7.16.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.16.2.jar:7.16.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
[The same ClusterBlockException (`blocked by: [SERVICE_UNAVAILABLE/2/no master]`) with an identical stack trace repeats for further `/_bulk` requests at 18:55:37,037, 18:55:37,362, 18:55:37,625, 18:55:39,437, 18:55:40,659 and 18:55:40,911.]

@DavidTurner we are using EFS as the storage for the data, and we currently have about 6 TB. Could that be the issue?

Oh yes, EFS would explain it; it can be dreadfully slow at times. See these docs for more information.

@DavidTurner can you recommend what I should do now? Should I keep the config the same as I shared above?

You need to fix your dreadfully slow EFS storage: either provision more IOPS, or move your data to something that performs adequately (e.g. EBS).
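One quick way to see for yourself how slow the storage is is to time small synchronous writes on the data path, which is the I/O pattern that cluster-state persistence depends on. This is a rough sketch, not an official benchmark; `DATA_DIR` is a placeholder, so point it at your actual `path.data` (the EFS mount) when you run it for real:

```shell
# Crude latency probe: 1000 x 4 KiB writes, each synced to disk (oflag=dsync).
# DATA_DIR is a placeholder; set it to your Elasticsearch path.data
# (the EFS mount) to measure the storage that actually matters.
DATA_DIR="${DATA_DIR:-$(mktemp -d)}"
time dd if=/dev/zero of="$DATA_DIR/latency-probe" bs=4k count=1000 oflag=dsync
rm -f "$DATA_DIR/latency-probe"
# Healthy local SSDs finish this in well under a second;
# heavily throttled networked storage can take minutes.
```

If this takes more than a few seconds on the EFS mount, that alone explains 30-second cluster-state publication timeouts.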

@DavidTurner are the config.yml files correct for now? Apart from the storage, do I need to change anything there?

If you fix your storage then the cluster should come back to life. Some of your config is questionable (see Christian's earlier message, for example), but none of that matters right now.

But how come it was working fine for almost two years and is only now giving us this headache? Could it be the data load, since it has grown to 6 TB? We used to clean it up every month, but for the last 6 months we haven't performed the cleanup.

And one more thing: do the changes recommended by leandrojmp still need to be made?

No idea. Nothing relevant has changed in Elasticsearch in a long time, so it must be something in your environment.

Yes, but again, they won't fix your storage performance.