I have a cluster with 4 nodes: 2 master nodes and 2 data nodes. Currently only one master is responding; the other nodes connect for about 10 minutes and shard allocation starts, then the cluster becomes unavailable (NA) again. I am not sure what the issue is.

[2023-07-23T22:22:11,855][TRACE][o.e.d.HandshakingTransportAddressConnector] [ES-Master-2] [connectToRemoteMasterNode[10.182.5.175:9840]] opened probe connection
[2023-07-23T22:22:11,855][TRACE][o.e.d.HandshakingTransportAddressConnector] [ES-Master-2] [connectToRemoteMasterNode[10.182.5.175:9840]] handshake successful: {ES-Aggr-2}{sfViGaurRaecVCv2RfDTig}{SoRVzQJkQX6yx4JsMMzyig}{ES-Aggr-2}{10.182.5.175:9840}{cdfhilrstw}{ml.machine_memory=16495927296, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}
[2023-07-23T22:22:11,855][DEBUG][o.e.d.PeerFinder         ] [ES-Master-2] address [10.182.5.175:9840], node [null], requesting [false] connection failed
org.elasticsearch.transport.ConnectTransportException: [ES-Aggr-2][10.182.5.175:9840] non-master-eligible node found
	at org.elasticsearch.discovery.HandshakingTransportAddressConnector$1$1.innerOnResponse(HandshakingTransportAddressConnector.java:115) [elasticsearch-7.16.2.jar:7.16.2]
	at org.elasticsearch.discovery.HandshakingTransportAddressConnector$1$1.innerOnResponse(HandshakingTransportAddressConnector.java:103) [elasticsearch-7.16.2.jar:7.16.2]
	at org.elasticsearch.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:29) [elasticsearch-7.16.2.jar:7.16.2]
	at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:101) [elasticsearch-7.16.2.jar:7.16.2]
	at org.elasticsearch.transport.TransportService.lambda$handshake$9(TransportService.java:577) [elasticsearch-7.16.2.jar:7.16.2]
	at org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:219) [elasticsearch-7.16.2.jar:7.16.2]
	at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:43) [elasticsearch-7.16.2.jar:7.16.2]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1471) [elasticsearch-7.16.2.jar:7.16.2]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1471) [elasticsearch-7.16.2.jar:7.16.2]
	at org.elasticsearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:340) [elasticsearch-7.16.2.jar:7.16.2]
	at org.elasticsearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:328) [elasticsearch-7.16.2.jar:7.16.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.16.2.jar:7.16.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
[2023-07-23T22:22:13,464][TRACE][o.e.d.PeerFinder         ] [ES-Master-2] startProbe(10.182.5.183:9840) not probing local node
[2023-07-23T22:22:13,636][TRACE][o.e.d.PeerFinder         ] [ES-Master-2] address [10.182.4.184:9840], node [{ES-Master-1}{GW9UYH3tSR2LiMUTedmPuw}{o_TwPom5THemHV1O4YbmzA}{ES-Master-1}{10.182.4.184:9840}{cdfhilmrstw}{ml.machine_memory=16495927296, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}], requesting [false] requesting peers
[2023-07-23T22:22:13,636][TRACE][o.e.d.PeerFinder         ] [ES-Master-2] probing master nodes from cluster state: nodes: 
   {ES-Master-1}{GW9UYH3tSR2LiMUTedmPuw}{o_TwPom5THemHV1O4YbmzA}{ES-Master-1}{10.182.4.184:9840}{cdfhilmrstw}{ml.machine_memory=16495927296, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}, master
   {ES-Master-2}{nfO_NdY6QdK5PWI1-Qj6Ew}{9ajB76fiTUiXnmp0RzZNug}{ES-Master-2}{10.182.5.183:9840}{cdfhilmrstw}{ml.machine_memory=16495927296, xpack.installed=true, transform.node=true, ml.max_open_jobs=512, ml.max_jvm_size=2147483648}, local
   {ES-Aggr-2}{sfViGaurRaecVCv2RfDTig}{SoRVzQJkQX6yx4JsMMzyig}{ES-Aggr-2}{10.182.5.175:9840}{cdfhilrstw}{ml.machine_memory=16495927296, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}
   {ES-Aggr-1}{beiqGC2KTeuwlx4ZOvkulQ}{0py7SdmPS6iXVIdFVF3oiQ}{ES-Aggr-1}{10.182.4.170:9840}{cdfhilrstw}{ml.machine_memory=16495927296, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}

[2023-07-23T22:22:13,636][TRACE][o.e.d.PeerFinder         ] [ES-Master-2] startProbe(10.182.5.183:9840) not probing local node
[2023-07-23T22:22:13,636][TRACE][o.e.d.SeedHostsResolver  ] [ES-Master-2] resolved host [ES-Master-1] to [10.182.4.184:9840]
[2023-07-23T22:22:13,637][TRACE][o.e.d.SeedHostsResolver  ] [ES-Master-2] resolved host [ES-Master-2] to [10.182.5.183:9840]
[2023-07-23T22:22:13,637][TRACE][o.e.d.SeedHostsResolver  ] [ES-Master-2] resolved host [ES-Aggr-1] to [10.182.4.170:9840]
[2023-07-23T22:22:13,637][TRACE][o.e.d.SeedHostsResolver  ] [ES-Master-2] resolved host [ES-Aggr-2] to [10.182.5.175:9840]
[2023-07-23T22:22:13,637][TRACE][o.e.d.PeerFinder         ] [ES-Master-2] probing resolved transport addresses [10.182.4.184:9840, 10.182.4.170:9840, 10.182.5.175:9840]
[2023-07-23T22:22:13,637][TRACE][o.e.d.PeerFinder         ] [ES-Master-2] address [10.182.4.170:9840], node [null], requesting [false] attempting connection
[2023-07-23T22:22:13,637][TRACE][o.e.d.PeerFinder         ] [ES-Master-2] address [10.182.5.175:9840], node [null], requesting [false] attempting connection
[2023-07-23T22:22:13,637][TRACE][o.e.d.HandshakingTransportAddressConnector] [ES-Master-2] [connectToRemoteMasterNode[10.182.4.170:9840]] opening probe connection
[2023-07-23T22:22:13,637][TRACE][o.e.d.HandshakingTransportAddressConnector] [ES-Master-2] [connectToRemoteMasterNode[10.182.5.175:9840]] opening probe connection
[2023-07-23T22:22:13,638][TRACE][o.e.d.PeerFinder         ] [ES-Master-2] address [10.182.4.184:9840], node [{ES-Master-1}{GW9UYH3tSR2LiMUTedmPuw}{o_TwPom5THemHV1O4YbmzA}{ES-Master-1}{10.182.4.184:9840}{cdfhilmrstw}{ml.machine_memory=16495927296, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}], requesting [true] received PeersResponse{masterNode=Optional[{ES-Master-1}{GW9UYH3tSR2LiMUTedmPuw}{o_TwPom5THemHV1O4YbmzA}{ES-Master-1}{10.182.4.184:9840}{cdfhilmrstw}{ml.machine_memory=16495927296, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}], knownPeers=[], term=2357}
[2023-07-23T22:22:13,651][TRACE][o.e.d.HandshakingTransportAddressConnector] [ES-Master-2] [connectToRemoteMasterNode[10.182.4.170:9840]] opened probe connection
[2023-07-23T22:22:13,651][TRACE][o.e.d.HandshakingTransportAddressConnector] [ES-Master-2] [connectToRemoteMasterNode[10.182.5.175:9840]] opened probe connection
[2023-07-23T22:22:13,728][TRACE][o.e.d.HandshakingTransportAddressConnector] [ES-Master-2] [connectToRemoteMasterNode[10.182.5.175:9840]] handshake successful: {ES-Aggr-2}{sfViGaurRaecVCv2RfDTig}{SoRVzQJkQX6yx4JsMMzyig}{ES-Aggr-2}{10.182.5.175:9840}{cdfhilrstw}{ml.machine_memory=16495927296, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=2147483648, transform.node=true}
[2023-07-23T22:22:13,729][DEBUG][o.e.d.PeerFinder         ] [ES-Master-2] address [10.182.5.175:9840], node [null], requesting [false] connection failed
org.elasticsearch.transport.ConnectTransportException: [ES-Aggr-2][10.182.5.175:9840] non-master-eligible node found
	at org.elasticsearch.discovery.HandshakingTransportAddressConnector$1$1.innerOnResponse(HandshakingTransportAddressConnector.java:115) [elasticsearch-7.16.2.jar:7.16.2]
	at org.elasticsearch.discovery.HandshakingTransportAddressConnector$1$1.innerOnResponse(HandshakingTransportAddressConnector.java:103) [elasticsearch-7.16.2.jar:7.16.2]
	at org.elasticsearch.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:29) [elasticsearch-7.16.2.jar:7.16.2]
	at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:101) [elasticsearch-7.16.2.jar:7.16.2]
	at org.elasticsearch.transport.TransportService.lambda$handshake$9(TransportService.java:577) [elasticsearch-7.16.2.jar:7.16.2]
	at org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:219) [elasticsearch-7.16.2.jar:7.16.2]
	at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:43) [elasticsearch-7.16.2.jar:7.16.2]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1471) [elasticsearch-7.16.2.jar:7.16.2]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1471) [elasticsearch-7.16.2.jar:7.16.2]
	at org.elasticsearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:340) [elasticsearch-7.16.2.jar:7.16.2]
	at org.elasticsearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:328) [elasticsearch-7.16.2.jar:7.16.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.16.2.jar:7.16.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]

You have 2 dedicated master nodes and 2 data nodes? With only 2 master nodes you have no resilience: if one of the dedicated masters stops working, the entire cluster will go down.

To have resilience you need at least 3 master-eligible nodes.
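As a sketch, a third master-eligible node could be added with a minimal config like the one below. The node name `ES-Master-3` and the paths/ports are assumptions modelled on the existing files in this thread, not settings from the actual cluster:

```yaml
# elasticsearch.yml for a hypothetical third master-eligible node (ES-Master-3).
# Name, host, paths, and ports are assumptions based on the configs shown above.
cluster.name: icon-es
node.name: ES-Master-3
node.master: true
node.data: false            # dedicated master: carries no shard data
network.host: ES-Master-3
http.port: 9720
transport.tcp.port: 9840
# Seed only the master-eligible nodes:
discovery.seed_hosts: ["ES-Master-1", "ES-Master-2", "ES-Master-3"]
# Note: do NOT set cluster.initial_master_nodes on a node that is
# joining an already-formed cluster.
```

With 3 master-eligible nodes the cluster can still elect a master if any single one of them fails.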

Please look at my master node's elasticsearch.yml file below and advise what configuration changes I can make...

# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: icon-es
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: ES-Master-1
node.master: true
node.data: true
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /mnt/els/ES-Master-1/data
#
# Path to log files:
#
path.logs: /mnt/els/ES-Master-1/logs
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: ES-Master-1
#
# Set a custom port for HTTP:
#
http.port: 9720
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
#discovery.seed_hosts: ["host1", "host2"]
discovery.seed_hosts: ["ES-Master-1", "ES-Master-2", "ES-Aggr-1", "ES-Aggr-2"]
#
# Bootstrap the cluster using an initial set of master-eligible nodes:
#
#cluster.initial_master_nodes: ["node-1", "node-2"]
cluster.initial_master_nodes: ["ES-Master-1", "ES-Master-2"]
cluster.routing.allocation.node_concurrent_incoming_recoveries: 200
cluster.routing.allocation.node_concurrent_recoveries: 200
cluster.routing.allocation.node_initial_primaries_recoveries: 200
#
# For more information, consult the discovery and cluster formation module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
logger.org.elasticsearch.cluster.coordination.ClusterBootstrapService: TRACE
logger.org.elasticsearch.discovery: TRACE
thread_pool.write.queue_size: 1000
transport.tcp.port: 9840
#SSL--------------------------------------------
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.keystore.type: PKCS12
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: /opt/apm/elasticsearch/config/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: /opt/apm/elasticsearch/config/elastic-certificates.p12
xpack.security.transport.ssl.truststore.type: PKCS12
xpack.security.http.ssl.enabled: true
######
#xpack.security.http.ssl.keystore.type: PKCS12
#xpack.security.http.ssl.keystore.path: /opt/apm/elasticsearch/config/elastic-certificates.p12
#xpack.security.http.ssl.truststore.path: /opt/apm/elasticsearch/config/elastic-certificates.p12
#xpack.security.http.ssl.truststore.type: PKCS12
#xpack.security.http.ssl.client_authentication: optional
###network.publish_host: ES-Master-1
#######
xpack.security.http.ssl.keystore.type: PKCS12
xpack.security.http.ssl.keystore.path: /opt/apm/elasticsearch/config/wildcard_cev_vic_edu_au.p12
xpack.security.http.ssl.truststore.path: /opt/apm/elasticsearch/config/wildcard_cev_vic_edu_au.p12
xpack.security.http.ssl.truststore.type: PKCS12
xpack.security.http.ssl.client_authentication: optional
#
discovery.zen.minimum_master_nodes: 2
path.repo: /mnt/els/els-snapshots

Are both the nodes Es-Master-1 and Es-Master-2 running? You need both of them to be running, and you will also need to add a new master node.

There is not much else to do: you need to bring the node that is not running back online, and after that add a new master node to have resilience.

Also, with the configuration above the Es-Master-1 node is not a dedicated master; it is also a data node. Is it the same for Es-Master-2? And what does the elasticsearch.yml for the Es-Aggr nodes look like?

From the log you shared it seems that the nodes Es-Master-1 and Es-Master-2 are both master and data nodes, while the nodes Es-Aggr-1 and Es-Aggr-2 are data-only nodes.

The discovery.seed_hosts setting needs to contain only the master-eligible nodes; remove the Es-Aggr nodes from it if they are data-only.

Is there any reason you changed those recovery settings? The default value is 2; 200 is way too high and can heavily impact recoveries.
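For reference, if I recall the 7.x defaults correctly, explicitly setting them would look like this; simply deleting the three custom lines from elasticsearch.yml has the same effect:

```yaml
# Default recovery concurrency in Elasticsearch 7.x (removing the custom
# lines entirely is equivalent, since these are the built-in defaults).
cluster.routing.allocation.node_concurrent_incoming_recoveries: 2
cluster.routing.allocation.node_concurrent_recoveries: 2
cluster.routing.allocation.node_initial_primaries_recoveries: 4
```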

Please find the logs for the Aggr node below. FYI, this cluster configuration had been running fine for the last 2 years, so I think the master configuration is fine.

# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: icon-es
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: ES-Aggr-1
node.master: false
node.data: true
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /mnt/els/ES-Aggr-1/data
#
# Path to log files:
#
path.logs: /mnt/els/ES-Aggr-1/logs
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: ES-Aggr-1
#
# Set a custom port for HTTP:
#
http.port: 9720
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.seed_hosts: ["ES-Master-1", "ES-Master-2", "ES-Aggr-1", "ES-Aggr-2"]
#
# Bootstrap the cluster using an initial set of master-eligible nodes:
#
cluster.initial_master_nodes: ["ES-Master-1", "ES-Master-2"]
cluster.routing.allocation.node_concurrent_incoming_recoveries: 200
cluster.routing.allocation.node_concurrent_recoveries: 200
cluster.routing.allocation.node_initial_primaries_recoveries: 200
#
# For more information, consult the discovery and cluster formation module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
logger.org.elasticsearch.cluster.coordination.ClusterBootstrapService: TRACE
logger.org.elasticsearch.discovery: TRACE
thread_pool.write.queue_size: 1000
transport.tcp.port: 9840
#SSL----------------------------------------------------------------------------
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.keystore.type: PKCS12
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: /opt/apm/elasticsearch/config/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: /opt/apm/elasticsearch/config/elastic-certificates.p12
xpack.security.transport.ssl.truststore.type: PKCS12
xpack.security.http.ssl.enabled: true
#xpack.security.http.ssl.keystore.type: PKCS12
#xpack.security.http.ssl.keystore.path: /opt/apm/elasticsearch/config/elastic-certificates.p12
#xpack.security.http.ssl.truststore.path: /opt/apm/elasticsearch/config/elastic-certificates.p12
#xpack.security.http.ssl.truststore.type: PKCS12
#xpack.security.http.ssl.client_authentication: optional

xpack.security.http.ssl.keystore.type: PKCS12
xpack.security.http.ssl.keystore.path: /opt/apm/elasticsearch/config/wildcard_cev_vic_edu_au.p12
xpack.security.http.ssl.truststore.path: /opt/apm/elasticsearch/config/wildcard_cev_vic_edu_au.p12
xpack.security.http.ssl.truststore.type: PKCS12
xpack.security.http.ssl.client_authentication: optional
#xpack.security.http.ssl.certificate_authorities: [ "/opt/apm/elasticsearch/config/2021/DigiCertCA.crt" ]
path.repo: /mnt/els/els-snapshots

Sorry, not logs; that is the config file for the Aggr node.

Not sure what I could add beyond what was already said.

You need to have both of your master nodes running; if one of them is not working, you need to bring it back online.

Thanks for your quick response.

That's the issue I am facing: the other master comes up for 10 minutes and then goes down again...

The logs you shared do not show that; they only show that the Es-Master-2 node is looking for another master node but cannot find one.

You need to change this setting:

discovery.seed_hosts: ["ES-Master-1", "ES-Master-2", "ES-Aggr-1", "ES-Aggr-2"]

so that it includes only the master nodes. This needs to be changed on both master nodes.

Then start the Es-Master-1 and Es-Master-2 nodes and share the logs.
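As a sketch based on the node names in this thread, the corrected line on the master nodes would be:

```yaml
# Seed only the master-eligible nodes; data-only nodes do not need to be
# listed here, as they are discovered through the elected master.
discovery.seed_hosts: ["ES-Master-1", "ES-Master-2"]
```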

discovery.seed_hosts: ["ES-Master-1", "ES-Master-2", "ES-Aggr-1", "ES-Aggr-2"]

Should I change this setting on ES-Aggr-1 and ES-Aggr-2 as well?


It seems the cluster is somehow up and running now. I just want to understand how long shard distribution will take.
