8-node setup, 2 nodes not joining the cluster

We are running two instances of Elasticsearch 6.6.1 on each of 4 hosts, forming an 8-node cluster. When we check cluster health, the output shows only 6 nodes; 2 nodes are not joining the cluster, and we see the errors below in the cluster logs. Please help us resolve this.
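For reference, we check the node count with the cluster health API, roughly like this (the HTTP port and any credentials here are placeholders for our actual values):

# Query cluster health on any node; in this setup number_of_nodes should report 8
curl -s 'http://10.47.91.107:9200/_cluster/health?pretty'

# List the nodes the elected master currently knows about
curl -s 'http://10.47.91.107:9200/_cat/nodes?v'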

In the master node's cluster log:

[2022-08-10T09:31:35,990][WARN ][o.e.d.z.ZenDiscovery     ] [10.47.91.107_NODE-0] failed to validate incoming join request from node [{10.47.91.109_NODE-1}{hxAzLEDRQCOj-e8O7COl1Q}{0PFEpimfSZuos3Hct_kJdQ}{10.47.91.109}{10.47.91.109:7003}{ml.machine_memory=84290285568, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}]
org.elasticsearch.ElasticsearchTimeoutException: java.util.concurrent.TimeoutException: Timeout waiting for task.
                at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:62) ~[elasticsearch-6.6.1.jar:6.6.1]
                at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:32) ~[elasticsearch-6.6.1.jar:6.6.1]
                at org.elasticsearch.discovery.zen.MembershipAction.sendValidateJoinRequestBlocking(MembershipAction.java:106) ~[elasticsearch-6.6.1.jar:6.6.1]
                at org.elasticsearch.discovery.zen.ZenDiscovery.handleJoinRequest(ZenDiscovery.java:888) [elasticsearch-6.6.1.jar:6.6.1]
                at org.elasticsearch.discovery.zen.ZenDiscovery$MembershipListener.onJoin(ZenDiscovery.java:1135) [elasticsearch-6.6.1.jar:6.6.1]
                at org.elasticsearch.discovery.zen.MembershipAction$JoinRequestRequestHandler.messageReceived(MembershipAction.java:142) [elasticsearch-6.6.1.jar:6.6.1]
                at org.elasticsearch.discovery.zen.MembershipAction$JoinRequestRequestHandler.messageReceived(MembershipAction.java:138) [elasticsearch-6.6.1.jar:6.6.1]
                at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) [elasticsearch-6.6.1.jar:6.6.1]
                at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:250) [x-pack-security-6.6.1.jar:6.6.1]
                at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.6.1.jar:6.6.1]
                at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:308) [x-pack-security-6.6.1.jar:6.6.1]

Cluster logs of the nodes that have not joined the cluster:

[2022-08-10T09:23:00,778][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [10.47.91.109_NODE-0] timed out while retrying [indices:admin/create] after failure (timeout [1m])
[2022-08-10T09:23:00,839][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [10.237.92.109_NODE-0] no known master node, scheduling a retry
[2022-08-10T09:23:11,812][INFO ][o.e.d.z.ZenDiscovery     ] [10.47.91.109_NODE-0] failed to send join request to master [{10.47.91.107_NODE-0}{0VL8bYl2TGunp4X_qNKTxw}{yoWIdXPNS4iEXO0_9OnyDQ}{10.47.91.107}{10.47.91.107:7001}{ml.machine_memory=84290285568, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2022-08-10T09:23:11,819][WARN ][o.e.d.z.UnicastZenPing   ] [10.47.91.109_NODE-0] failed to resolve host ['']
java.net.UnknownHostException: '': Name or service not known
                at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) ~[?:1.8.0_211]
                at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929) ~[?:1.8.0_211]
                at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324) ~[?:1.8.0_211]
                at java.net.InetAddress.getAllByName0(InetAddress.java:1277) ~[?:1.8.0_211]
                at java.net.InetAddress.getAllByName(InetAddress.java:1193) ~[?:1.8.0_211]
                at java.net.InetAddress.getAllByName(InetAddress.java:1127) ~[?:1.8.0_211]
                at org.elasticsearch.transport.TcpTransport.parse(TcpTransport.java:550) ~[elasticsearch-6.6.1.jar:6.6.1]
                at org.elasticsearch.transport.TcpTransport.addressesFromString(TcpTransport.java:503) ~[elasticsearch-6.6.1.jar:6.6.1]
                at org.elasticsearch.transport.TransportService.addressesFromString(TransportService.java:738) ~[elasticsearch-6.6.1.jar:6.6.1]
                at org.elasticsearch.discovery.zen.UnicastZenPing.lambda$resolveHostsLists$0(UnicastZenPing.java:189) ~[elasticsearch-6.6.1.jar:6.6.1]
                at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_211]
                at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.6.1.jar:6.6.1]
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_211]
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_211]
                at java.lang.Thread.run(Thread.java:748) [?:1.8.0_211]
[2022-08-10T09:23:20,735][WARN ][r.suppressed             ] [10.47.91.109_NODE-0] path: /_bulk, params: {}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];
                at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:166) ~[elasticsearch-6.6.1.jar:6.6.1]
                at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:458) [elasticsearch-6.6.1.jar:6.6.1]
                at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:337) [elasticsearch-6.6.1.jar:6.6.1]
                at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.6.1.jar:6.6.1]

Welcome to our community! :smiley:

Elasticsearch 6.6.1 is very much past EOL and no longer supported; you need to upgrade as a matter of urgency.

The "failed to resolve host" error would suggest either a configuration or a DNS problem. What does your elasticsearch.yml look like?

Thanks for the reply. Which DNS problem are you referring to? Nothing has changed in the configuration; this setup has been running smoothly for the last one and a half years. The problem appeared suddenly this morning. We tried a couple of restarts, but no luck.

This is important please.

Below is my elasticsearch.yml file. Note that we have environment variables (e.g. ES_ZEN_HOSTS2, ES_TCP_PORT) defined in the elasticsearch.service file for the paths, IPs, and ports. Please suggest if anything is wrong in it.
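For illustration only, the variables are set via Environment= directives in the [Service] section of elasticsearch.service, roughly like the sketch below (the values shown are placeholders, not our real ones); the actual elasticsearch.yml follows.

# Hypothetical excerpt from elasticsearch.service -- values are placeholders
[Service]
Environment=ES_NODE_NAME=10.47.91.107_NODE-0
Environment=ES_TCP_PORT=7001
Environment=ES_HTTP_PORT=9201
Environment=ES_PATH_LOGS=/var/log/elasticsearch/node0
Environment=ES_ZEN_HOSTS1=10.47.91.107:7001
Environment=ES_ZEN_HOSTS2=10.47.91.107:7003
# ...and so on up to ES_ZEN_HOSTS8, one entry per node's transport address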


# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: home-cluster
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: ${ES_NODE_NAME}
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: ["/data1/elasticsearch${ES_PATH_DATA}","/data2/elasticsearch${ES_PATH_DATA}","/data3/elasticsearch${ES_PATH_DATA}"]
#
#
# Path to log files:
#
path.logs: ${ES_PATH_LOGS}
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
#
#----------------------------------Thread Pool --------------------------------
#
thread_pool.bulk.queue_size: 500
thread_pool.index.queue_size: 1000
#
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: 10.47.91.107
network.publish_host: 10.47.91.107
#
# Set a custom port for HTTP:
#
http.port: ${ES_HTTP_PORT}
transport.tcp.port: ${ES_TCP_PORT}
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
#discovery.zen.ping.unicast.hosts: ["host1", "host2"]
discovery.zen.ping.unicast.hosts: ["${ES_ZEN_HOSTS1:''}","${ES_ZEN_HOSTS2:''}","${ES_ZEN_HOSTS3:''}","${ES_ZEN_HOSTS4:''}","${ES_ZEN_HOSTS5:''}","${ES_ZEN_HOSTS6:''}","${ES_ZEN_HOSTS7:''}","${ES_ZEN_HOSTS8:''}"]
#
# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
#
discovery.zen.minimum_master_nodes: 5
#
# For more information, consult the zen discovery module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true

Changing these thread pool queue sizes is not a great idea.

Given network.host and network.publish_host are the same address, you only need network.host.

The discovery.zen.ping.unicast.hosts line is the issue. It's possible that one of the ES_ZEN_HOSTS* variables is not correctly populated, and so Elasticsearch is being told the hostname is blank, hence the failed to resolve host [''] error.
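One quick way to confirm whether the ES_ZEN_HOSTS* variables actually reach the process is to inspect the service and process environment, something like the following (the unit name elasticsearch.service and the PID are placeholders for your actual ones):

# Environment= entries that systemd passes to the unit
systemctl show elasticsearch.service --property=Environment

# Environment the running Elasticsearch JVM actually sees; an empty ES_ZEN_HOSTSn here
# would explain the failed to resolve host [''] warning
sudo cat /proc/<es-java-pid>/environ | tr '\0' '\n' | grep ES_ZEN_HOSTS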

This configuration has not changed in a long time, though. Ping and telnet work fine from each server, so there shouldn't be a connectivity issue. Let me change this to explicit IPs and ports and update you. Meanwhile, can you help us understand whether we are missing anything else?

Other than running an EOL version, no.

Still not working. We also tried explicit IPs in the discovery.zen.ping.unicast.hosts configuration. Now cluster health returns the error below:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "master_not_discovered_exception",
        "reason" : null
      }
    ],
    "type" : "master_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
}

Please share the logs.
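It may also help to ask each node directly who it currently thinks the master is (the HTTP ports and any credentials below are placeholders):

# On a node that is still in the cluster: should print the elected master
curl -s 'http://10.47.91.107:<http-port>/_cat/master?v'

# On the node that cannot join: will fail with master_not_discovered_exception while it has no master
curl -s 'http://10.47.91.109:<http-port>/_cat/master?v'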

It keeps writing the error below to the cluster log:

[2022-08-10T14:07:46,116][WARN ][o.e.d.z.ZenDiscovery     ] [10.47.91.107_NODE-0] failed to validate incoming join request from node [{10.47.91.109_NODE-0}{RQ50cjGaTJyDflmJqQKj5w}{2LGpDsFlThu_q6AxI3Kx1g}{10.47.91.109}{10.47.91.109:7001}{ml.machine_memory=84290285568, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}]
[2022-08-10T14:08:48,711][WARN ][o.e.d.z.ZenDiscovery     ] [10.47.91.107_NODE-0] failed to validate incoming join request from node [{10.47.91.109_NODE-1}{hxAzLEDRQCOj-e8O7COl1Q}{5FqECGZ7S1auxQAei1kSpg}{10.47.91.109}{10.47.91.109:7003}{ml.machine_memory=84290285568, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}]
[2022-08-10T14:08:49,120][WARN ][o.e.d.z.ZenDiscovery     ] [10.47.91.107_NODE-0] failed to validate incoming join request from node [{10.47.91.109_NODE-0}{RQ50cjGaTJyDflmJqQKj5w}{2LGpDsFlThu_q6AxI3Kx1g}{10.47.91.109}{10.47.91.109:7001}{ml.machine_memory=84290285568, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}]

They are warnings, not errors.

What do the logs on the 10.47.91.109 node show?

[2022-08-10T21:20:42,343][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [10.X.X.109_NODE-0] no known master node, scheduling a retry
[2022-08-10T21:20:58,640][INFO ][o.e.d.z.ZenDiscovery     ] [10.X.X.109_NODE-0] failed to send join request to master [{10.X.X.107_NODE-0}{0VL8bYl2TGunp4X_qNKTxw}{P6nJo-32Qdeh1kpCPbAy2Q}{10.X.X.107}{10.X.X.107:7001}{ml.machine_memory=84290285568, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2022-08-10T21:21:01,850][WARN ][r.suppressed             ] [10.X.X.109_NODE-0] path: /_bulk, params: {}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];
        at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:166) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:458) [elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:337) [elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$2.onTimeout(TransportBulkAction.java:492) [elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.6.1.jar:6.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_211]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_211]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_211]
                Suppressed: org.elasticsearch.discovery.MasterNotDiscoveredException
                at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) ~[elasticsearch-6.6.1.jar:6.6.1]
                at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.6.1.jar:6.6.1]
                at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.6.1.jar:6.6.1]
                at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.6.1.jar:6.6.1]
                at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.6.1.jar:6.6.1]
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_211]
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_211]
                at java.lang.Thread.run(Thread.java:748) [?:1.8.0_211]
[2022-08-10T21:21:06,888][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [10.X.X.109_NODE-0] timed out while retrying [indices:admin/create] after failure (timeout [1m])

IPs masked for privacy reasons.

Also observed TIME_WAIT connections in the netstat output:

netstat -anp | grep 7001

Output for the Elasticsearch transport port from the non-working node is below. Does this have anything to do with the issue?

tcp6       0      0 :::7001                 :::*                    LISTEN      44929/java          
tcp6       0      0 10.X.X.109:7001      10.X.X.107:13098     ESTABLISHED 44929/java          
tcp6       0      0 10.X.X.109:19534     10.X.X.108:7001      TIME_WAIT   -                   
tcp6       0      0 10.X.X.109:63270     10.X.X.107:7001      ESTABLISHED 44929/java          
tcp6       0      0 10.X.X.109:22391     10.X.X.110:7001      TIME_WAIT   -                   
tcp6       0      0 10.X.X.109:63272     10.X.X.107:7001      ESTABLISHED 44929/java          
tcp6       0      0 10.X.X.109:22253     10.X.X.110:7001      TIME_WAIT   -                   
tcp6       0      0 10.X.X.109:7001      10.X.X.107:13088     ESTABLISHED 44929/java          
tcp6       0      0 10.X.X.109:63274     10.X.X.107:7001      ESTABLISHED 44929/java          
tcp6       0      0 10.X.X.109:63264     10.X.X.107:7001      ESTABLISHED 44929/java          
tcp6       0      0 10.X.X.109:45774     10.X.X.107:7001      ESTABLISHED 45395/java          
tcp6       0      0 10.X.X.109:63268     10.X.X.107:7001      ESTABLISHED 44929/java          
tcp6       0      0 10.X.X.109:45754     10.X.X.107:7001      ESTABLISHED 45395/java          
tcp6       0      0 10.X.X.109:45760     10.X.X.107:7001      ESTABLISHED 45395/java          
tcp6       0      0 10.X.X.109:45764     10.X.X.107:7001      ESTABLISHED 45395/java          
tcp6       0      0 10.X.X.109:7001      10.X.X.107:13078     ESTABLISHED 44929/java          
tcp6       0      0 10.X.X.109:48629     10.X.X.107:7001      TIME_WAIT   -                   
tcp6       0      0 10.X.X.109:7001      10.X.X.107:13090     ESTABLISHED 44929/java          
tcp6       0      0 10.X.X.109:45770     10.X.X.107:7001      ESTABLISHED 45395/java          
tcp6       0      0 10.X.X.109:7001      10.X.X.107:13100     ESTABLISHED 44929/java          
tcp6       0      0 10.X.X.109:7001      10.X.X.107:13084     ESTABLISHED 44929/java          
tcp6       0      0 10.X.X.109:63266     10.X.X.107:7001      ESTABLISHED 44929/java          
tcp6       0      0 10.X.X.109:63282     10.X.X.107:7001      ESTABLISHED 44929/java          
tcp6       0      0 10.X.X.109:63278     10.X.X.107:7001      ESTABLISHED 44929/java          
tcp6       0      0 10.X.X.109:45772     10.X.X.107:7001      ESTABLISHED 45395/java          
tcp6       0      0 10.X.X.109:45762     10.X.X.107:7001      ESTABLISHED 45395/java          
tcp6       0      0 10.X.X.109:7001      10.X.X.107:13086     ESTABLISHED 44929/java          
tcp6       0      0 10.X.X.109:7001      10.X.X.107:13080     ESTABLISHED 44929/java

Also: 107 is the master, 109 is the node that is not joining, and 108 and 110 are the other nodes, which have joined the cluster. The netstat output from the 109 server, excluding connections to 107, is below:

netstat -anp| grep 7001 | grep -v  10.X.X.107
tcp6       0      0 :::7001                 :::*                    LISTEN      44929/java          
tcp6       0      0 10.X.X.109:46847     10.X.X.110:7001      TIME_WAIT   -                   
tcp6       0      0 10.X.X.109:43998     10.X.X.108:7001      TIME_WAIT   -                   
tcp6       0      0 10.X.X.109:52345     10.X.X.109:7001      TIME_WAIT   -                   
tcp6       0      0 10.X.X.109:43916     10.X.X.108:7001      TIME_WAIT   -                   
tcp6       0      0 10.X.X.109:46773     10.X.X.110:7001      TIME_WAIT   -  

Can you please help me out here? I am stuck on what to do next.

A tcpdump on the Elasticsearch transport port shows multiple TCP Retransmission and TCP Previous segment not captured packets between the master node and the node that has not joined, even though ping and telnet work fine between those two servers. @warkolm, what could be the reason? Also, is it safe to delete some indices while the cluster status is red and 2 of the 8 nodes are absent, or could that corrupt those indices once the absent nodes rejoin the cluster later?
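For reference, the capture was along these lines (the interface and host addresses are placeholders; 7001 is our transport port):

# Capture transport traffic between the master and the node that cannot join
sudo tcpdump -i <interface> -nn 'port 7001 and host <master-ip> and host <joining-node-ip>'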

This is just a wild guess, but are you sure the firewall configuration is identical across all the hosts?

All firewalls are disabled, as confirmed by the cloud team. Are there any commands you want me to run to check further?
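A few generic checks could confirm nothing on the hosts is filtering or resetting the transport port; which of these apply depends on the distro, so treat them as suggestions rather than a definitive checklist:

# Is firewalld active on this host?
sudo systemctl status firewalld

# Any iptables / nftables rules that could affect port 7001?
sudo iptables -S
sudo nft list ruleset 2>/dev/null

# Kernel TCP counters; compare retransmit/reset numbers across the hosts
netstat -s | grep -iE 'retrans|reset'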

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.