We are running two Elasticsearch instances (version 6.6.1) on each of 4 servers, forming an 8-node cluster. When we check the cluster health, the output shows only 6 nodes; 2 nodes are not joining the cluster, and we see the errors below in the cluster logs. Please help us resolve this.
From the cluster log on the master node:
[2022-08-10T09:31:35,990][WARN ][o.e.d.z.ZenDiscovery ] [10.47.91.107_NODE-0] failed to validate incoming join request from node [{10.47.91.109_NODE-1}{hxAzLEDRQCOj-e8O7COl1Q}{0PFEpimfSZuos3Hct_kJdQ}{10.47.91.109}{10.47.91.109:7003}{ml.machine_memory=84290285568, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}]
org.elasticsearch.ElasticsearchTimeoutException: java.util.concurrent.TimeoutException: Timeout waiting for task.
at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:62) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:32) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.discovery.zen.MembershipAction.sendValidateJoinRequestBlocking(MembershipAction.java:106) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.discovery.zen.ZenDiscovery.handleJoinRequest(ZenDiscovery.java:888) [elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.discovery.zen.ZenDiscovery$MembershipListener.onJoin(ZenDiscovery.java:1135) [elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.discovery.zen.MembershipAction$JoinRequestRequestHandler.messageReceived(MembershipAction.java:142) [elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.discovery.zen.MembershipAction$JoinRequestRequestHandler.messageReceived(MembershipAction.java:138) [elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) [elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:250) [x-pack-security-6.6.1.jar:6.6.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:308) [x-pack-security-6.6.1.jar:6.6.1]
Cluster logs from the node that has not joined the cluster:
[2022-08-10T09:23:00,778][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [10.47.91.109_NODE-0] timed out while retrying [indices:admin/create] after failure (timeout [1m])
[2022-08-10T09:23:00,839][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [10.237.92.109_NODE-0] no known master node, scheduling a retry
[2022-08-10T09:23:11,812][INFO ][o.e.d.z.ZenDiscovery ] [10.47.91.109_NODE-0] failed to send join request to master [{10.47.91.107_NODE-0}{0VL8bYl2TGunp4X_qNKTxw}{yoWIdXPNS4iEXO0_9OnyDQ}{10.47.91.107}{10.47.91.107:7001}{ml.machine_memory=84290285568, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2022-08-10T09:23:11,819][WARN ][o.e.d.z.UnicastZenPing ] [10.47.91.109_NODE-0] failed to resolve host ['']
java.net.UnknownHostException: '': Name or service not known
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) ~[?:1.8.0_211]
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929) ~[?:1.8.0_211]
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324) ~[?:1.8.0_211]
at java.net.InetAddress.getAllByName0(InetAddress.java:1277) ~[?:1.8.0_211]
at java.net.InetAddress.getAllByName(InetAddress.java:1193) ~[?:1.8.0_211]
at java.net.InetAddress.getAllByName(InetAddress.java:1127) ~[?:1.8.0_211]
at org.elasticsearch.transport.TcpTransport.parse(TcpTransport.java:550) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.transport.TcpTransport.addressesFromString(TcpTransport.java:503) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.transport.TransportService.addressesFromString(TransportService.java:738) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.discovery.zen.UnicastZenPing.lambda$resolveHostsLists$0(UnicastZenPing.java:189) ~[elasticsearch-6.6.1.jar:6.6.1]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_211]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.6.1.jar:6.6.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_211]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_211]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_211]
[2022-08-10T09:23:20,735][WARN ][r.suppressed ] [10.47.91.109_NODE-0] path: /_bulk, params: {}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:166) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:458) [elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:337) [elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.6.1.jar:6.6.1]
Thanks for the reply. Which DNS problem are you referring to? Nothing has changed in the configuration, and this setup has been running smoothly for the last year and a half. The problem appeared suddenly this morning. We restarted a couple of times, but no luck.
Below is my elasticsearch.yml file. Note that the paths, IPs, and ports are set through environment variables (e.g. ES_ZEN_HOSTS2, ES_TCP_PORT) defined in the elasticsearch.service file. Please tell us if anything is wrong here.
# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
# Before you set out to tweak and tune the configuration, make sure you
# understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: home-cluster
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: ${ES_NODE_NAME}
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: ["/data1/elasticsearch${ES_PATH_DATA}","/data2/elasticsearch${ES_PATH_DATA}","/data3/elasticsearch${ES_PATH_DATA}"]
#
#
# Path to log files:
#
path.logs: ${ES_PATH_LOGS}
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
#
#----------------------------------Thread Pool --------------------------------
#
thread_pool.bulk.queue_size: 500
thread_pool.index.queue_size: 1000
#
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: 10.47.91.107
network.publish_host: 10.47.91.107
#
# Set a custom port for HTTP:
#
http.port: ${ES_HTTP_PORT}
transport.tcp.port: ${ES_TCP_PORT}
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
#discovery.zen.ping.unicast.hosts: ["host1", "host2"]
discovery.zen.ping.unicast.hosts: ["${ES_ZEN_HOSTS1:''}","${ES_ZEN_HOSTS2:''}","${ES_ZEN_HOSTS3:''}","${ES_ZEN_HOSTS4:''}","${ES_ZEN_HOSTS5:''}","${ES_ZEN_HOSTS6:''}","${ES_ZEN_HOSTS7:''}","${ES_ZEN_HOSTS8:''}"]
#
# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
#
discovery.zen.minimum_master_nodes: 5
#
# For more information, consult the zen discovery module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
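Given that network.host and network.publish_host are set to the same address, you only need network.host.
The discovery.zen.ping.unicast.hosts setting is the issue. It's possible that one of these variables is not correctly populated, so Elasticsearch is being told the hostname is blank, hence the "failed to resolve host" error.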
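One way to check which of the eight variables is missing is to replay the placeholder substitution outside Elasticsearch. The sketch below is only an approximation of the `${NAME:default}` syntax used in the elasticsearch.yml above, not Elasticsearch's own parser, and the helper names `resolve_hosts` and `suspicious` are made up for illustration. It assumes the script runs with the same environment as the service; variables defined only in elasticsearch.service are not visible from an ordinary login shell.

```python
import os
import re

# Matches ${NAME} or ${NAME:default}. This approximates the placeholder
# syntax seen in the elasticsearch.yml above; it is not Elasticsearch's parser.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::([^}]*))?\}")


def resolve_hosts(raw_entries, env):
    """Return a (variable name, resolved value) pair for each host entry."""
    resolved = []
    for entry in raw_entries:
        match = PLACEHOLDER.fullmatch(entry)
        if match is None:
            # Plain host string without a placeholder: keep it as written.
            resolved.append((None, entry))
            continue
        name, default = match.group(1), match.group(2)
        value = env.get(name)
        if value is None:
            # Variable is unset: fall back to the default, if one was given.
            value = default if default is not None else ""
        resolved.append((name, value))
    return resolved


def suspicious(resolved):
    """Names whose value is blank or the literal two-quote default."""
    return [name for name, value in resolved if value.strip() in ("", "''")]


if __name__ == "__main__":
    # The eight entries from discovery.zen.ping.unicast.hosts.
    entries = ["${ES_ZEN_HOSTS%d:''}" % i for i in range(1, 9)]
    for name in suspicious(resolve_hosts(entries, os.environ)):
        print("unset or blank:", name)
```

Any name this prints would resolve to the literal `''`, which matches the host string in the `UnknownHostException` logged above.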
This configuration has not been changed for a long time, though. The ping and telnet commands also work fine from each server, so there shouldn't be a connectivity issue. Let me change this setting to explicit IPs and ports and update you. Meanwhile, can you help us understand whether we are missing anything else?
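On the question of what else to check in the configuration: the `discovery.zen.minimum_master_nodes: 5` value agrees with the majority formula quoted in the file's own comment, provided all 8 nodes are master-eligible (an assumption, since the thread does not say so). A minimal sketch of that arithmetic, with hypothetical function names:

```python
def minimum_master_nodes(master_eligible):
    """Majority of master-eligible nodes: (N / 2) + 1, using integer division."""
    return master_eligible // 2 + 1


def has_quorum(visible_master_eligible, master_eligible):
    """True if the visible master-eligible nodes can still elect a master."""
    return visible_master_eligible >= minimum_master_nodes(master_eligible)


# Assumption: all 8 nodes in this cluster are master-eligible.
print(minimum_master_nodes(8))  # 5
print(has_quorum(6, 8))         # True
```

Under that assumption, the 6 nodes that did join are enough to keep a master elected, so the quorum setting itself would not explain why the other 2 nodes cannot join.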
[2022-08-10T21:20:42,343][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [10.X.X.109_NODE-0] no known master node, scheduling a retry
[2022-08-10T21:20:58,640][INFO ][o.e.d.z.ZenDiscovery ] [10.X.X.109_NODE-0] failed to send join request to master [{10.X.X.107_NODE-0}{0VL8bYl2TGunp4X_qNKTxw}{P6nJo-32Qdeh1kpCPbAy2Q}{10.X.X.107}{10.X.X.107:7001}{ml.machine_memory=84290285568, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2022-08-10T21:21:01,850][WARN ][r.suppressed ] [10.X.X.109_NODE-0] path: /_bulk, params: {}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:166) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:458) [elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:337) [elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$2.onTimeout(TransportBulkAction.java:492) [elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.6.1.jar:6.6.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_211]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_211]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_211]
Suppressed: org.elasticsearch.discovery.MasterNotDiscoveredException
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$4.onTimeout(TransportMasterNodeAction.java:262) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:322) [elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:249) [elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:564) [elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.6.1.jar:6.6.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_211]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_211]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_211]
[2022-08-10T21:21:06,888][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [10.X.X.109_NODE-0] timed out while retrying [indices:admin/create] after failure (timeout [1m])
IPs are masked for privacy reasons.
We also observed TIME_WAIT entries in the netstat output:
netstat -anp | grep 7001
This is the output for the Elasticsearch port on the non-working node. Does it have anything to do with the issue?
tcp6 0 0 :::7001 :::* LISTEN 44929/java
tcp6 0 0 10.X.X.109:7001 10.X.X.107:13098 ESTABLISHED 44929/java
tcp6 0 0 10.X.X.109:19534 10.X.X.108:7001 TIME_WAIT -
tcp6 0 0 10.X.X.109:63270 10.X.X.107:7001 ESTABLISHED 44929/java
tcp6 0 0 10.X.X.109:22391 10.X.X.110:7001 TIME_WAIT -
tcp6 0 0 10.X.X.109:63272 10.X.X.107:7001 ESTABLISHED 44929/java
tcp6 0 0 10.X.X.109:22253 10.X.X.110:7001 TIME_WAIT -
tcp6 0 0 10.X.X.109:7001 10.X.X.107:13088 ESTABLISHED 44929/java
tcp6 0 0 10.X.X.109:63274 10.X.X.107:7001 ESTABLISHED 44929/java
tcp6 0 0 10.X.X.109:63264 10.X.X.107:7001 ESTABLISHED 44929/java
tcp6 0 0 10.X.X.109:45774 10.X.X.107:7001 ESTABLISHED 45395/java
tcp6 0 0 10.X.X.109:63268 10.X.X.107:7001 ESTABLISHED 44929/java
tcp6 0 0 10.X.X.109:45754 10.X.X.107:7001 ESTABLISHED 45395/java
tcp6 0 0 10.X.X.109:45760 10.X.X.107:7001 ESTABLISHED 45395/java
tcp6 0 0 10.X.X.109:45764 10.X.X.107:7001 ESTABLISHED 45395/java
tcp6 0 0 10.X.X.109:7001 10.X.X.107:13078 ESTABLISHED 44929/java
tcp6 0 0 10.X.X.109:48629 10.X.X.107:7001 TIME_WAIT -
tcp6 0 0 10.X.X.109:7001 10.X.X.107:13090 ESTABLISHED 44929/java
tcp6 0 0 10.X.X.109:45770 10.X.X.107:7001 ESTABLISHED 45395/java
tcp6 0 0 10.X.X.109:7001 10.X.X.107:13100 ESTABLISHED 44929/java
tcp6 0 0 10.X.X.109:7001 10.X.X.107:13084 ESTABLISHED 44929/java
tcp6 0 0 10.X.X.109:63266 10.X.X.107:7001 ESTABLISHED 44929/java
tcp6 0 0 10.X.X.109:63282 10.X.X.107:7001 ESTABLISHED 44929/java
tcp6 0 0 10.X.X.109:63278 10.X.X.107:7001 ESTABLISHED 44929/java
tcp6 0 0 10.X.X.109:45772 10.X.X.107:7001 ESTABLISHED 45395/java
tcp6 0 0 10.X.X.109:45762 10.X.X.107:7001 ESTABLISHED 45395/java
tcp6 0 0 10.X.X.109:7001 10.X.X.107:13086 ESTABLISHED 44929/java
tcp6 0 0 10.X.X.109:7001 10.X.X.107:13080 ESTABLISHED 44929/java
Also, 107 is the master, 109 is the server that is not joining, and 108 and 110 are other servers whose nodes have joined the cluster. The netstat output is from the 109 server.
A tcpdump on the Elasticsearch port shows multiple "TCP Retransmission" and "TCP Previous segment not captured" packets between the master node and the non-joined node. However, ping and telnet work fine between those 2 servers. @warkolm, what could be the reason? Also, can we delete some indices while the cluster is in red status? Will that corrupt an index once the missing nodes rejoin the cluster later? Right now 6 nodes are present in the cluster instead of 8. Is it safe to delete some indices while 2 nodes are absent?
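A long netstat listing like this one is easier to read when grouped by peer, direction, and state. The sketch below assumes the column layout shown in the output (protocol, two queue counters, local address, foreign address, state, PID/program); the `summarize` helper is hypothetical. TIME_WAIT is a normal state for a recently closed connection, so on its own it does not indicate a fault.

```python
from collections import Counter


def summarize(netstat_text, port="7001"):
    """Count connections on `port` by (remote host, direction, state)."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if state == "LISTEN":
            continue  # the listening socket has no peer
        # rpartition splits on the last colon, so IPv6-style prefixes survive.
        _, _, local_port = local.rpartition(":")
        remote_host, _, remote_port = remote.rpartition(":")
        if local_port == port:
            direction = "inbound"
        elif remote_port == port:
            direction = "outbound"
        else:
            continue
        counts[(remote_host, direction, state)] += 1
    return counts


# Four lines copied from the masked output in this thread.
sample = """\
tcp6 0 0 :::7001 :::* LISTEN 44929/java
tcp6 0 0 10.X.X.109:7001 10.X.X.107:13098 ESTABLISHED 44929/java
tcp6 0 0 10.X.X.109:19534 10.X.X.108:7001 TIME_WAIT -
tcp6 0 0 10.X.X.109:63270 10.X.X.107:7001 ESTABLISHED 44929/java
"""

for key, count in sorted(summarize(sample).items()):
    print(key, count)
```

Because both Elasticsearch instances on a server may show up in the same listing, it can help to run the summary once per PID as well.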