I recently upgraded from ES 2.4.1 to ES 5.0. Shortly after, I started to notice that nodes were dropping from the cluster and/or behaving erratically. On all my nodes I see the errors below shortly after the nodes start up:
[2016-11-03T10:10:44,929][WARN ][o.e.a.a.c.n.i.TransportNodesInfoAction] [essep01] ignoring unexpected response [null] of type [null], expected [NodeInfo] or [FailedNodeException]
[2016-11-03T10:11:03,883][WARN ][o.e.t.TransportService ] [essep01] Received response for a request that has timed out, sent [36539ms] ago, timed out [6539ms] ago, action [internal:discovery/zen/fd/master_ping], node [{essep03}{XW_QxbVpRwSir7QEIyTJJg}{PF7FuFwzTSqqNQ7QUt-wyQ}{192.168.10.88}{192.168.10.88:9300}], id [12485911]
[2016-11-03T10:11:31,813][WARN ][o.e.t.TransportService ] [essep01] Received response for a request that has timed out, sent [48886ms] ago, timed out [46886ms] ago, action [cluster:monitor/nodes/info[n]], node [{essep03}{XW_QxbVpRwSir7QEIyTJJg}{PF7FuFwzTSqqNQ7QUt-wyQ}{192.168.10.88}{192.168.10.88:9300}], id [12485914]
[2016-11-03T10:12:11,777][WARN ][o.e.t.TransportService ] [essep01] Received response for a request that has timed out, sent [39118ms] ago, timed out [9118ms] ago, action [internal:discovery/zen/fd/master_ping], node [{essep03}{XW_QxbVpRwSir7QEIyTJJg}{PF7FuFwzTSqqNQ7QUt-wyQ}{192.168.10.88}{192.168.10.88:9300}], id [12486787]
[2016-11-03T10:13:17,141][WARN ][o.e.t.TransportService ] [essep01] Received response for a request that has timed out, sent [36637ms] ago, timed out [6636ms] ago, action [internal:discovery/zen/fd/master_ping], node [{essep03}{XW_QxbVpRwSir7QEIyTJJg}{PF7FuFwzTSqqNQ7QUt-wyQ}{192.168.10.88}{192.168.10.88:9300}], id [12487505]
[2016-11-03T10:14:24,595][WARN ][o.e.t.TransportService ] [essep01] Received response for a request that has timed out, sent [38975ms] ago, timed out [8975ms] ago, action [internal:discovery/zen/fd/master_ping], node [{essep03}{XW_QxbVpRwSir7QEIyTJJg}{PF7FuFwzTSqqNQ7QUt-wyQ}{192.168.10.88}{192.168.10.88:9300}], id [12488702]
[2016-11-03T10:15:30,329][WARN ][o.e.t.TransportService ] [essep01] Received response for a request that has timed out, sent [39101ms] ago, timed out [9101ms] ago, action [internal:discovery/zen/fd/master_ping], node [{essep03}{XW_QxbVpRwSir7QEIyTJJg}{PF7FuFwzTSqqNQ7QUt-wyQ}{192.168.10.88}{192.168.10.88:9300}], id [12489606]
[2016-11-03T10:15:46,948][DEBUG][o.e.a.a.c.n.i.TransportNodesInfoAction] [essep01] failed to execute on node [XW_QxbVpRwSir7QEIyTJJg]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [essep03][192.168.10.88:9300][cluster:monitor/nodes/info[n]] request_id [12490115] timed out after [2001ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:840) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:444) [elasticsearch-5.0.0.jar:5.0.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_73]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_73]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_73]
Is this behaviour happening all the time or only once? Does this node join the cluster at all? Can you ping the other Elasticsearch nodes from the erratic one, and are the response times good?
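If it helps with that last check, here is a minimal sketch (Python, standard library only; the node addresses and ports are placeholders, substitute your own cluster members) that times a raw TCP connect to the transport port and an HTTP round trip to _cluster/health from the problem node:

import socket
import time
import urllib.request

# Placeholder node addresses -- replace with your own cluster members.
NODES = ["192.168.10.88", "10.21.1.143"]

for host in NODES:
    # Time a plain TCP connect to the transport port (9300 by default).
    start = time.time()
    try:
        with socket.create_connection((host, 9300), timeout=5):
            print(f"{host}:9300 transport connect took {time.time() - start:.3f}s")
    except OSError as exc:
        print(f"{host}:9300 transport connect failed: {exc}")

    # Time an HTTP round trip to _cluster/health on the REST port (9200 by default).
    start = time.time()
    try:
        with urllib.request.urlopen(f"http://{host}:9200/_cluster/health", timeout=5) as resp:
            resp.read()
            print(f"{host}:9200 _cluster/health took {time.time() - start:.3f}s")
    except OSError as exc:
        print(f"{host}:9200 _cluster/health failed: {exc}")

If the connect or HTTP times are in the multi-second range, that would line up with the master_ping timeouts in your logs.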
I get the exact same behavior in 5.0.0 and 5.0.1. Whenever I restart a node, I get this in the logs:
[2016-11-16T10:02:03,895][INFO ][o.e.n.Node ] [es-log2-data1] started
[2016-11-16T10:02:36,926][INFO ][o.e.d.z.ZenDiscovery ] [es-log2-data1] failed to send join request to master [{es-log3-data1}{pdNDaSyHQLeYB8aH2F0buw}{_Mps7c9_Sj6s3JQEzgBFUA}{10.21.1.143}{10.21.1.143:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2016-11-16T10:03:39,939][INFO ][o.e.d.z.ZenDiscovery ] [es-log2-data1] failed to send join request to master [{es-log3-data1}{pdNDaSyHQLeYB8aH2F0buw}{_Mps7c9_Sj6s3JQEzgBFUA}{10.21.1.143}{10.21.1.143:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2016-11-16T10:04:42,953][INFO ][o.e.d.z.ZenDiscovery ] [es-log2-data1] failed to send join request to master [{es-log3-data1}{pdNDaSyHQLeYB8aH2F0buw}{_Mps7c9_Sj6s3JQEzgBFUA}{10.21.1.143}{10.21.1.143:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2016-11-16T10:05:45,968][INFO ][o.e.d.z.ZenDiscovery ] [es-log2-data1] failed to send join request to master [{es-log3-data1}{pdNDaSyHQLeYB8aH2F0buw}{_Mps7c9_Sj6s3JQEzgBFUA}{10.21.1.143}{10.21.1.143:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2016-11-16T10:06:07,251][INFO ][o.e.d.z.ZenDiscovery ] [es-log2-data1] failed to send join request to master [{es-log3-data1}{pdNDaSyHQLeYB8aH2F0buw}{_Mps7c9_Sj6s3JQEzgBFUA}{10.21.1.143}{10.21.1.143:9300}], reason [NodeDisconnectedException[[es-log3-data1][10.21.1.143:9300][internal:discovery/zen/join] disconnected]]
[2016-11-16T10:06:13,166][INFO ][o.e.c.s.ClusterService ] [es-log2-data1] detected_master {web4-master}
Not sure if it's related, but I had a watch on _cluster/health, and it seems the request got stuck at the time my cluster node joined (I just restarted my watch and it was fine).
Also, the node joined right after the cluster state switched from red to yellow.
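For reference, a minimal polling sketch along those lines (Python, standard library only; the URL and interval are assumptions, adjust for your setup) that puts a timeout on each _cluster/health call, so a stuck request shows up as an error instead of hanging the watch:

import json
import time
import urllib.request

# Assumed endpoint -- point this at one of your own nodes.
HEALTH_URL = "http://localhost:9200/_cluster/health"

while True:
    try:
        # A short timeout makes a hung request visible instead of blocking the loop forever.
        with urllib.request.urlopen(HEALTH_URL, timeout=10) as resp:
            health = json.load(resp)
        print(time.strftime("%H:%M:%S"), health["status"], "nodes:", health["number_of_nodes"])
    except OSError as exc:
        print(time.strftime("%H:%M:%S"), "health check failed:", exc)
    time.sleep(30)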