ES 5 upgrade: "Received response for a request that has timed out"


(Alvaro Cabrera) #1

I recently upgraded from ES 2.4.1 to ES 5.0. Shortly after, I started to notice that nodes were dropping from the cluster and/or behaving erratically. On all my nodes I see the errors below shortly after the nodes start up:

[2016-11-03T10:10:44,929][WARN ][o.e.a.a.c.n.i.TransportNodesInfoAction] [essep01] ignoring unexpected response [null] of type [null], expected [NodeInfo] or [FailedNodeException]
[2016-11-03T10:11:03,883][WARN ][o.e.t.TransportService ] [essep01] Received response for a request that has timed out, sent [36539ms] ago, timed out [6539ms] ago, action [internal:discovery/zen/fd/master_ping], node [{essep03}{XW_QxbVpRwSir7QEIyTJJg}{PF7FuFwzTSqqNQ7QUt-wyQ}{192.168.10.88}{192.168.10.88:9300}], id [12485911]
[2016-11-03T10:11:31,813][WARN ][o.e.t.TransportService ] [essep01] Received response for a request that has timed out, sent [48886ms] ago, timed out [46886ms] ago, action [cluster:monitor/nodes/info[n]], node [{essep03}{XW_QxbVpRwSir7QEIyTJJg}{PF7FuFwzTSqqNQ7QUt-wyQ}{192.168.10.88}{192.168.10.88:9300}], id [12485914]
[2016-11-03T10:12:11,777][WARN ][o.e.t.TransportService ] [essep01] Received response for a request that has timed out, sent [39118ms] ago, timed out [9118ms] ago, action [internal:discovery/zen/fd/master_ping], node [{essep03}{XW_QxbVpRwSir7QEIyTJJg}{PF7FuFwzTSqqNQ7QUt-wyQ}{192.168.10.88}{192.168.10.88:9300}], id [12486787]
[2016-11-03T10:13:17,141][WARN ][o.e.t.TransportService ] [essep01] Received response for a request that has timed out, sent [36637ms] ago, timed out [6636ms] ago, action [internal:discovery/zen/fd/master_ping], node [{essep03}{XW_QxbVpRwSir7QEIyTJJg}{PF7FuFwzTSqqNQ7QUt-wyQ}{192.168.10.88}{192.168.10.88:9300}], id [12487505]
[2016-11-03T10:14:24,595][WARN ][o.e.t.TransportService ] [essep01] Received response for a request that has timed out, sent [38975ms] ago, timed out [8975ms] ago, action [internal:discovery/zen/fd/master_ping], node [{essep03}{XW_QxbVpRwSir7QEIyTJJg}{PF7FuFwzTSqqNQ7QUt-wyQ}{192.168.10.88}{192.168.10.88:9300}], id [12488702]
[2016-11-03T10:15:30,329][WARN ][o.e.t.TransportService ] [essep01] Received response for a request that has timed out, sent [39101ms] ago, timed out [9101ms] ago, action [internal:discovery/zen/fd/master_ping], node [{essep03}{XW_QxbVpRwSir7QEIyTJJg}{PF7FuFwzTSqqNQ7QUt-wyQ}{192.168.10.88}{192.168.10.88:9300}], id [12489606]
[2016-11-03T10:15:46,948][DEBUG][o.e.a.a.c.n.i.TransportNodesInfoAction] [essep01] failed to execute on node [XW_QxbVpRwSir7QEIyTJJg]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [essep03][192.168.10.88:9300][cluster:monitor/nodes/info[n]] request_id [12490115] timed out after [2001ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:840) [elasticsearch-5.0.0.jar:5.0.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:444) [elasticsearch-5.0.0.jar:5.0.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_73]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_73]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_73]
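One way to read these warnings: the remote node did respond, but only after the request's timeout had already fired. Subtracting the two reported durations gives the timeout that was in effect (here 36539 − 6539 = 30000 ms for the master fault-detection ping, and 2000 ms for the nodes-info request, matching the 2001 ms in the stack trace). A small Python sketch for pulling those numbers out of a log line; the helper name and tuple layout are mine, not anything from Elasticsearch:

```python
import re

# Matches the durations and action in a TransportService
# "Received response for a request that has timed out" warning.
TIMEOUT_RE = re.compile(
    r"sent \[(\d+)ms\] ago, timed out \[(\d+)ms\] ago, "
    r"action \[(?P<action>[^\]]+)\]"
)

def parse_timeout_warning(line):
    """Return (sent_ms, timed_out_ms, effective_timeout_ms, action), or None."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    sent_ms = int(m.group(1))
    timed_out_ms = int(m.group(2))
    # sent - timed_out = the timeout that was configured for this request
    return sent_ms, timed_out_ms, sent_ms - timed_out_ms, m.group("action")
```

Running it over the warnings above shows the same 30 s effective timeout on every master_ping, which points at slow responses rather than a misconfigured timeout.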

Please help


(Alvaro Cabrera) #2

I noticed that only one node in the cluster has this behavior:

[2016-11-03T13:37:03,284][INFO ][o.e.n.Node ] [essep03] starting ...
[2016-11-03T13:37:03,408][INFO ][o.e.t.TransportService ] [essep03] publish_address {192.168.10.88:9300}, bound_addresses {192.168.10.88:9300}, {[::1]:9300}, {127.0.0.1:9300}
[2016-11-03T13:37:03,412][INFO ][o.e.b.BootstrapCheck ] [essep03] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2016-11-03T13:37:33,430][WARN ][o.e.n.Node ] [essep03] timed out while waiting for initial discovery state - timeout: 30s
[2016-11-03T13:37:33,446][INFO ][o.e.h.HttpServer ] [essep03] publish_address {192.168.10.88:9200}, bound_addresses {192.168.10.88:9200}, {[::1]:9200}, {127.0.0.1:9200}
[2016-11-03T13:37:33,446][INFO ][o.e.n.Node ] [essep03] started
[2016-11-03T13:38:06,478][INFO ][o.e.d.z.ZenDiscovery ] [essep03] failed to send join request to master [{essep02}{87WsenQBTLiL7Hjd-dWXUA}{fAQLvpE4S9CKDYbrYsY7aw}{192.168.10.81}{192.168.10.81:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2016-11-03T13:39:09,488][INFO ][o.e.d.z.ZenDiscovery ] [essep03] failed to send join request to master [{essep02}{87WsenQBTLiL7Hjd-dWXUA}{fAQLvpE4S9CKDYbrYsY7aw}{192.168.10.81}{192.168.10.81:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.];
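For reference, the join retries above are governed by the Zen discovery timeouts; if the master is slow to respond, these are the knobs involved. A sketch of the relevant 5.x settings in elasticsearch.yml, with their default values (shown for orientation, not as a recommendation to change them):

```yaml
# elasticsearch.yml -- Zen discovery timeouts (Elasticsearch 5.x)
discovery.zen.ping_timeout: 3s    # how long to wait for pings during master election
discovery.zen.join_timeout: 60s   # how long a join request may take (default 20x ping_timeout)
```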


(Alexander Reelsen) #3

Hey,

Is this behaviour happening all the time, or only once? Does the node join the cluster at all? Can you ping the other Elasticsearch nodes from the erratic one, and are the response times good?
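A quick way to answer the latency question is to time a TCP connect to the other nodes' transport port. A minimal Python sketch; the function name and defaults are my own, not an Elasticsearch API:

```python
import socket
import time

def transport_latency_ms(host, port=9300, timeout=5.0):
    """Time a TCP connect to a node's transport port, in milliseconds.

    A rough stand-in for "can you reach the node and are response times
    good"; it only measures the connect handshake, not request latency.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        return (time.monotonic() - start) * 1000.0

if __name__ == "__main__":
    # Example against the node from the logs above; adjust to your cluster.
    print(f"{transport_latency_ms('192.168.10.88'):.1f} ms")
```

If the connect itself takes hundreds of milliseconds, or times out intermittently, the problem is likely in the network rather than in Elasticsearch.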

--Alex


(Alvaro Cabrera) #4

Actually, the cluster healed itself. I just had to let all the shards relocate and then the node joined... weird behavior.


(Maxence Dunnewind) #5

I get the exact same behavior on 5.0.0 and 5.0.1. Whenever I restart a node, I get this in the logs:

[2016-11-16T10:02:03,895][INFO ][o.e.n.Node               ] [es-log2-data1] started
[2016-11-16T10:02:36,926][INFO ][o.e.d.z.ZenDiscovery     ] [es-log2-data1] failed to send join request to master [{es-log3-data1}{pdNDaSyHQLeYB8aH2F0buw}{_Mps7c9_Sj6s3JQEzgBFUA}{10.21.1.143}{10.21.1.143:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested:     TimeoutException[Timeout waiting for task.]; ]
[2016-11-16T10:03:39,939][INFO ][o.e.d.z.ZenDiscovery     ] [es-log2-data1] failed to send join request to master [{es-log3-data1}{pdNDaSyHQLeYB8aH2F0buw}{_Mps7c9_Sj6s3JQEzgBFUA}{10.21.1.143}{10.21.1.143:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2016-11-16T10:04:42,953][INFO ][o.e.d.z.ZenDiscovery     ] [es-log2-data1] failed to send join request to master [{es-log3-data1}{pdNDaSyHQLeYB8aH2F0buw}{_Mps7c9_Sj6s3JQEzgBFUA}{10.21.1.143}{10.21.1.143:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2016-11-16T10:05:45,968][INFO ][o.e.d.z.ZenDiscovery     ] [es-log2-data1] failed to send join request to master [{es-log3-data1}{pdNDaSyHQLeYB8aH2F0buw}{_Mps7c9_Sj6s3JQEzgBFUA}{10.21.1.143}{10.21.1.143:9300}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]
[2016-11-16T10:06:07,251][INFO ][o.e.d.z.ZenDiscovery     ] [es-log2-data1] failed to send join request to master [{es-log3-data1}{pdNDaSyHQLeYB8aH2F0buw}{_Mps7c9_Sj6s3JQEzgBFUA}{10.21.1.143}{10.21.1.143:9300}], reason [NodeDisconnectedException[[es-log3-data1][10.21.1.143:9300][internal:discovery/zen/join] disconnected]]
[2016-11-16T10:06:13,166][INFO ][o.e.c.s.ClusterService   ] [es-log2-data1] detected_master {web4-master} 

Not sure if it's related, but I had a watch on _cluster/health, and it seems the request got stuck at the time my cluster node joined (I just restarted my watch and it was fine).

Also, the node joined right after the cluster state switched from red to yellow.
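For anyone wanting to reproduce that kind of watch, a minimal poller against the cluster health endpoint might look like this. The base URL and interval are placeholders for your own setup:

```python
import json
import time
import urllib.request

def cluster_health(base_url="http://localhost:9200"):
    """Fetch _cluster/health as a dict; base_url is a placeholder."""
    url = f"{base_url}/_cluster/health"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

def watch_health(base_url="http://localhost:9200", interval=5.0):
    """Print status and node count every `interval` seconds, like a simple watch."""
    while True:
        health = cluster_health(base_url)
        print(health["status"], health["number_of_nodes"])
        time.sleep(interval)

if __name__ == "__main__":
    watch_health()
```

Watching number_of_nodes alongside status makes the moment the restarted node rejoins visible, which is exactly the transition described above (the node joining right as the cluster went from red to yellow).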


(system) #6