Failed to send ping to

ccrivelli · May 25, 2015, 1:29am

Hi guys,
after setting up my Elasticsearch cluster, I started getting 'failed to send ping' errors.
I increased discovery.zen.ping.timeout from 3s (the default) to 6s and finally to 30s, but I'm still getting the same error.
The cluster is configured with unicast discovery on port 9300 for each host (multicast disabled).
The error is not persistent; it only shows up in the logs occasionally:

[2015-05-23 01:03:45,199][WARN ][discovery.zen.ping.unicast] [hostname] failed to send ping to [[#zen_unicast_2#][hostname.domain][inet[hostname/ip-address:9300]]]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [hostname][inet[hostname/ip-address:9300]][internal:discovery/zen/unicast] request_id [58210] timed out after [37500ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:531)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Do you have any advice on tuning Elasticsearch to avoid this error?

Thanks

dadoonet · May 25, 2015, 2:35am

You could have network issues, or maybe old-generation GC is running for a long time?
Anything else in logs?
What is your heap size?
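
To check the GC side, here is a rough Python sketch that sums up old-generation collections per node from the nodes stats API. It assumes the `_nodes/stats/jvm` endpoint is reachable over HTTP on `localhost:9200` and that the response has the `jvm.gc.collectors.old` layout; the function names `old_gc_summary` and `fetch_jvm_stats` are made up for this example.

```python
import json
from urllib.request import urlopen


def old_gc_summary(stats):
    """Map node name -> (old-gen collection count, total old-gen GC millis).

    `stats` is the parsed JSON body of a nodes stats response. Missing
    fields are treated as zero so a partial response does not raise.
    """
    summary = {}
    for node in stats.get("nodes", {}).values():
        collectors = node.get("jvm", {}).get("gc", {}).get("collectors", {})
        old = collectors.get("old", {})
        summary[node.get("name", "?")] = (
            old.get("collection_count", 0),
            old.get("collection_time_in_millis", 0),
        )
    return summary


def fetch_jvm_stats(base_url="http://localhost:9200", timeout=10):
    """Fetch JVM stats for all nodes (base_url is an assumed address)."""
    with urlopen(base_url + "/_nodes/stats/jvm", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `old_gc_summary(fetch_jvm_stats())` twice, a few minutes apart, shows whether old-generation GC time is growing quickly on any node. A large jump between the two samples would point at long GC pauses rather than the network.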

ccrivelli · May 25, 2015, 4:00am

Nothing else in the logs.
I'm using Java 8 with HEAP_SIZE=8g (half of the machine's memory):

497       3719  5.1 54.3 12183504 8871880 ?    SLl  13:40   0:36 /usr/bin/java -Xms8g -Xmx8g ...

The cluster was built from scratch no more than 2-3 weeks ago, and it holds no data yet.

dadoonet · May 25, 2015, 7:41pm

So it looks like a network issue in that case.
I don't see anything other than that. Is that something you can check on your end?

ccrivelli · May 26, 2015, 1:24am

Uhm..
I also installed Marvel yesterday, and today I found several additional errors in the logs, such as:

[2015-05-26 07:49:13,274][ERROR][marvel.agent.exporter    ] [hostname148] remote target didn't respond with 200 OK response code [503 Service Unavailable]. content: [:)
^E��error�ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]��status$^O��]

or

[2015-05-26 01:13:32,667][WARN ][cluster.action.shard     ] [hostname148] [.marvel-2015.05.25][0] sending failed shard for [.marvel-2015.05.25][0], node[sRKyvpnkSkGrWKG7npvLgw], [R], s[STARTED], indexUUID [U5eSpJQGRA2nc66Ll7nHug], reason [Failed to perform [indices:data/write/bulk[s]] on replica, message [SendRequestTransportException[[hostname507][inet[/xx.xx.xx.35:9300]][indices:data/write/bulk[s][r]]]; nested: NodeNotConnectedException[[hostname507][inet[/xx.xx.xx.35:9300]] Node not connected]; ]]
[2015-05-26 01:13:34,069][WARN ][action.bulk              ] [hostname148] Failed to perform indices:data/write/bulk[s] on remote replica [hostname507][sRKyvpnkSkGrWKG7npvLgw][hostname507.domain][inet[/xx.xx.xx.35:9300]][.marvel-2015.05.25][0]
org.elasticsearch.transport.SendRequestTransportException: [hostname507][inet[/xx.xx.xx.35:9300]][indices:data/write/bulk[s][r]]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:284)

I also noticed that the cluster got stuck and stopped responding for almost 2 hours.
This morning at 9:10am, for example, the Elasticsearch logs had stopped at 7:51am; the Elasticsearch process was still running, but only 1 node out of 5 responded properly when I ran:
curl 'http://localhost:9200/_cat/indices?v'

Here is the strace output for one of the Elasticsearch threads:

# strace -p 17524
Process 17524 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = -1 ETIMEDOUT (Connection timed out)
futex(0x7f46e40b4428, FUTEX_WAKE_PRIVATE, 1) = 0
clock_gettime(CLOCK_MONOTONIC, {71633, 270610901}) = 0
futex(0x7f46e40b4454, FUTEX_WAIT_BITSET_PRIVATE, 1, {71634, 270610901}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x7f46e40b4428, FUTEX_WAKE_PRIVATE, 1) = 0
clock_gettime(CLOCK_MONOTONIC, {71634, 270917231}) = 0
futex(0x7f46e40b4454, FUTEX_WAIT_BITSET_PRIVATE, 1, {71635, 270917231}, ffffffff) = -1 ETIMEDOUT (Connection timed out)

After a while the logs started being written again, and this is what was logged:

[2015-05-26 07:51:33,351][DEBUG][action.bulk              ] [hostname148] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]

[2015-05-26 07:51:33,355][ERROR][marvel.agent.exporter    ] [hostname148] create failure (index:[.marvel-2015.05.25] type: [node_stats]): ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]

[2015-05-26 07:51:34,886][INFO ][cluster.service          ] [hostname148] detected_master [hostname383][48wxUSgcRg-UfbYv-iBTkQ][hostname383.domain][inet[/xx.xx.xx.34:9300]], added {[hostname383][48wxUSgcRg-UfbYv-iBTkQ][hostname383.domain][inet[/xx.xx.xx.34:9300]],[hostname108][3dBX7Jw1TFCcHui738YwbA][hostname108][inet[/xx.xx.xx.138:9300]]{data=false, master=false},[hostname036][JpvivKubSPCWC_yWesH4rA][hostname036][inet[/xx.xx.xx.137:9300]]{data=false, master=false},}, reason: zen-disco-receive(from master [[hostname383][48wxUSgcRg-UfbYv-iBTkQ][hostname383.domain][inet[/xx.xx.xx.34:9300]]])

[2015-05-26 07:51:42,584][INFO ][cluster.service          ] [hostname148] added {[hostname272][pwa_oGcsTDeqaCwPN_Z1kg][hostname272.domain][inet[/xx.xx.xx.33:9300]],}, reason: zen-disco-receive(from master [[n036hd0l8383][48wxUSgcRg-UfbYv-iBTkQ][hostname383.s4.chp.cba][inet[/1xx.xx.xx.34:9300]]])

[2015-05-26 09:24:33,617][INFO ][discovery.zen            ] [hostname148] master_left [[hostname383][48wxUSgcRg-UfbYv-iBTkQ][hostname383.domain][inet[/xx.xx.xx.34:9300]]], reason [transport disconnected]

[2015-05-26 09:24:33,619][DEBUG][action.admin.cluster.state] [hostname148] connection exception while trying to forward request to master node [[hostname383][48wxUSgcRg-UfbYv-iBTkQ][hostname383.domain][inet[/xx.xx.xx.34:9300]]], scheduling a retry. Error: [org.elasticsearch.transport.NodeDisconnectedException: [hostname383][inet[/xx.xx.xx.34:9300]][cluster:monitor/state] disconnected]

What does "observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]" mean? Could it be a useful clue?

I also suspect that one problem (the ping timeout) could be causing several of the others. Is that correct?

I have now uninstalled Marvel to reduce the number of errors in the logs and tackle one error at a time. I also changed my unicast cluster configuration from "hostname:port" to "IP:port" to rule out any potential (and/or temporary) DNS resolution issues.

Any advice or thoughts?

Thanks

ccrivelli · May 26, 2015, 4:56am

I have now changed the configuration again, using fully qualified domain names and discovery.zen.minimum_master_nodes: 1, even though I have 5 server nodes + 2 client nodes.
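
For reference, the commonly recommended value for discovery.zen.minimum_master_nodes is a majority of the master-eligible nodes, that is (master-eligible nodes / 2) + 1 with integer division. A small sketch of that arithmetic, assuming all 5 server nodes are master-eligible and the 2 client nodes are not (the function name is made up for this example):

```python
def minimum_master_nodes(master_eligible):
    """Majority quorum for a given number of master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    # Integer division, then add one: 5 -> 3, 3 -> 2, 1 -> 1.
    return master_eligible // 2 + 1


print(minimum_master_nodes(5))  # prints 3
```

Under that assumption the majority value for this cluster would be 3 rather than 1. With a value of 1, any node that loses contact with the others can elect itself master, which can leave the cluster split into independent parts.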

Still getting the same error after a couple of hours:

[2015-05-26 13:33:37,024][WARN ][discovery.zen.ping.unicast] [hostname038.domain] failed to send ping to [[#zen_unicast_3#][hostname038.domain][inet[hostname507.domain/xx.xx.xx.35:9300]]]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [hostname507.domain][inet[/xx.xx.xx.35:9300]][internal:discovery/zen/unicast] request_id [5521] timed out after [37501ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:531)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

and many more ..

What kind of checks could I run on the network (using tcpdump or other tools)?

Any advice is appreciated.

Thanks
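
One basic network check, before reaching for tcpdump, is to time plain TCP connections to the transport port of every node. The Python sketch below does only that: it measures the TCP handshake to port 9300 and knows nothing about the Elasticsearch transport protocol itself. The function names and the 5-second timeout are assumptions for illustration.

```python
import socket
import time


def probe(host, port=9300, timeout=5.0):
    """Return the TCP connect time in seconds, or None if the connect fails.

    Timeouts, refused connections, and DNS failures all surface as OSError
    subclasses, so they are all reported as None.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def probe_all(hosts, port=9300, timeout=5.0):
    """Probe every host once and return {host: seconds or None}."""
    return {host: probe(host, port, timeout) for host in hosts}
```

Running `probe_all([...])` from each node against the other nodes every few seconds, and logging the results with timestamps, makes it possible to see whether failed or slow connects line up with the 'failed to send ping' warnings.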

virtuman · August 2, 2015, 3:55am

Have you ever been able to resolve this SERVICE_UNAVAILABLE/2/no master issue? I'm getting the same errors, and enabling or disabling multicast makes no difference.
Pretty much all nodes are currently up, but they show at best 20%-40% of the actual data.

Omar_Al_Zabir · December 22, 2015, 11:50am

I am having the same problem with ES 2.1.0. Nodes are unable to ping each other at random times throughout the day, the nodes leave the cluster, and Logstash chokes.

I would love to know if you have been able to solve this problem.

Mobasher-NetLinks · March 28, 2016, 9:30am

Have you solved this? I have the same problem with 2.2.0.

Amit_Sharma1 · September 16, 2016, 8:02am

Did you find a solution to the issue above? I am also facing a similar issue regularly.
[discovery.zen.ping.unicast] [Machine-2] failed to send ping to [Machine2][][Machine-2][inet[/172.25.5.234:9300]]]

nsaravanas · September 23, 2016, 5:28am

It could be due to multiple instances of ES running on the same machine: a corrupted or killed ES process that has not released the port. Check the running instances by executing jps; if more than one is running, kill the one that is not releasing the port.
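
If jps is not at hand, a rough way to see whether something is still listening on the transport port is to try connecting to it locally. This is only a sketch: the function name is made up, and it assumes the port in question is reachable on 127.0.0.1.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if a TCP listener accepts connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        # connect_ex returns 0 on success and an errno value otherwise.
        return s.connect_ex((host, port)) == 0
```

With Elasticsearch stopped, `port_in_use(9300)` returning True would suggest a leftover process is still holding the port.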

Regards,
Saravana
