Failed to send ping to

Hi guys,
after setting up my elasticsearch cluster, I got some 'failed to send ping' errors.
Then I increased the discovery.zen.ping.timeout from 3s (default) to 6s and finally to 30s, but I'm still getting the same error.
The cluster is configured with unicast discovery on port 9300 for each host (multicast disabled).
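For reference, the discovery part of my elasticsearch.yml looks roughly like this (hostnames are placeholders, not the real ones):

```
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["host1.domain:9300", "host2.domain:9300", "host3.domain:9300"]
discovery.zen.ping.timeout: 30s
```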
The error is not persistent; it just shows up in the logs from time to time:

[2015-05-23 01:03:45,199][WARN ][discovery.zen.ping.unicast] [hostname] failed to send ping to [[#zen_unicast_2#][hostname.domain][inet[hostname/ip-address:9300]]]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [hostname][inet[hostname/ip-address:9300]][internal:discovery/zen/unicast] request_id [58210] timed out after [37500ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:531)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Do you have any advice on how to tune Elasticsearch to avoid this error?

Thanks

You could have network issues, or old-gen GC may be running for a long time.
Anything else in logs?
What is your heap size?
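For example, you can get a quick look at heap usage and GC times with the nodes stats API (just a rough check, run it against any node):

```
curl 'http://localhost:9200/_nodes/stats/jvm?pretty'
```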

Nothing else in the logs.
I'm using Java 8 with HEAP_SIZE=8g (half of the machine's memory):

```
497       3719  5.1 54.3 12183504 8871880 ?    SLl  13:40   0:36 /usr/bin/java -Xms8g -Xmx8g ...
```

The cluster was built from scratch no more than 2-3 weeks ago, and there is no data in it yet.

So it looks like a network issue in that case.
I don't see anything other than that. Is that something you can check on your end?

Uhm..
I also installed Marvel yesterday and today I found several additional errors in the logs, like:

[2015-05-26 07:49:13,274][ERROR][marvel.agent.exporter    ] [hostname148] remote target didn't respond with 200 OK response code [503 Service Unavailable]. content: [binary payload containing: ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]]

or

[2015-05-26 01:13:32,667][WARN ][cluster.action.shard     ] [hostname148] [.marvel-2015.05.25][0] sending failed shard for [.marvel-2015.05.25][0], node[sRKyvpnkSkGrWKG7npvLgw], [R], s[STARTED], indexUUID [U5eSpJQGRA2nc66Ll7nHug], reason [Failed to perform [indices:data/write/bulk[s]] on replica, message [SendRequestTransportException[[hostname507][inet[/xx.xx.xx.35:9300]][indices:data/write/bulk[s][r]]]; nested: NodeNotConnectedException[[hostname507][inet[/xx.xx.xx.35:9300]] Node not connected]; ]]
[2015-05-26 01:13:34,069][WARN ][action.bulk              ] [hostname148] Failed to perform indices:data/write/bulk[s] on remote replica [hostname507][sRKyvpnkSkGrWKG7npvLgw][hostname507.domain][inet[/xx.xx.xx.35:9300]][.marvel-2015.05.25][0]
org.elasticsearch.transport.SendRequestTransportException: [hostname507][inet[/xx.xx.xx.35:9300]][indices:data/write/bulk[s][r]]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:284)

I also noticed the cluster got stuck and did not respond for almost 2 hours.
This morning at 9:10am all the Elasticsearch logs were stuck at 7:51am; the Elasticsearch processes were still running, but only 1 node out of 5 responded properly when I ran:
curl 'http://localhost:9200/_cat/indices?v'
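On the nodes that still answer, the same kind of check works for cluster health and the node list (standard endpoints, shown here just for reference):

```
curl 'http://localhost:9200/_cluster/health?pretty'
curl 'http://localhost:9200/_cat/nodes?v'
```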

I took an strace of some of the Elasticsearch threads:

# strace -p 17524
Process 17524 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = -1 ETIMEDOUT (Connection timed out)
futex(0x7f46e40b4428, FUTEX_WAKE_PRIVATE, 1) = 0
clock_gettime(CLOCK_MONOTONIC, {71633, 270610901}) = 0
futex(0x7f46e40b4454, FUTEX_WAIT_BITSET_PRIVATE, 1, {71634, 270610901}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x7f46e40b4428, FUTEX_WAKE_PRIVATE, 1) = 0
clock_gettime(CLOCK_MONOTONIC, {71634, 270917231}) = 0
futex(0x7f46e40b4454, FUTEX_WAIT_BITSET_PRIVATE, 1, {71635, 270917231}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
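To see which Java thread that corresponds to, the traced id can be converted to hex and looked up in a jstack dump (a rough sketch; 17524 is the id passed to strace above, and <es_pid> is the main Elasticsearch java process):

```
# thread dump of the Elasticsearch JVM
jstack <es_pid> > /tmp/es-threads.txt
# jstack prints each native thread id as nid=0x... ; convert the traced id and search for it
grep "$(printf 'nid=0x%x' 17524)" /tmp/es-threads.txt
```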

After a while the logs started being written again, and this is the piece of log that was written:

[2015-05-26 07:51:33,351][DEBUG][action.bulk              ] [hostname148] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]

[2015-05-26 07:51:33,355][ERROR][marvel.agent.exporter    ] [hostname148] create failure (index:[.marvel-2015.05.25] type: [node_stats]): ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]

[2015-05-26 07:51:34,886][INFO ][cluster.service          ] [hostname148] detected_master [hostname383][48wxUSgcRg-UfbYv-iBTkQ][hostname383.domain][inet[/xx.xx.xx.34:9300]], added {[hostname383][48wxUSgcRg-UfbYv-iBTkQ][hostname383.domain][inet[/xx.xx.xx.34:9300]],[hostname108][3dBX7Jw1TFCcHui738YwbA][hostname108][inet[/xx.xx.xx.138:9300]]{data=false, master=false},[hostname036][JpvivKubSPCWC_yWesH4rA][hostname036][inet[/xx.xx.xx.137:9300]]{data=false, master=false},}, reason: zen-disco-receive(from master [[hostname383][48wxUSgcRg-UfbYv-iBTkQ][hostname383.domain][inet[/xx.xx.xx.34:9300]]])

[2015-05-26 07:51:42,584][INFO ][cluster.service          ] [hostname148] added {[hostname272][pwa_oGcsTDeqaCwPN_Z1kg][hostname272.domain][inet[/xx.xx.xx.33:9300]],}, reason: zen-disco-receive(from master [[n036hd0l8383][48wxUSgcRg-UfbYv-iBTkQ][hostname383.s4.chp.cba][inet[/1xx.xx.xx.34:9300]]])

[2015-05-26 09:24:33,617][INFO ][discovery.zen            ] [hostname148] master_left [[hostname383][48wxUSgcRg-UfbYv-iBTkQ][hostname383.domain][inet[/xx.xx.xx.34:9300]]], reason [transport disconnected]

[2015-05-26 09:24:33,619][DEBUG][action.admin.cluster.state] [hostname148] connection exception while trying to forward request to master node [[hostname383][48wxUSgcRg-UfbYv-iBTkQ][hostname383.domain][inet[/xx.xx.xx.34:9300]]], scheduling a retry. Error: [org.elasticsearch.transport.NodeDisconnectedException: [hostname383][inet[/xx.xx.xx.34:9300]][cluster:monitor/state] disconnected]

What is that "observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]"? Could it be helpful?

I also think one problem (the ping timeout) could cause several others. Correct?

For now I have uninstalled Marvel, to have fewer errors in the logs and try to solve one error at a time. I also changed my unicast cluster configuration from "hostname:port" to "IP:port" because I want to rule out any potential (and/or temporary) hostname DNS resolution issues.
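The unicast hosts line now uses IPs, something like this (IPs masked as in the logs above):

```
discovery.zen.ping.unicast.hosts: ["xx.xx.xx.33:9300", "xx.xx.xx.34:9300", "xx.xx.xx.35:9300"]
```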

Any advices/thought?

Thanks

And now I have changed the configuration again, using fully qualified domain names and discovery.zen.minimum_master_nodes: 1, even though I have 5 server nodes + 2 client nodes.
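So the discovery section currently looks roughly like this (a sketch of the relevant lines, not the whole file; FQDNs are the same hosts that appear in the logs):

```
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["hostname038.domain:9300", "hostname383.domain:9300", "hostname507.domain:9300"]
discovery.zen.minimum_master_nodes: 1
discovery.zen.ping.timeout: 30s
```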

Still getting the same error after a couple of hours:

[2015-05-26 13:33:37,024][WARN ][discovery.zen.ping.unicast] [hostname038.domain] failed to send ping to [[#zen_unicast_3#][hostname038.domain][inet[hostname507.domain/xx.xx.xx.35:9300]]]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [hostname507.domain][inet[/xx.xx.xx.35:9300]][internal:discovery/zen/unicast] request_id [5521] timed out after [37501ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:531)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

and many more ..

What kind of checks could I run on the network? (using tcpdump or other tools...)
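I was thinking of something along these lines, between the node that logs the failure and the peer it cannot ping (a rough sketch, not sure it's the right approach; the IP is masked as in the logs above):

```
# on the node logging the ping failures: capture transport traffic to/from the peer
tcpdump -i any -nn host xx.xx.xx.35 and port 9300 -w /tmp/es-9300.pcap

# basic reachability / port checks against the same peer
ping -c 5 hostname507.domain
nc -vz hostname507.domain 9300
```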

Any advice is appreciated.

Thanks

Have you ever been able to resolve this SERVICE_UNAVAILABLE/2/no master issue? I'm getting the same errors, and have tried enabling/disabling multicast with the same effect.
Pretty much all nodes are currently up, but they show at best 20%-40% of the actual data.

I am having the same problem with ES 2.1.0. Nodes are unable to ping each other at random times throughout the day, nodes leave the cluster, and Logstash chokes.

I would love to know if you have been able to solve this problem.

Have you solved this? I have the same problem with 2.2.0.

Did you find a solution to the above issue? I am also facing a similar issue regularly.
[discovery.zen.ping.unicast] [Machine-2] failed to send ping to [Machine2][][Machine-2][inet[/172.25.5.234:9300]]]

It could be due to multiple instances of ES running on the same machine: a corrupted/killed ES instance may not have released the port. Check the running instances by executing jps; if more than one is running, kill the one that is not releasing the port.
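Something along these lines (a rough sketch; <pid> is whatever jps/lsof report as the stale process):

```
# list running JVMs; more than one Elasticsearch entry may indicate a stale instance
jps -l

# see which process is actually holding the transport port
lsof -i :9300        # or: netstat -tlnp | grep 9300

# kill the stale instance
kill <pid>
```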

Regards,
Saravana