Uhm..
I also installed Marvel yesterday and today I found several additional errors in the logs, like:
[2015-05-26 07:49:13,274][ERROR][marvel.agent.exporter ] [hostname148] remote target didn't respond with 200 OK response code [503 Service Unavailable]. content: [... error: ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];] ... status ...]
or
[2015-05-26 01:13:32,667][WARN ][cluster.action.shard ] [hostname148] [.marvel-2015.05.25][0] sending failed shard for [.marvel-2015.05.25][0], node[sRKyvpnkSkGrWKG7npvLgw], [R], s[STARTED], indexUUID [U5eSpJQGRA2nc66Ll7nHug], reason [Failed to perform [indices:data/write/bulk[s]] on replica, message [SendRequestTransportException[[hostname507][inet[/xx.xx.xx.35:9300]][indices:data/write/bulk[s][r]]]; nested: NodeNotConnectedException[[hostname507][inet[/xx.xx.xx.35:9300]] Node not connected]; ]]
[2015-05-26 01:13:34,069][WARN ][action.bulk ] [hostname148] Failed to perform indices:data/write/bulk[s] on remote replica [hostname507][sRKyvpnkSkGrWKG7npvLgw][hostname507.domain][inet[/xx.xx.xx.35:9300]][.marvel-2015.05.25][0]
org.elasticsearch.transport.SendRequestTransportException: [hostname507][inet[/xx.xx.xx.35:9300]][indices:data/write/bulk[s][r]]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:284)
I also noticed that the cluster got stuck and stopped responding for almost 2 hours.
For example, this morning at 9:10 am I saw that all the Elasticsearch logs had stopped at 7:51 am. The Elasticsearch processes were still running, but only 1 node out of 5 responded properly when I ran:
curl 'http://localhost:9200/_cat/indices?v'
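To check every node rather than only localhost, the same request could be sent to each host in turn. The following is only a minimal sketch: the function names `http_probe` and `probe_nodes` are made up for illustration, and the 5-second timeout is an assumption. Only the `/_cat/indices?v` endpoint on port 9200 comes from the curl command above.

```python
import http.client
import urllib.request


def http_probe(host, port=9200, timeout=5.0):
    """Return True if the node answers /_cat/indices?v with HTTP 200 in time."""
    url = "http://%s:%d/_cat/indices?v" % (host, port)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # OSError covers refused connections, timeouts and HTTP errors such
        # as 503 (URLError and HTTPError are subclasses of OSError).
        return False


def probe_nodes(hosts, probe=http_probe):
    """Map each host to True (responding) or False (not responding)."""
    return {host: probe(host) for host in hosts}
```

Calling `probe_nodes(["hostname148", "hostname272", "hostname383"])` would return a dictionary showing which of those nodes answered. The `probe` parameter exists only so the check can be swapped out, for instance in a test.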
Here is the strace output of one of the Elasticsearch threads:
# strace -p 17524
Process 17524 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = -1 ETIMEDOUT (Connection timed out)
futex(0x7f46e40b4428, FUTEX_WAKE_PRIVATE, 1) = 0
clock_gettime(CLOCK_MONOTONIC, {71633, 270610901}) = 0
futex(0x7f46e40b4454, FUTEX_WAIT_BITSET_PRIVATE, 1, {71634, 270610901}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x7f46e40b4428, FUTEX_WAKE_PRIVATE, 1) = 0
clock_gettime(CLOCK_MONOTONIC, {71634, 270917231}) = 0
futex(0x7f46e40b4454, FUTEX_WAIT_BITSET_PRIVATE, 1, {71635, 270917231}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
After a while the logs started being written again, and this is what was logged:
[2015-05-26 07:51:33,351][DEBUG][action.bulk ] [hostname148] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-05-26 07:51:33,355][ERROR][marvel.agent.exporter ] [hostname148] create failure (index:[.marvel-2015.05.25] type: [node_stats]): ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]
[2015-05-26 07:51:34,886][INFO ][cluster.service ] [hostname148] detected_master [hostname383][48wxUSgcRg-UfbYv-iBTkQ][hostname383.domain][inet[/xx.xx.xx.34:9300]], added {[hostname383][48wxUSgcRg-UfbYv-iBTkQ][hostname383.domain][inet[/xx.xx.xx.34:9300]],[hostname108][3dBX7Jw1TFCcHui738YwbA][hostname108][inet[/xx.xx.xx.138:9300]]{data=false, master=false},[hostname036][JpvivKubSPCWC_yWesH4rA][hostname036][inet[/xx.xx.xx.137:9300]]{data=false, master=false},}, reason: zen-disco-receive(from master [[hostname383][48wxUSgcRg-UfbYv-iBTkQ][hostname383.domain][inet[/xx.xx.xx.34:9300]]])
[2015-05-26 07:51:42,584][INFO ][cluster.service ] [hostname148] added {[hostname272][pwa_oGcsTDeqaCwPN_Z1kg][hostname272.domain][inet[/xx.xx.xx.33:9300]],}, reason: zen-disco-receive(from master [[hostname383][48wxUSgcRg-UfbYv-iBTkQ][hostname383.domain][inet[/xx.xx.xx.34:9300]]])
[2015-05-26 09:24:33,617][INFO ][discovery.zen ] [hostname148] master_left [[hostname383][48wxUSgcRg-UfbYv-iBTkQ][hostname383.domain][inet[/xx.xx.xx.34:9300]]], reason [transport disconnected]
[2015-05-26 09:24:33,619][DEBUG][action.admin.cluster.state] [hostname148] connection exception while trying to forward request to master node [[hostname383][48wxUSgcRg-UfbYv-iBTkQ][hostname383.domain][inet[/xx.xx.xx.34:9300]]], scheduling a retry. Error: [org.elasticsearch.transport.NodeDisconnectedException: [hostname383][inet[/xx.xx.xx.34:9300]][cluster:monitor/state] disconnected]
What does the message "observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]" mean? Could it be helpful?
I also think one problem (the ping timeout) could be causing several of the others. Is that correct?
I have now uninstalled Marvel to have fewer errors in the logs and to tackle one error at a time. I also changed my unicast cluster configuration from "hostname:port" to "IP:port" to avoid any potential (and/or temporary) DNS resolution issues.
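The "hostname:port" to "IP:port" conversion can be done once, up front, so that the node itself never needs DNS at discovery time. A minimal sketch, assuming the entries are plain "hostname" or "hostname:port" strings and that 9300 (the transport port seen in the logs) is the default. The function name `to_ip_entries` is made up for illustration.

```python
import socket


def to_ip_entries(entries, resolve=socket.gethostbyname, default_port=9300):
    """Convert "hostname[:port]" strings into "IP:port" strings."""
    resolved = []
    for entry in entries:
        # partition() leaves port empty when the entry has no ":port" suffix.
        host, _, port = entry.partition(":")
        resolved.append("%s:%s" % (resolve(host), port or default_port))
    return resolved
```

The resulting list is what would go into the unicast hosts setting (`discovery.zen.ping.unicast.hosts` in elasticsearch.yml, as far as I understand). A failed lookup raises `socket.gaierror`, which at least makes a DNS problem visible at configuration time instead of at runtime.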
Any advice/thoughts?
Thanks