Strange cluster restart loop

Hey guys,

We had an outage yesterday and I'm trying to figure out what happened. Our
cluster consists of one index across six nodes: two nodes we call routers
(node.data = false, http.enabled = true, node.master = true) and four nodes
we call data nodes (node.data = true, http.enabled = false, node.master =
false). Three hours prior to the outage we had deleted the index, and then,
for some reason, our cluster state started cycling: red, then yellow, then
green, then yellow, and so on. In Paramedic we saw a couple of nodes drop
out of the cluster a few times. I went and had a look at the logs from one
of the data nodes and one of the router nodes, and this is what they looked
like:

Data Node Log:
[2013-05-21 15:05:33,576][INFO ][node ]
[prod-inbox-es-data03] {0.20.6}[6491]: stopping ...
[2013-05-21 15:05:33,695][INFO ][node ]
[prod-inbox-es-data03] {0.20.6}[6491]: stopped
[2013-05-21 15:05:33,695][INFO ][node ]
[prod-inbox-es-data03] {0.20.6}[6491]: closing ...
[2013-05-21 15:05:33,732][INFO ][node ]
[prod-inbox-es-data03] {0.20.6}[6491]: closed
[2013-05-21 15:05:42,603][WARN ][common.jna ] Unknown
mlockall error 0
[2013-05-21 15:05:42,707][INFO ][node ]
[prod-inbox-es-data03] {0.20.6}[1623]: initializing ...
[2013-05-21 15:05:42,759][INFO ][plugins ]
[prod-inbox-es-data03] loaded [], sites [bigdesk, paramedic]
[2013-05-21 15:05:52,475][INFO ][node ]
[prod-inbox-es-data03] {0.20.6}[1623]: initialized
[2013-05-21 15:05:52,475][INFO ][node ]
[prod-inbox-es-data03] {0.20.6}[1623]: starting ...
[2013-05-21 15:05:52,838][INFO ][transport ]
[prod-inbox-es-data03] bound_address {inet[/0:0:0:0:0:0:0:0:9300]},
publish_address {inet[/10.112.14.13:9300]}
[2013-05-21 15:05:57,456][INFO ][cluster.service ]
[prod-inbox-es-data03] detected_master
[prod-inbox-es-router00][bpIZmoXGRW-Y3HV9d-I0Bw][inet[/10.108.14.11:9300]]{data=false,
master=true}, added
{[prod-inbox-es-data01][Y8sSS9irQoCfG3kBlvZqeQ][inet[/10.112.14.12:9300]]{master=false},[prod-inbox-es-data00][56rI82WOTNa5s4zlZn09EQ][inet[/10.108.14.12:9300]]{master=false},[prod-inbox-es-router00][bpIZmoXGRW-Y3HV9d-I0Bw][inet[/10.108.14.11:9300]]{data=false,
master=true},[prod-inbox-es-router01][i0a-aLzYSnyYNPIOUoZx3A][inet[/10.112.14.11:9300]]{data=false,
master=true},[prod-inbox-es-data02][1PP_akyURnSA4vlKdv-Kuw][inet[/10.108.14.13:9300]]{master=false},},
reason: zen-disco-receive(from master
[[prod-inbox-es-router00][bpIZmoXGRW-Y3HV9d-I0Bw][inet[/10.108.14.11:9300]]{data=false,
master=true}])
[2013-05-21 15:05:58,033][INFO ][discovery ]
[prod-inbox-es-data03] prod-inbox-locker/zJQKbcTuSyaQfbWgviA_BA
[2013-05-21 15:05:58,034][INFO ][node ]
[prod-inbox-es-data03] {0.20.6}[1623]: started
[2013-05-21 15:06:00,983][INFO ][cluster.service ]
[prod-inbox-es-data03] removed
{[prod-inbox-es-data02][1PP_akyURnSA4vlKdv-Kuw][inet[/10.108.14.13:9300]]{master=false},},
reason: zen-disco-receive(from master
[[prod-inbox-es-router00][bpIZmoXGRW-Y3HV9d-I0Bw][inet[/10.108.14.11:9300]]{data=false,
master=true}])
[2013-05-21 15:06:11,231][INFO ][cluster.service ]
[prod-inbox-es-data03] added
{[prod-inbox-es-data02][y4DzOaG7TcmOoxUFvG53qQ][inet[/10.108.14.13:9300]]{master=false},},
reason: zen-disco-receive(from master
[[prod-inbox-es-router00][bpIZmoXGRW-Y3HV9d-I0Bw][inet[/10.108.14.11:9300]]{data=false,
master=true}])

This section loops every minute indefinitely.

Router Node Log:
[2013-05-21 15:05:33,720][INFO ][cluster.service ]
[prod-inbox-es-router00] removed
{[prod-inbox-es-data03][4y1ikIxoQGWnNiy5QEt2mg][inet[/10.112.14.13:9300]]{master=false},},
reason:
zen-disco-node_left([prod-inbox-es-data03][4y1ikIxoQGWnNiy5QEt2mg][inet[/10.112.14.13:9300]]{master=false})
[2013-05-21 15:05:57,190][INFO ][cluster.service ]
[prod-inbox-es-router00] added
{[prod-inbox-es-data03][zJQKbcTuSyaQfbWgviA_BA][inet[/10.112.14.13:9300]]{master=false},},
reason: zen-disco-receive(join from
node[[prod-inbox-es-data03][zJQKbcTuSyaQfbWgviA_BA][inet[/10.112.14.13:9300]]{master=false}])
[2013-05-21 15:05:57,484][DEBUG][action.admin.cluster.node.stats]
[prod-inbox-es-router00] failed to execute on node [zJQKbcTuSyaQfbWgviA_BA]
org.elasticsearch.transport.RemoteTransportException:
[prod-inbox-es-data03][inet[/10.112.14.13:9300]][cluster/nodes/stats/n]
Caused by: java.lang.NullPointerException
at
org.elasticsearch.action.support.nodes.NodeOperationResponse.writeTo(NodeOperationResponse.java:66)
at
org.elasticsearch.action.admin.cluster.node.stats.NodeStats.writeTo(NodeStats.java:290)
at
org.elasticsearch.transport.netty.NettyTransportChannel.sendResponse(NettyTransportChannel.java:91)
at
org.elasticsearch.transport.netty.NettyTransportChannel.sendResponse(NettyTransportChannel.java:67)
at
org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:276)
at
org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:267)
at
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:268)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
[2013-05-21 15:06:00,960][INFO ][cluster.service ]
[prod-inbox-es-router00] removed
{[prod-inbox-es-data02][1PP_akyURnSA4vlKdv-Kuw][inet[/10.108.14.13:9300]]{master=false},},
reason:
zen-disco-node_left([prod-inbox-es-data02][1PP_akyURnSA4vlKdv-Kuw][inet[/10.108.14.13:9300]]{master=false})
[2013-05-21 15:06:11,131][INFO ][cluster.service ]
[prod-inbox-es-router00] added
{[prod-inbox-es-data02][y4DzOaG7TcmOoxUFvG53qQ][inet[/10.108.14.13:9300]]{master=false},},
reason: zen-disco-receive(join from
node[[prod-inbox-es-data02][y4DzOaG7TcmOoxUFvG53qQ][inet[/10.108.14.13:9300]]{master=false}])
[2013-05-21 15:06:20,695][INFO ][cluster.service ]
[prod-inbox-es-router00] removed
{[prod-inbox-es-data03][zJQKbcTuSyaQfbWgviA_BA][inet[/10.112.14.13:9300]]{master=false},},
reason:
zen-disco-node_left([prod-inbox-es-data03][zJQKbcTuSyaQfbWgviA_BA][inet[/10.112.14.13:9300]]{master=false})
[2013-05-21 15:06:28,569][INFO ][cluster.service ]
[prod-inbox-es-router00] added
{[prod-inbox-es-data03][Omh5vzxOTTGy7cGwwHP40Q][inet[/10.112.14.13:9300]]{master=false},},
reason: zen-disco-receive(join from
node[[prod-inbox-es-data03][Omh5vzxOTTGy7cGwwHP40Q][inet[/10.112.14.13:9300]]{master=false}])
[2013-05-21 15:10:59,608][DEBUG][action.admin.indices.status]
[prod-inbox-es-router00] [inbox-00][3], node[Omh5vzxOTTGy7cGwwHP40Q], [P],
s[STARTED]: Failed to execute
[org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@73e18989]
org.elasticsearch.transport.RemoteTransportException:
[prod-inbox-es-data03][inet[/10.112.14.13:9300]][indices/status/s]
Caused by: org.elasticsearch.indices.IndexMissingException: [inbox-00]
missing
at
org.elasticsearch.indices.InternalIndicesService.indexServiceSafe(InternalIndicesService.java:244)
at
org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:152)
at
org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:59)
at
org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:398)
at
org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:384)
at
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:268)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)

That last bit then loops indefinitely.
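
One thing that jumps out in both logs is that every time a data node rejoins
it comes back with a different node ID (data02 goes from
1PP_akyURnSA4vlKdv-Kuw to y4DzOaG7TcmOoxUFvG53qQ, and data03 cycles through
4y1ikIxoQGWnNiy5QEt2mg, zJQKbcTuSyaQfbWgviA_BA and Omh5vzxOTTGy7cGwwHP40Q),
and the data node log shows a new PID ([6491] then [1623]), so the process
itself seems to be restarting rather than just dropping off the network.
Something along these lines could be used to watch the churn from a router
node; it's only a rough sketch, and it assumes the 0.20.x nodes-info
endpoint at /_cluster/nodes with HTTP on localhost:9200:

# Rough sketch (Python 2.7): report when a node rejoins with a new ID,
# which would mean the process restarted rather than the link flapping.
# Assumes a router node with HTTP on localhost:9200 and the 0.20.x
# nodes-info endpoint at /_cluster/nodes.
import json
import time
import urllib2

URL = "http://localhost:9200/_cluster/nodes"

last_seen = {}  # node name -> node ID

while True:
    nodes = json.load(urllib2.urlopen(URL))["nodes"]
    for node_id, info in nodes.items():
        name = info["name"]
        previous = last_seen.get(name)
        if previous is not None and previous != node_id:
            print "%s rejoined with a new ID: %s -> %s" % (name, previous, node_id)
        last_seen[name] = node_id
    time.sleep(30)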

After a number of hours the data node log started to look like this:

I don't see any evidence of an OOM exception, and the cluster was under
minimal load, so I doubt that was the problem. There was also definitely
enough disk space available on all nodes to handle any data they received.

Just for some more context: our data nodes have 8 GB of RAM each, half of
which is given to the Elasticsearch process.

Anyone got any ideas about what to look at to get a better picture of what
happened?

Cheers,
James

James Lewis wrote:

We had an outage yesterday and I'm trying to figure out what
happened.

What version of ES?

Did you happen to notice any oom_killer activity in the kernel
log?

Drew

Sorry, it's version 0.20.6 of Elasticsearch, and I haven't checked the
kernel log yet.
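When I do get onto the boxes I'll be looking for oom_killer activity along
these lines; the kernel log path is a guess on my part (it might be
/var/log/messages or only visible via dmesg depending on the distro), so
treat this as a rough sketch:

# Rough sketch (Python 2.7): scan the kernel log for oom_killer activity.
# The path is an assumption; it may be /var/log/messages or /var/log/syslog,
# or the messages may only be available via dmesg.
KERNEL_LOG = "/var/log/kern.log"

with open(KERNEL_LOG) as log:
    for line in log:
        lowered = line.lower()
        if "oom-killer" in lowered or "out of memory" in lowered:
            print line.rstrip()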


Hey guys, I've solved this now. It actually had nothing to do with
Elasticsearch; it was down to cfengine (which we use to manage our configs).
There's a bug in the version of cfengine we run that was triggering a
restart on our nodes every few minutes. We've fixed it and all is well.
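
In case it helps anyone hitting the same symptoms: the restart cadence can
be confirmed straight from the Elasticsearch node log by pulling out the
"starting ..." lines in the format shown above. A rough sketch (the log path
is a guess based on our cluster name, prod-inbox-locker, so adjust for your
install):

# Rough sketch (Python 2.7): show the gap between successive node starts
# by parsing the timestamps of the "starting ..." lines in the ES log.
# The log path is an assumption based on the cluster name above.
import datetime

ES_LOG = "/var/log/elasticsearch/prod-inbox-locker.log"

starts = []
with open(ES_LOG) as log:
    for line in log:
        if ": starting ..." in line:
            # Lines look like: [2013-05-21 15:05:52,475][INFO ][node ] ...
            stamp = line.split("]")[0].lstrip("[")
            starts.append(datetime.datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f"))

for earlier, later in zip(starts, starts[1:]):
    print "restarted %s after the previous start" % (later - earlier)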

Cheers,
James
