Hey guys,
We had an outage yesterday and I'm trying to figure out what happened. Our
cluster consists of a single index across 6 nodes: 2 nodes we call routers
(data = false, http-enabled = true, master = true) and 4 nodes we call data
nodes (data = true, http-enabled = false, master = false).
Three hours prior to the outage we had deleted the index, and then, for no
reason we could see, the cluster state started flapping: red, then yellow,
then green, then yellow, and so on. In Paramedic we saw a couple of nodes
drop out of the cluster a few times. I had a look at the logs from one of
the data nodes and one of the router nodes; the node settings (roughly) and
the relevant log sections are below:
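Node settings, roughly (I'm paraphrasing our elasticsearch.yml from memory
rather than copy-pasting it, so treat this as a sketch; only the
role-related bits are shown):

# router nodes (prod-inbox-es-router00 / 01)
node.master: true
node.data: false
http.enabled: true

# data nodes (prod-inbox-es-data00 .. 03)
node.master: false
node.data: true
http.enabled: false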
Data Node Log:
[2013-05-21 15:05:33,576][INFO ][node ]
[prod-inbox-es-data03] {0.20.6}[6491]: stopping ...
[2013-05-21 15:05:33,695][INFO ][node ]
[prod-inbox-es-data03] {0.20.6}[6491]: stopped
[2013-05-21 15:05:33,695][INFO ][node ]
[prod-inbox-es-data03] {0.20.6}[6491]: closing ...
[2013-05-21 15:05:33,732][INFO ][node ]
[prod-inbox-es-data03] {0.20.6}[6491]: closed
[2013-05-21 15:05:42,603][WARN ][common.jna ] Unknown
mlockall error 0
[2013-05-21 15:05:42,707][INFO ][node ]
[prod-inbox-es-data03] {0.20.6}[1623]: initializing ...
[2013-05-21 15:05:42,759][INFO ][plugins ]
[prod-inbox-es-data03] loaded [], sites [bigdesk, paramedic]
[2013-05-21 15:05:52,475][INFO ][node ]
[prod-inbox-es-data03] {0.20.6}[1623]: initialized
[2013-05-21 15:05:52,475][INFO ][node ]
[prod-inbox-es-data03] {0.20.6}[1623]: starting ...
[2013-05-21 15:05:52,838][INFO ][transport ]
[prod-inbox-es-data03] bound_address {inet[/0:0:0:0:0:0:0:0:9300]},
publish_address {inet[/10.112.14.13:9300]}
[2013-05-21 15:05:57,456][INFO ][cluster.service ]
[prod-inbox-es-data03] detected_master
[prod-inbox-es-router00][bpIZmoXGRW-Y3HV9d-I0Bw][inet[/10.108.14.11:9300]]{data=false,
master=true}, added
{[prod-inbox-es-data01][Y8sSS9irQoCfG3kBlvZqeQ][inet[/10.112.14.12:9300]]{master=false},[prod-inbox-es-data00][56rI82WOTNa5s4zlZn09EQ][inet[/10.108.14.12:9300]]{master=false},[prod-inbox-es-router00][bpIZmoXGRW-Y3HV9d-I0Bw][inet[/10.108.14.11:9300]]{data=false,
master=true},[prod-inbox-es-router01][i0a-aLzYSnyYNPIOUoZx3A][inet[/10.112.14.11:9300]]{data=false,
master=true},[prod-inbox-es-data02][1PP_akyURnSA4vlKdv-Kuw][inet[/10.108.14.13:9300]]{master=false},},
reason: zen-disco-receive(from master
[[prod-inbox-es-router00][bpIZmoXGRW-Y3HV9d-I0Bw][inet[/10.108.14.11:9300]]{data=false,
master=true}])
[2013-05-21 15:05:58,033][INFO ][discovery ]
[prod-inbox-es-data03] prod-inbox-locker/zJQKbcTuSyaQfbWgviA_BA
[2013-05-21 15:05:58,034][INFO ][node ]
[prod-inbox-es-data03] {0.20.6}[1623]: started
[2013-05-21 15:06:00,983][INFO ][cluster.service ]
[prod-inbox-es-data03] removed
{[prod-inbox-es-data02][1PP_akyURnSA4vlKdv-Kuw][inet[/10.108.14.13:9300]]{master=false},},
reason: zen-disco-receive(from master
[[prod-inbox-es-router00][bpIZmoXGRW-Y3HV9d-I0Bw][inet[/10.108.14.11:9300]]{data=false,
master=true}])
[2013-05-21 15:06:11,231][INFO ][cluster.service ]
[prod-inbox-es-data03] added
{[prod-inbox-es-data02][y4DzOaG7TcmOoxUFvG53qQ][inet[/10.108.14.13:9300]]{master=false},},
reason: zen-disco-receive(from master
[[prod-inbox-es-router00][bpIZmoXGRW-Y3HV9d-I0Bw][inet[/10.108.14.11:9300]]{data=false,
master=true}])
This section loops every minute indefinitely.
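In case it's relevant to that drop/rejoin cycle, the zen discovery
fault-detection settings seem like the obvious knobs to check. I'm not sure
whether we've overridden any of them; the sketch below is what I believe
the 0.20.x defaults to be, not a copy of our config:

# how often the master pings the other nodes (and they ping it back)
discovery.zen.fd.ping_interval: 1s
# how long each ping may take before it counts as a failure
discovery.zen.fd.ping_timeout: 30s
# how many failed pings before a node is removed from the cluster
discovery.zen.fd.ping_retries: 3
# split-brain guard; hard to raise above 1 with only 2 master-eligible routers
discovery.zen.minimum_master_nodes: 1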
Router Node Log:
[2013-05-21 15:05:33,720][INFO ][cluster.service ]
[prod-inbox-es-router00] removed
{[prod-inbox-es-data03][4y1ikIxoQGWnNiy5QEt2mg][inet[/10.112.14.13:9300]]{master=false},},
reason:
zen-disco-node_left([prod-inbox-es-data03][4y1ikIxoQGWnNiy5QEt2mg][inet[/10.112.14.13:9300]]{master=false})
[2013-05-21 15:05:57,190][INFO ][cluster.service ]
[prod-inbox-es-router00] added
{[prod-inbox-es-data03][zJQKbcTuSyaQfbWgviA_BA][inet[/10.112.14.13:9300]]{master=false},},
reason: zen-disco-receive(join from
node[[prod-inbox-es-data03][zJQKbcTuSyaQfbWgviA_BA][inet[/10.112.14.13:9300]]{master=false}])
[2013-05-21 15:05:57,484][DEBUG][action.admin.cluster.node.stats]
[prod-inbox-es-router00] failed to execute on node [zJQKbcTuSyaQfbWgviA_BA]
org.elasticsearch.transport.RemoteTransportException:
[prod-inbox-es-data03][inet[/10.112.14.13:9300]][cluster/nodes/stats/n]
Caused by: java.lang.NullPointerException
at
org.elasticsearch.action.support.nodes.NodeOperationResponse.writeTo(NodeOperationResponse.java:66)
at
org.elasticsearch.action.admin.cluster.node.stats.NodeStats.writeTo(NodeStats.java:290)
at
org.elasticsearch.transport.netty.NettyTransportChannel.sendResponse(NettyTransportChannel.java:91)
at
org.elasticsearch.transport.netty.NettyTransportChannel.sendResponse(NettyTransportChannel.java:67)
at
org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:276)
at
org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:267)
at
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:268)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
[2013-05-21 15:06:00,960][INFO ][cluster.service ]
[prod-inbox-es-router00] removed
{[prod-inbox-es-data02][1PP_akyURnSA4vlKdv-Kuw][inet[/10.108.14.13:9300]]{master=false},},
reason:
zen-disco-node_left([prod-inbox-es-data02][1PP_akyURnSA4vlKdv-Kuw][inet[/10.108.14.13:9300]]{master=false})
[2013-05-21 15:06:11,131][INFO ][cluster.service ]
[prod-inbox-es-router00] added
{[prod-inbox-es-data02][y4DzOaG7TcmOoxUFvG53qQ][inet[/10.108.14.13:9300]]{master=false},},
reason: zen-disco-receive(join from
node[[prod-inbox-es-data02][y4DzOaG7TcmOoxUFvG53qQ][inet[/10.108.14.13:9300]]{master=false}])
[2013-05-21 15:06:20,695][INFO ][cluster.service ]
[prod-inbox-es-router00] removed
{[prod-inbox-es-data03][zJQKbcTuSyaQfbWgviA_BA][inet[/10.112.14.13:9300]]{master=false},},
reason:
zen-disco-node_left([prod-inbox-es-data03][zJQKbcTuSyaQfbWgviA_BA][inet[/10.112.14.13:9300]]{master=false})
[2013-05-21 15:06:28,569][INFO ][cluster.service ]
[prod-inbox-es-router00] added
{[prod-inbox-es-data03][Omh5vzxOTTGy7cGwwHP40Q][inet[/10.112.14.13:9300]]{master=false},},
reason: zen-disco-receive(join from
node[[prod-inbox-es-data03][Omh5vzxOTTGy7cGwwHP40Q][inet[/10.112.14.13:9300]]{master=false}])
[2013-05-21 15:10:59,608][DEBUG][action.admin.indices.status]
[prod-inbox-es-router00] [inbox-00][3], node[Omh5vzxOTTGy7cGwwHP40Q], [P],
s[STARTED]: Failed to execute
[org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@73e18989]
org.elasticsearch.transport.RemoteTransportException:
[prod-inbox-es-data03][inet[/10.112.14.13:9300]][indices/status/s]
Caused by: org.elasticsearch.indices.IndexMissingException: [inbox-00]
missing
at
org.elasticsearch.indices.InternalIndicesService.indexServiceSafe(InternalIndicesService.java:244)
at
org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:152)
at
org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:59)
at
org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:398)
at
org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:384)
at
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:268)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
That last bit then loops indefinitely.
After a number of hours the data node log started to look like this:
I don't see any evidence of an OOM exception, and the cluster was under
minimal load, so I doubt that was the problem. There was also definitely
enough disk space on all nodes to handle the data they received.
For some more context: our data nodes have 8 GB of RAM each, half of which
is given to the Elasticsearch process.
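On the memory side, the relevant bits boil down to the following (again a
sketch rather than a copy of our config; the "Unknown mlockall error 0"
warning in the data node log above makes me suspect the memory lock isn't
actually being applied):

# ask the OS to lock the process memory in RAM -- this is what triggers
# the mlockall attempt at startup
bootstrap.mlockall: true
# and in the service environment, half of the 8 GB per node:
# ES_HEAP_SIZE=4g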
Does anyone have ideas about what to look at to get a better picture of
what happened?
Cheers,
James