Elasticsearch cluster fails to stabilize

I recently built a new Elasticsearch cluster. 3 nodes running ES 0.90.7 and
Java 7u45. Each node has 60G of memory and 2T SSD disk. I loaded 3T of data
into an index with 128 shards on this cluster. So far so good.

I then added three more nodes (identical hw) and reconfigured the index to
have one replica. Then all hell breaks loose. The cluster starts to
replicate shards but before it manages to complete one of the nodes get
stuck doing garbage collections and everything stops working. If I restart
the cluster completely the scenario repeats.

There is a light load of queries coming into the cluster as it is trying to
stabilize and I noted that only some of the nodes were responding, even
though all nodes had completed shards. When digging further into this I
noted that when I do a get on _cluster/nodes/stats I receive a list showing
all six nodes, as expected. But when I get
_cluster/nodes/stats?indices=true the response only includes a subset of
the nodes. Sometimes it is three nodes, at the moment it is just one.

Looking in the logs on the machine I am doing the get of
_cluster/nodes/stats?indices=true on I see exceptions like this in the log:
[2013-12-18 07:19:55,707][DEBUG][action.admin.cluster.node.stats] [NODE1]
failed to execute on node [fAblkA8gRiuWP5_IWFihrA]
org.elasticsearch.transport.RemoteTransportException:
[NODE2][inet[/A.B.C.D:9300]][cluster/nodes/stats/n]
Caused by: org.elasticsearch.index.engine.EngineClosedException:
[reference][24] CurrentState[CLOSED]
at
org.elasticsearch.index.engine.robin.RobinEngine.ensureOpen(RobinEngine.java:969)
at
org.elasticsearch.index.engine.robin.RobinEngine.segmentsStats(RobinEngine.java:1181)
at
org.elasticsearch.index.shard.service.InternalIndexShard.segmentStats(InternalIndexShard.java:509)
at
org.elasticsearch.action.admin.indices.stats.CommonStats.(CommonStats.java:154)
at
org.elasticsearch.indices.InternalIndicesService.stats(InternalIndicesService.java:212)
at
org.elasticsearch.node.service.NodeService.stats(NodeService.java:165)
at
org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:100)
at
org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:43)
at
org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:273)
at
org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:264)
at
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

When the response only includes one node I get five of these, so they very
much feel related.

Does anybody have any idea what is happening here? Currently this cluster
is completely unusable.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/0812ab27-c872-40af-8970-141a9bdcede8%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

I should probably also mention that I do not see any other exceptions than
the one mentioned above in any of the logs.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/265b5d3c-08f8-4c9a-ba2a-54d91fb6d0b3%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

On some further debugging I enabled debug logging on one of the nodes. Now
when I try to get the indices stats I get the following in the log on the
debugging node:
[2013-12-18 08:02:01,078][DEBUG][index.shard.service ] [NODE6]
[reference][10] Can not build 'completion stats' from engine shard state
[RECOVERING]
org.elasticsearch.index.shard.IllegalIndexShardStateException:
[reference][10] CurrentState[RECOVERING] operations only allowed when
started/relocated
at
org.elasticsearch.index.shard.service.InternalIndexShard.readAllowed(InternalIndexShard.java:765)
at
org.elasticsearch.index.shard.service.InternalIndexShard.acquireSearcher(InternalIndexShard.java:600)
at
org.elasticsearch.index.shard.service.InternalIndexShard.acquireSearcher(InternalIndexShard.java:595)
at
org.elasticsearch.index.shard.service.InternalIndexShard.completionStats(InternalIndexShard.java:536)
at
org.elasticsearch.action.admin.indices.stats.CommonStats.(CommonStats.java:151)
at
org.elasticsearch.indices.InternalIndicesService.stats(InternalIndicesService.java:212)
at
org.elasticsearch.node.service.NodeService.stats(NodeService.java:165)
at
org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:100)
at
org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:43)
at
org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:273)
at
org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:264)
at
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

Looking in head I see that this node has a number of green shards, but
shard 10 is yellow (recovering). This smells like a bug.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/62fdf587-7779-4814-97a6-f1381993eba5%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hey Martin,

the exception only tells you, that statistics cannot be gathered while a
shard is recovering. A recovery itself is nothing bad (should finish
sometime though).

Does the shard flip from recovering to started at some time?

In the next release this exception will vanish, as we handle it differently
(getting statistics while a shard is recovering will fail more graceful).

--Alex

On Wed, Dec 18, 2013 at 9:04 AM, Martin Forssen maf@recordedfuture.comwrote:

On some further debugging I enabled debug logging on one of the nodes. Now
when I try to get the indices stats I get the following in the log on the
debugging node:
[2013-12-18 08:02:01,078][DEBUG][index.shard.service ] [NODE6]
[reference][10] Can not build 'completion stats' from engine shard state
[RECOVERING]
org.elasticsearch.index.shard.IllegalIndexShardStateException:
[reference][10] CurrentState[RECOVERING] operations only allowed when
started/relocated
at
org.elasticsearch.index.shard.service.InternalIndexShard.readAllowed(InternalIndexShard.java:765)
at
org.elasticsearch.index.shard.service.InternalIndexShard.acquireSearcher(InternalIndexShard.java:600)
at
org.elasticsearch.index.shard.service.InternalIndexShard.acquireSearcher(InternalIndexShard.java:595)
at
org.elasticsearch.index.shard.service.InternalIndexShard.completionStats(InternalIndexShard.java:536)
at
org.elasticsearch.action.admin.indices.stats.CommonStats.(CommonStats.java:151)

at

org.elasticsearch.indices.InternalIndicesService.stats(InternalIndicesService.java:212)
at
org.elasticsearch.node.service.NodeService.stats(NodeService.java:165)
at
org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:100)
at
org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:43)
at
org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:273)
at
org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:264)
at
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

Looking in head I see that this node has a number of green shards, but
shard 10 is yellow (recovering). This smells like a bug.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/62fdf587-7779-4814-97a6-f1381993eba5%40googlegroups.com
.

For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAGCwEM9mwdr1RoEaPsVHS5fq%3DydygZjoSiNDn6e%3DFgT8jw9xCw%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.

On Wednesday, December 18, 2013 11:00:07 AM UTC+1, Alexander Reelsen wrote:

the exception only tells you, that statistics cannot be gathered while a
shard is recovering. A recovery itself is nothing bad (should finish
sometime though).

But should this prevent the node from showing up at all in the response to
_cluster/nodes/stats?indices=true ?
Currently all nodes in my cluster have at least one shard which is
recovering so the indices call above returns no nodes at all.

The shards do eventually flip over to started.

The reason I care about this is that I am trying to figure out why the
cluster eventually falls over. And this problem prevents me from getting
statistics like number of calls and cache sizes form teh different nodes.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/8e714f98-1be8-49df-a204-31e04cc30173%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.