I recently built a new Elasticsearch cluster: 3 nodes running ES 0.90.7 and Java 7u45. Each node has 60G of memory and a 2T SSD disk. I loaded 3T of data into an index with 128 shards on this cluster. So far so good.
I then added three more nodes (identical hardware) and reconfigured the index to have one replica. Then all hell broke loose. The cluster starts replicating shards, but before it manages to complete, one of the nodes gets stuck doing garbage collections and everything stops working. If I restart the cluster completely, the scenario repeats.
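For reference, turning on one replica for an existing index is a settings update along these lines (a minimal sketch; the index name "reference" is taken from the log output below, and localhost:9200 stands in for any of the nodes):

curl -XPUT 'http://localhost:9200/reference/_settings' -d '{
  "index": {
    "number_of_replicas": 1
  }
}'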
There is a light load of queries coming into the cluster while it is trying to stabilize, and I noticed that only some of the nodes were responding, even though all nodes had completed shards. Digging further into this, I noticed that when I do a GET on _cluster/nodes/stats I receive a list showing all six nodes, as expected. But when I GET _cluster/nodes/stats?indices=true, the response only includes a subset of the nodes. Sometimes it is three nodes; at the moment it is just one.
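For clarity, these are the two calls being compared (localhost:9200 stands in for any of the nodes):

curl -XGET 'http://localhost:9200/_cluster/nodes/stats?pretty'
curl -XGET 'http://localhost:9200/_cluster/nodes/stats?indices=true&pretty'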
Looking in the logs on the machine where I run the GET of _cluster/nodes/stats?indices=true, I see exceptions like this:
[2013-12-18 07:19:55,707][DEBUG][action.admin.cluster.node.stats] [NODE1] failed to execute on node [fAblkA8gRiuWP5_IWFihrA]
org.elasticsearch.transport.RemoteTransportException: [NODE2][inet[/A.B.C.D:9300]][cluster/nodes/stats/n]
Caused by: org.elasticsearch.index.engine.EngineClosedException: [reference][24] CurrentState[CLOSED]
    at org.elasticsearch.index.engine.robin.RobinEngine.ensureOpen(RobinEngine.java:969)
    at org.elasticsearch.index.engine.robin.RobinEngine.segmentsStats(RobinEngine.java:1181)
    at org.elasticsearch.index.shard.service.InternalIndexShard.segmentStats(InternalIndexShard.java:509)
    at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:154)
    at org.elasticsearch.indices.InternalIndicesService.stats(InternalIndicesService.java:212)
    at org.elasticsearch.node.service.NodeService.stats(NodeService.java:165)
    at org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:100)
    at org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:43)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:273)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:264)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
When the response includes only one node I get five of these exceptions, so the two very much feel related.
Does anybody have any idea what is happening here? Currently this cluster
is completely unusable.
For some further debugging I enabled debug logging on one of the nodes. Now, when I try to get the indices stats, I see the following in the log on that node:
[2013-12-18 08:02:01,078][DEBUG][index.shard.service      ] [NODE6] [reference][10] Can not build 'completion stats' from engine shard state [RECOVERING]
org.elasticsearch.index.shard.IllegalIndexShardStateException: [reference][10] CurrentState[RECOVERING] operations only allowed when started/relocated
    at org.elasticsearch.index.shard.service.InternalIndexShard.readAllowed(InternalIndexShard.java:765)
    at org.elasticsearch.index.shard.service.InternalIndexShard.acquireSearcher(InternalIndexShard.java:600)
    at org.elasticsearch.index.shard.service.InternalIndexShard.acquireSearcher(InternalIndexShard.java:595)
    at org.elasticsearch.index.shard.service.InternalIndexShard.completionStats(InternalIndexShard.java:536)
    at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:151)
    at org.elasticsearch.indices.InternalIndicesService.stats(InternalIndicesService.java:212)
    at org.elasticsearch.node.service.NodeService.stats(NodeService.java:165)
    at org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:100)
    at org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:43)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:273)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:264)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Looking in the head plugin I see that this node has a number of green shards, but shard 10 is yellow (recovering). This smells like a bug.
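For the record, the same shard-level view should also be available without the head plugin via the cluster health API at shard detail (a sketch; the level parameter is assumed to behave on 0.90 as documented for that era):

curl -XGET 'http://localhost:9200/_cluster/health?level=shards&pretty'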
The exception only tells you that statistics cannot be gathered while a shard is recovering. A recovery in itself is nothing bad (though it should finish at some point).
Does the shard flip from recovering to started at some point?
In the next release this exception will vanish, as we handle it differently (getting statistics while a shard is recovering will fail more gracefully).
On Wednesday, December 18, 2013 11:00:07 AM UTC+1, Alexander Reelsen wrote:
The exception only tells you that statistics cannot be gathered while a shard is recovering. A recovery in itself is nothing bad (though it should finish at some point).
But should this prevent the node from showing up at all in the response to
_cluster/nodes/stats?indices=true ?
Currently every node in my cluster has at least one shard that is recovering, so the indices call above returns no nodes at all.
The shards do eventually flip over to started.
The reason I care about this is that I am trying to figure out why the cluster eventually falls over, and this problem prevents me from getting statistics like the number of calls and cache sizes from the different nodes.
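Since the plain _cluster/nodes/stats call does return all six nodes, the JVM and heap numbers relevant to the GC problem may still be reachable without requesting the indices section; a sketch, assuming the clear and jvm query parameters of the 0.90-era node stats API, with localhost:9200 standing in for any node:

curl -XGET 'http://localhost:9200/_cluster/nodes/stats?clear=true&jvm=true&pretty'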