Shard failures

Drew_Blessing · April 24, 2014, 7:35pm

We have been struggling with this issue for a few months. We've experienced
it in versions 0.90.6 - 0.90.13 and now in 1.1, too.

A shard (sometimes 2) will fail within a single index. We get this error
during or after our data loader indexes data. Sometimes it takes a day or
two to occur but most recently it's been immediately on/after first index.
The shard that fails is always in the same index. This is a 2-node cluster
running on CentOS 6.5 with Oracle Java 1.7.0u51. In addition, there is 1
non-data node for the process that handles indexing data and 2 non-data
nodes serving the front-end. All non-data nodes are java clients using
spring-data-elasticsearch library. All nodes are on 1.1 now. We understand
there's a possibility that our loader application is causing this but we
can't see how or where. Also, it seems like a bug if a client can cause
shards to fail on the server. We are grasping at straws now and appreciate
any ideas on what could be causing this.

In this gist, the first log message is from node 1 and happens at the same
time that the shard failure occurs on node 2. See node 2 for the stack(s)
that occur when the shard fails. It's interesting that node 1 says it's
closing the connection because of it. Someone on Twitter noted that these
are WARN level messages and don't signify a failure. However, it is causing
queries against this index to totally fail, so there's definitely more than
a WARN scenario going on here. Any thoughts?
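
One way to confirm which shards are actually unavailable, rather than inferring it from WARN lines, is to inspect the body returned by `GET /_cluster/health?level=shards`. The following is a minimal sketch, not something from the original post: it assumes the response nests shards as `indices -> <index> -> shards -> <shard id> -> status`, the function name `unhealthy_shards` is hypothetical, and the sample response is invented for illustration.

```python
def unhealthy_shards(health):
    """Return (index, shard, status) for every shard whose status is not green.

    `health` is the parsed JSON body of a shard-level cluster health call.
    The nesting used here is an assumption about that response shape.
    """
    problems = []
    for index_name, index_info in health.get("indices", {}).items():
        for shard_id, shard_info in index_info.get("shards", {}).items():
            status = shard_info.get("status")
            if status != "green":
                problems.append((index_name, int(shard_id), status))
    return sorted(problems)


# Hypothetical response, invented for illustration only.
sample = {
    "status": "red",
    "indices": {
        "product": {
            "status": "red",
            "shards": {
                "0": {"status": "red", "primary_active": False, "active_shards": 0},
                "4": {"status": "green", "primary_active": True, "active_shards": 2},
            },
        }
    },
}

print(unhealthy_shards(sample))
```

A red shard has no active primary, which would be consistent with queries against the index failing outright.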

https://gist.github.com/dblessing/11266650

Node1

This is the other node in the cluster. It doesn't have a failed shard but does give this message at the same 
time the shard fails on the other node.
------

[2014-04-24 14:04:06,732][WARN ][cluster.action.shard     ] [servicebus-es1.bkeprd.com] [product][0] received shard failed for [product][0], node[ImXrGRu0SXm0rgQWKEu-kw], [P], s[STARTED], indexUUID [SIOpzgF_RoWF6MBGnx4N8w], reason [engine failure, message [MergeException[org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed]; nested: AlreadyClosedException[this IndexReader is closed]; ]]
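
To correlate messages like this across nodes, it can help to pull the index, shard number, and failure reason out of each line. The sketch below is only an illustration: the regular expression is fitted to the single log line shown above (with its wrapped pieces rejoined into one line), and the names `SHARD_FAILED` and `parse_shard_failed` are hypothetical. Other Elasticsearch versions may format this message differently.

```python
import re

# Pattern fitted to the one "received shard failed" line in this thread.
SHARD_FAILED = re.compile(
    r"\[(?P<timestamp>[^\]]+)\]\[(?P<level>\w+)\s*\]\[(?P<logger>[^\]]+?)\s*\]\s*"
    r"\[(?P<reporting_node>[^\]]+)\]\s*\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\]\s*"
    r"received shard failed for .*?reason \[(?P<reason>.*)\]\s*$"
)


def parse_shard_failed(line):
    """Return a dict describing the failure, or None if the line does not match."""
    match = SHARD_FAILED.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["shard"] = int(fields["shard"])
    return fields


# The node 1 log line from the gist, rejoined into a single string.
SAMPLE = (
    "[2014-04-24 14:04:06,732][WARN ][cluster.action.shard     ] "
    "[servicebus-es1.bkeprd.com] [product][0] received shard failed for "
    "[product][0], node[ImXrGRu0SXm0rgQWKEu-kw], [P], s[STARTED], "
    "indexUUID [SIOpzgF_RoWF6MBGnx4N8w], reason [engine failure, message "
    "[MergeException[org.apache.lucene.store.AlreadyClosedException: "
    "this IndexReader is closed]; nested: AlreadyClosedException"
    "[this IndexReader is closed]; ]]"
)

parsed = parse_shard_failed(SAMPLE)
print(parsed["index"], parsed["shard"], parsed["level"])
print(parsed["reason"])
```

Running the same function over both nodes' logs and sorting by timestamp would show whether the failure on one node always precedes the message on the other.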

Node2

This is the node with the shard that fails.
------

[2014-04-23 15:43:14,141][INFO ][cluster.metadata         ] [servicebus-es1.example.com] [product] update_mapping [product] (dynamic)
[2014-04-23 15:46:00,040][INFO ][cluster.metadata         ] [servicebus-es1.example.com] [product] update_mapping [product] (dynamic)
[2014-04-23 15:47:26,884][WARN ][transport                ] [servicebus-es1.example.com] Received response for a request that has timed out, sent [31301ms] ago, timed out [1300ms] ago, action [discovery/zen/fd/ping], node [[servicebus-PROD-tomcatapps3.example.com][ewvSIKmYRfS2qKiojcUw7w][tomcatapps3.example.com][inet[/10.0.0.119:9300]]{client=true, data=false, local=false}], id [160544]
[2014-04-23 16:30:38,734][INFO ][cluster.service          ] [servicebus-es1.example.com] removed {[servicebus-PROD-tomcatapps3.example.com][ewvSIKmYRfS2qKiojcUw7w][tomcatapps3.example.com][inet[/10.0.0.119:9300]]{client=true, data=false, local=false},}, reason: zen-disco-node_left([servicebus-PROD-tomcatapps3.example.com][ewvSIKmYRfS2qKiojcUw7w][tomcatapps3.example.com][inet[/10.0.0.119:9300]]{client=true, data=false, local=false})
[2014-04-23 16:30:49,756][INFO ][cluster.service          ] [servicebus-es1.example.com] added {[servicebus-PROD-tomcatapps3.example.com][ixCfue8YRMq3XhOOcdlmEg][tomcatapps3.example.com][inet[/10.0.0.119:9300]]{client=true, data=false, local=false},}, reason: zen-disco-receive(join from node[[servicebus-PROD-tomcatapps3.example.com][ixCfue8YRMq3XhOOcdlmEg][tomcatapps3.example.com][inet[/10.0.0.119:9300]]{client=true, data=false, local=false}])
[2014-04-23 16:35:51,054][WARN ][index.translog           ] [servicebus-es1.example.com] [product][4] failed to flush shard on translog threshold
org.elasticsearch.index.engine.FlushFailedEngineException: [product][4] Flush failed

This file has been truncated.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/371aa3d3-1b02-4fb5-bad7-b6217e09fb6a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

spinscale · April 26, 2014, 12:32am

Hey,

Do you run any additional plugins, or is this stock Elasticsearch? Can you
describe what is happening on your cluster? Do you have long-running
queries or operations? Can you tell us more about what this loader executes,
and how?

Minor operational hint: upgrade your JVM to the latest version or to _25;
the one you are using could lead to data corruption with Lucene.
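
A quick way to check every node for the JVM build mentioned in the original post is to parse the first line that `java -version` prints. This is a rough sketch under stated assumptions: it expects the legacy `1.<major>.0_<update>` version string, the function names are hypothetical, and `REPORTED_BUILD` simply records the build named in this thread (1.7.0 update 51) rather than a list of known-bad versions.

```python
import re


def parse_java_version_line(line):
    """Extract (major, update) from a line such as: java version "1.7.0_51"."""
    match = re.search(r'version "1\.(\d+)\.\d+(?:_(\d+))?', line)
    if match is None:
        return None
    # A missing update suffix is treated as update 0.
    return int(match.group(1)), int(match.group(2) or 0)


# The build reported in the original post; the advice above is to move off it.
REPORTED_BUILD = (7, 51)


def is_reported_build(line):
    """True if the line describes the same JVM build as the one in the post."""
    return parse_java_version_line(line) == REPORTED_BUILD


print(parse_java_version_line('java version "1.7.0_51"'))
print(is_reported_build('java version "1.7.0_25"'))
```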

--Alex

