I added some more servers to our ES 0.19.8 cluster over the weekend. All went smoothly, except for one index. The logs on one of the nodes hosting the index have the following warning:
[2013-09-02 00:18:04,082][WARN ][cluster.action.shard ] [Fang] sending failed shard for [production_tweets_for_2013_36][1], node[NA4cKJBnQ3WpQjXj8waaGg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[production_tweets_for_2013_36][1] Failed to fetch index version after copying it over]; nested: LockObtainFailedException[Lock obtain timed out: NativeFSLock@/opt/production-es/elasticsearch/data/production-es/nodes/0/indices/production_tweets_for_2013_36/1/index/write.lock]; ]]
[2013-09-02 00:18:05,182][WARN ][indices.cluster ] [Fang] [production_tweets_for_2013_36][1] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [production_tweets_for_2013_36][1] Failed to fetch index version after copying it over
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:123)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/opt/production-es/elasticsearch/data/production-es/nodes/0/indices/production_tweets_for_2013_36/1/index/write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:84)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1098)
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:116)
... 4 more
This is repeated a few times.
Looking at ES Head, I can see that shard 1 has a primary on that node, which seems to be stuck in INITIALIZING, and there's an unallocated replica. We run with 5 shards, 1 replica. Cluster state is currently red - I guess because ES can't create that replica.
I can see the write.lock file mentioned in the error message sitting in that shard's data directory.
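In case it helps anyone looking at something similar: the same picture (red cluster, one shard stuck initializing, replica unassigned) can also be pulled from the cluster health API with shard-level detail instead of ES Head. A minimal sketch in Python, assuming ES is reachable on localhost:9200 and that your version's health API accepts the level=shards parameter (adjust host/port for your setup):

import json
from urllib.request import urlopen

# Shard-level cluster health: 'level=shards' asks ES to break the report
# down per index and per shard, so a stuck INITIALIZING primary or an
# unassigned replica shows up directly. Host/port are assumptions.
url = "http://localhost:9200/_cluster/health?level=shards"
health = json.load(urlopen(url))

print("cluster status:", health["status"])
for index_name, index_health in health["indices"].items():
    for shard_id, shard_health in index_health["shards"].items():
        if shard_health.get("status") != "green":
            print(index_name, "shard", shard_id, shard_health)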
OK - I needed to get things running quickly, so I've taken that node out of the cluster. I won't add it back for now, so we can inspect its state if necessary. We did lose some data, but I can reindex that from source.
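For the record, "reindex from source" here just means replaying the documents through the bulk API. A rough sketch of the idea in Python; the index name is the one from the logs above, but the 'tweet' type, document shape, IDs, and host are placeholders for illustration, not our actual setup:

import json
from urllib.request import Request, urlopen

# Replay documents from the original source through the bulk API.
# The doc shape, type name and IDs below are placeholders.
docs = [{"id": 1, "text": "example tweet"},
        {"id": 2, "text": "another example tweet"}]

lines = []
for doc in docs:
    # One action line plus one source line per document, newline-delimited.
    lines.append(json.dumps({"index": {"_index": "production_tweets_for_2013_36",
                                       "_type": "tweet",
                                       "_id": doc["id"]}}))
    lines.append(json.dumps(doc))
body = ("\n".join(lines) + "\n").encode("utf-8")

req = Request("http://localhost:9200/_bulk", data=body,
              headers={"Content-Type": "application/x-ndjson"})
print(urlopen(req).read().decode("utf-8"))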
We think we've traced this to a bug in our deployment script that left the maximum number of file handles at 1024 on the new machines. One of our app servers was logging 'too many open files' errors coming back from ES. We've corrected that on the remaining new servers, and things seem to be running a bit more smoothly now.
I'll keep that node out of the cluster for the next few hours in case someone wants to know more about it.
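If anyone else hits this: a quick way to confirm the limit the running ES process actually ended up with (rather than what your login shell reports) is to read /proc/<pid>/limits. A small sketch, Linux-only, and the pgrep pattern is an assumption about how the process is named on your machine:

import subprocess

# Find a running Elasticsearch JVM and print its open-files limit straight
# from /proc/<pid>/limits (Linux only). The pgrep pattern is a guess.
pids = subprocess.check_output(["pgrep", "-f", "elasticsearch"]).split()
pid = pids[0].decode("utf-8")

with open("/proc/%s/limits" % pid) as limits:
    for line in limits:
        if line.startswith("Max open files"):
            # Soft and hard limits; 1024 here would reproduce our problem.
            print(line.rstrip())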