Initializing and locks

Hi,

I added some more servers to our ES 0.19.8 cluster over the weekend. All went smoothly, except for one index. The logs on one of the nodes hosting the index have the following warning:

[2013-09-02 00:18:04,082][WARN ][cluster.action.shard ] [Fang] sending failed shard for [production_tweets_for_2013_36][1], node[NA4cKJBnQ3WpQjXj8waaGg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[production_tweets_for_2013_36][1] Failed to fetch index version after copying it over]; nested: LockObtainFailedException[Lock obtain timed out: NativeFSLock@/opt/production-es/elasticsearch/data/production-es/nodes/0/indices/production_tweets_for_2013_36/1/index/write.lock]; ]]
[2013-09-02 00:18:05,182][WARN ][indices.cluster ] [Fang] [production_tweets_for_2013_36][1] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [production_tweets_for_2013_36][1] Failed to fetch index version after copying it over
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:123)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/opt/production-es/elasticsearch/data/production-es/nodes/0/indices/production_tweets_for_2013_36/1/index/write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:84)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1098)
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:116)
... 4 more

This is repeated a few times.

Looking at ES Head, I can see that shard 1 has a primary on that node, which seems to be stuck in INITIALIZING, and there's an unallocated replica. We run with 5 shards, 1 replica. Cluster state is currently red - I guess because ES can't create that replica.
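For anyone following along, the same shard counts that ES Head shows are available from the cluster health API. A minimal Python sketch (the localhost:9200 address is just a placeholder for any node in the cluster):

import json
import urllib.request

# Placeholder address; point this at any node in the cluster.
ES = "http://localhost:9200"

def cluster_health():
    """Fetch the standard cluster health summary."""
    with urllib.request.urlopen(ES + "/_cluster/health") as resp:
        return json.load(resp)

health = cluster_health()
print("cluster status:      ", health["status"])
print("initializing shards: ", health["initializing_shards"])
print("unassigned shards:   ", health["unassigned_shards"])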

I can see the write.lock file in the data directory mentioned in the error message:

-rw-r--r-- 1 production-es production-es 8 Sep 1 23:59 _checksums-1378079941931
-rw-r--r-- 1 production-es production-es 58 Sep 1 23:59 segments_6w3
-rw-r--r-- 1 production-es production-es 20 Sep 1 23:59 segments.gen
-rw-r--r-- 1 production-es production-es 0 Sep 1 23:59 write.lock
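As a sanity check on whether that write.lock is actually held by a live process or has just been left behind, my understanding is that NativeFSLock (named in the error above) is an OS-level file lock, so it can be probed from another process on the same node with a non-blocking lock attempt. A rough Python sketch, assuming fcntl-style locking on Linux and run as a user that can open the file:

import errno
import fcntl

# Path taken from the error message above.
LOCK_PATH = ("/opt/production-es/elasticsearch/data/production-es/nodes/0/"
             "indices/production_tweets_for_2013_36/1/index/write.lock")

def lock_is_held(path):
    """True if another process currently holds an fcntl lock on the file."""
    with open(path, "r+") as f:
        try:
            fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            if exc.errno in (errno.EAGAIN, errno.EACCES):
                return True
            raise
        fcntl.lockf(f, fcntl.LOCK_UN)  # we got the lock, so nothing else held it
        return False

print("write.lock held by another process:", lock_is_held(LOCK_PATH))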

What's the best way forward here?

(The cluster now has 6 nodes).

Thanks,
Dan

--
Dan Fairs | dan.fairs@gmail.com | @danfairs | secondsync.com

OK - I needed to get things running quickly, so I've taken that node out of the cluster. I won't add it back for now, so we can inspect its state if necessary. We did lose some data, but I can reindex that from source.

Cheers,
Dan

Sorry to keep replying to my own post!

We think we've tracked this down to a bug in our deployment script, which left the maximum number of open file handles at the default of 1024 on the new machines. One of our app servers was logging 'too many open files' in an error coming back from ES. We've corrected that on the remaining new servers, and things seem to be running a bit more smoothly now.
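In case it's useful to anyone else who hits this: the limit and the current usage are easy to check against the running ES process via /proc (Linux-specific). A small Python sketch; pass the Elasticsearch PID on the command line and run it as root or as the user ES runs as:

import os
import sys

def nofile_limits(pid):
    """Read the 'Max open files' soft/hard limits of a running process from /proc."""
    with open("/proc/%d/limits" % pid) as f:
        for line in f:
            if line.startswith("Max open files"):
                parts = line.split()
                return parts[3], parts[4]  # soft limit, hard limit
    raise RuntimeError("'Max open files' line not found")

def open_fd_count(pid):
    """Count the file descriptors the process currently has open."""
    return len(os.listdir("/proc/%d/fd" % pid))

if __name__ == "__main__":
    pid = int(sys.argv[1])  # the Elasticsearch process id
    soft, hard = nofile_limits(pid)
    print("soft limit:", soft, "hard limit:", hard, "currently open:", open_fd_count(pid))

If the soft limit still reads 1024, the ulimit change hasn't taken effect for the running process and the node needs restarting under the corrected limits.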

I'll keep that node out of the cluster for the next few hours in case someone wants to know more about it.

Cheers,
Dan
