I added some more servers to our ES 0.19.8 cluster over the weekend. All went smoothly, except for one index. The logs on one of the nodes hosting the index have the following warning:
[2013-09-02 00:18:04,082][WARN ][cluster.action.shard ] [Fang] sending failed shard for [production_tweets_for_2013_36][1], node[NA4cKJBnQ3WpQjXj8waaGg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[production_tweets_for_2013_36][1] Failed to fetch index version after copying it over]; nested: LockObtainFailedException[Lock obtain timed out: NativeFSLock@/opt/production-es/elasticsearch/data/production-es/nodes/0/indices/production_tweets_for_2013_36/1/index/write.lock]; ]]
[2013-09-02 00:18:05,182][WARN ][indices.cluster ] [Fang] [production_tweets_for_2013_36][1] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [production_tweets_for_2013_36][1] Failed to fetch index version after copying it over
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:123)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/opt/production-es/elasticsearch/data/production-es/nodes/0/indices/production_tweets_for_2013_36/1/index/write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:84)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1098)
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:116)
... 4 more
This is repeated a few times.
Looking at ES Head, I can see that shard 1 has a primary on that node, which seems to be stuck in INITIALIZING, and there's an unallocated replica. We run with 5 shards, 1 replica. Cluster state is currently red - I guess because ES can't create that replica.
I can see the write.lock file mentioned in the error message sitting in that shard's data directory.
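In case it helps anyone looking at something similar: the same picture (red cluster, one shard stuck initializing, replica unassigned) can also be pulled from the cluster health API with shard-level detail instead of ES Head. A minimal sketch in Python, assuming ES is reachable on localhost:9200 and that your version's health API accepts the level=shards parameter (adjust host/port for your setup):

import json
from urllib.request import urlopen

# Shard-level cluster health: 'level=shards' asks ES to break the report
# down per index and per shard, so a stuck INITIALIZING primary or an
# unassigned replica shows up directly. Host/port are assumptions.
url = "http://localhost:9200/_cluster/health?level=shards"
health = json.load(urlopen(url))

print("cluster status:", health["status"])
for index_name, index_health in health["indices"].items():
    for shard_id, shard_health in index_health["shards"].items():
        if shard_health.get("status") != "green":
            print(index_name, "shard", shard_id, shard_health)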
OK - I needed to get things running quickly, so I've taken that node out of the cluster. I won't add it back for now, so we can inspect its state if necessary. We did lose some data, but I can reindex that from source.
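For the record, "reindex from source" here just means replaying the documents through the bulk API. A rough sketch of the idea in Python; the index name is the one from the logs above, but the 'tweet' type, document shape, IDs, and host are placeholders for illustration, not our actual setup:

import json
from urllib.request import Request, urlopen

# Replay documents from the original source through the bulk API.
# The doc shape, type name and IDs below are placeholders.
docs = [{"id": 1, "text": "example tweet"},
        {"id": 2, "text": "another example tweet"}]

lines = []
for doc in docs:
    # One action line plus one source line per document, newline-delimited.
    lines.append(json.dumps({"index": {"_index": "production_tweets_for_2013_36",
                                       "_type": "tweet",
                                       "_id": doc["id"]}}))
    lines.append(json.dumps(doc))
body = ("\n".join(lines) + "\n").encode("utf-8")

req = Request("http://localhost:9200/_bulk", data=body,
              headers={"Content-Type": "application/x-ndjson"})
print(urlopen(req).read().decode("utf-8"))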
We think we've traced this to a bug in our deployment script that left the maximum number of file handles at 1024 on the new machines. One of our app servers was logging 'too many open files' errors coming back from ES. We've corrected that on the remaining new servers, and things seem to be running a bit more smoothly now.
I'll keep that node out of the cluster for the next few hours in case someone wants to know more about it.
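If anyone else hits this: a quick way to confirm the limit the running ES process actually ended up with (rather than what your login shell reports) is to read /proc/<pid>/limits. A small sketch, Linux-only, and the pgrep pattern is an assumption about how the process is named on your machine:

import subprocess

# Find a running Elasticsearch JVM and print its open-files limit straight
# from /proc/<pid>/limits (Linux only). The pgrep pattern is a guess.
pids = subprocess.check_output(["pgrep", "-f", "elasticsearch"]).split()
pid = pids[0].decode("utf-8")

with open("/proc/%s/limits" % pid) as limits:
    for line in limits:
        if line.startswith("Max open files"):
            # Soft and hard limits; 1024 here would reproduce our problem.
            print(line.rstrip())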