My online cluster frequently suffers from a lot of LockObtainFailedExceptions

One of my online clusters runs Elasticsearch (2.3.3) started by the Linux supervise tool. Perhaps because of a network problem, or perhaps after a full GC, a lot of LockObtainFailedExceptions occurred on some nodes and many shards failed to be created. All the logs look like the following:

ElasticsearchException[failed to create shard]; nested: LockObtainFailedException[Can't lock shard [wallet-bi-usertags-pass][40], timed out after 5000ms];
    at org.elasticsearch.index.IndexService.createShard(IndexService.java:389)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyInitializingShard(IndicesClusterStateService.java:602)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewOrUpdatedShards(IndicesClusterStateService.java:502)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:167)
    at org.elasticsearch.cluster.service.InternalClusterService.runTasksForExecutor(InternalClusterService.java:616)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:778)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.store.LockObtainFailedException: Can't lock shard [wallet-bi-usertags-pass][40], timed out after 5000ms
    at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:623)
    at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:551)
    at org.elasticsearch.index.IndexService.createShard(IndexService.java:306)
    ... 10 more

I suspect that when Elasticsearch shuts down and is then immediately pulled back up by the supervise process, the Lucene write.lock has not yet been released by the old Elasticsearch JVM process.
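To illustrate what I suspect, here is a minimal sketch of the underlying Lucene behavior, written against the Lucene 5.x API that Elasticsearch 2.3.3 bundles (the /tmp/lock-demo path and class name are just placeholders for the demo, not anything from my cluster): while one writer still holds write.lock on a data directory, a second writer on the same path fails with a LockObtainFailedException, and I think the same thing happens when a half-dead node process still holds the lock while the new one starts.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

public class WriteLockDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder path; any empty directory works for this demo.
        FSDirectory dir = FSDirectory.open(Paths.get("/tmp/lock-demo"));

        // The first IndexWriter acquires write.lock on the directory.
        IndexWriter holder = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        // A second writer on the same path (here in the same JVM, but the
        // situation is analogous when an old node process has not exited yet)
        // cannot obtain write.lock and fails with LockObtainFailedException.
        try (FSDirectory sameDir = FSDirectory.open(Paths.get("/tmp/lock-demo"))) {
            new IndexWriter(sameDir, new IndexWriterConfig(new StandardAnalyzer()));
        } catch (LockObtainFailedException e) {
            System.out.println("write.lock already held: " + e.getMessage());
        }

        // Only after the first writer closes is write.lock released.
        holder.close();
        dir.close();
    }
}
```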
Has anyone run into the same situation? My current workaround is to kill the Elasticsearch process and let it restart, after which everything is fine, but the problem comes back again another day.
Please help. :sob:
