ES Cluster State Red - cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy

JVM version: java version "1.8.0_05"

ES version: 5.5.1

Plugins installed: [repository-gcs, repository-s3, x-pack, repository-azure]

I have a cluster of 106 nodes.

One of the shards suddenly went into a red state.

Master Logs:
[2017-08-24T15:52:38,975][WARN ][o.e.c.a.s.ShardStateAction] [10.34.230.205] [discovery_details_45][2] received shard failed for shard id [[discovery_details_45][2]], allocation id [IbOY64SOQZ2128g6SIz6PQ], primary term [0], message [shard failure, reason [merge failed]], failure [NotSerializableExceptionWrapper[merge_exception: java.io.IOException: No space left on device]; nested: IOException[No space left on device]; ]
org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: merge_exception: java.io.IOException: No space left on device
at org.elasticsearch.index.engine.InternalEngine$EngineMergeScheduler$2.doRun(InternalEngine.java:1548) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.5.1.jar:5.5.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_05]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_05]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_05]
Caused by: java.io.IOException: No space left on device
at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[?:?]
at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60) ~[?:?]
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[?:?]
at sun.nio.ch.IOUtil.write(IOUtil.java:65) ~[?:?]
at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:195) ~[?:?]
at java.nio.channels.Channels.writeFullyImpl(Channels.java:78) ~[?:1.8.0_05]
at java.nio.channels.Channels.writeFully(Channels.java:101) ~[?:1.8.0_05]
at java.nio.channels.Channels.access$000(Channels.java:61) ~[?:1.8.0_05]
at java.nio.channels.Channels$1.write(Channels.java:174) ~[?:1.8.0_05]
at org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:419) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at java.util.zip.CheckedOutputStream.write(CheckedOutputStream.java:73) ~[?:1.8.0_05]
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) ~[?:1.8.0_05]
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126) ~[?:1.8.0_05]
at org.apache.lucene.store.OutputStreamIndexOutput.writeBytes(OutputStreamIndexOutput.java:53) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.store.RateLimitedIndexOutput.writeBytes(RateLimitedIndexOutput.java:73) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.store.DataOutput.writeBytes(DataOutput.java:52) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.store.RAMOutputStream.writeTo(RAMOutputStream.java:86) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlock(BlockTreeTermsWriter.java:822) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:604) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.pushTerm(BlockTreeTermsWriter.java:907) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:871) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:344) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:105) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:164) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:216) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:101) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4356) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3931) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]

Output of the allocation explain API:

unassigned_info: {
reason: "ALLOCATION_FAILED",
at: "2017-08-24T11:04:03.079Z",
failed_allocation_attempts: 13,
details: "failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[discovery_details_45][2]: obtaining shard lock timed out after 5000ms]; ",
last_allocation_status: "no"
},
can_allocate: "no",
allocate_explanation: "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
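
For reference, the output above came from a request roughly like the following (index and shard number taken from the shard id in the master logs; the primary flag and host/port are assumptions):

curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d '
{
  "index": "discovery_details_45",
  "shard": 2,
  "primary": true
}'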

I have tried the manual retry using curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed' multiple times, but it is not working.

Also, there is enough space left on the device.
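
(For what it's worth, per-node disk usage can be cross-checked with the cat allocation API, e.g.:

curl -XGET 'localhost:9200/_cat/allocation?v'

which lists disk.used / disk.avail / disk.percent for every data node.)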

I was running multiple reindex jobs on this cluster.

Hi akshaymaniyar,

Not sure how much help this will be...

Some observations:

  • Looks like you are running Elasticsearch version 5.5.1, not 5.5.0 (probably just a typo)
  • Recommended JVM version for Elasticsearch 5.x is 1.8.0_131 or later, as far as I know (a quick way to check both versions is shown below)
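
A quick sanity check of both versions, assuming the default host and port:

java -version
curl -XGET 'localhost:9200'   # the "version" : { "number" : ... } field is the ES version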

The issue might be something similar to Red Cluster State: failed to obtain in-memory shard lock · Issue #23199 · elastic/elasticsearch · GitHub

The recommended fix was:

curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed'

Good luck!

-AB


Sorry, a few typos:
java version "1.8.0_05"
ES version: 5.5.1

I tried this command multiple times (curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed'). It is not working.

Elasticsearch is still not able to obtain the in-memory lock on the shard.

Sorry, I should have read the original post to the very end. You did say you had already run that command...

I don't really have any other suggestions...

How many nodes do you have in total? Do you use Kibana or some other monitoring tool? Any more info in Kibana > Monitoring > Overview > Shard Activity (if you have it)?

-AB

It seems the reindex task that was running was holding the shard lock. As soon as I cancelled the reindex task and fired the cluster reroute API, the cluster was green again.
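
In case it helps someone else, the steps were roughly as follows (the node_id:task_id placeholder is whatever the tasks API reports for the running reindex):

# find the running reindex task(s)
curl -XGET 'localhost:9200/_tasks?actions=*reindex&detailed=true&pretty'

# cancel the task that was holding the shard lock
curl -XPOST 'localhost:9200/_tasks/<node_id>:<task_id>/_cancel'

# retry the failed allocations
curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed'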

Why this situation arose in the first place still needs to be found out, though.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.