ES Cluster State Red - cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy

akshaymaniyar · August 24, 2017, 11:16am

JVM version: java version "1.8.0_05"

ES version :5.5.1

Plugins installed: [repository-gcs, repository-s3, x-pack, repository-azure]

I have a cluster of 106 nodes.

One of the shards suddenly went to red state.

Master Logs:
[2017-08-24T15:52:38,975][WARN ][o.e.c.a.s.ShardStateAction] [10.34.230.205] [discovery_details_45][2] received shard failed for shard id [[discovery_details_45][2]], allocation id [IbOY64SOQZ2128g6SIz6PQ], primary term [0], message [shard failure, reason [merge failed]], failure [NotSerializableExceptionWrapper[merge_exception: java.io.IOException: No space left on device]; nested: IOException[No space left on device]; ]
org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: merge_exception: java.io.IOException: No space left on device
at org.elasticsearch.index.engine.InternalEngine$EngineMergeScheduler$2.doRun(InternalEngine.java:1548) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.5.1.jar:5.5.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.5.1.jar:5.5.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_05]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_05]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_05]
Caused by: java.io.IOException: No space left on device
at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[?:?]
at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60) ~[?:?]
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[?:?]
at sun.nio.ch.IOUtil.write(IOUtil.java:65) ~[?:?]
at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:195) ~[?:?]
at java.nio.channels.Channels.writeFullyImpl(Channels.java:78) ~[?:1.8.0_05]
at java.nio.channels.Channels.writeFully(Channels.java:101) ~[?:1.8.0_05]
at java.nio.channels.Channels.access$000(Channels.java:61) ~[?:1.8.0_05]
at java.nio.channels.Channels$1.write(Channels.java:174) ~[?:1.8.0_05]
at org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:419) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at java.util.zip.CheckedOutputStream.write(CheckedOutputStream.java:73) ~[?:1.8.0_05]
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) ~[?:1.8.0_05]
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126) ~[?:1.8.0_05]
at org.apache.lucene.store.OutputStreamIndexOutput.writeBytes(OutputStreamIndexOutput.java:53) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.store.RateLimitedIndexOutput.writeBytes(RateLimitedIndexOutput.java:73) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.store.DataOutput.writeBytes(DataOutput.java:52) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.store.RAMOutputStream.writeTo(RAMOutputStream.java:86) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlock(BlockTreeTermsWriter.java:822) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:604) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.pushTerm(BlockTreeTermsWriter.java:907) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:871) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:344) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:105) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:164) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:216) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:101) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4356) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3931) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]

Output of explain allocation API:

unassigned_info: {
reason: "ALLOCATION_FAILED",
at: "2017-08-24T11:04:03.079Z",
failed_allocation_attempts: 13,
details: "failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[discovery_details_45][2]: obtaining shard lock timed out after 5000ms]; ",
last_allocation_status: "no"
},
can_allocate: "no",
allocate_explanation: "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
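
For readers who want to reproduce output like the above, here is a minimal Python sketch of a call to the cluster allocation explain API for this shard. The host and port are taken from the curl command in this thread; `primary=True` is an assumption, and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: same host/port as the curl command used in this thread.
ES_URL = "http://localhost:9200"


def build_explain_request(index, shard, primary=True, base_url=ES_URL):
    """Build (but do not send) a cluster allocation explain request."""
    body = json.dumps(
        {"index": index, "shard": shard, "primary": primary}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_cluster/allocation/explain",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def explain_allocation(index, shard, primary=True, base_url=ES_URL):
    """Send the request and return the decoded JSON explanation."""
    req = build_explain_request(index, shard, primary, base_url)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Needs a reachable cluster; index and shard number come from the log above.
    result = explain_allocation("discovery_details_45", 2)
    print(json.dumps(result, indent=2))
```

The request is built separately from the network call so the URL and body can be inspected without a live cluster.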

I have tried the manual retry using curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed' multiple times, but it does not work.

Also, there is enough space left on the device.
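
Because the merge failure reported "No space left on device", it may be worth checking free disk space on every data node rather than on a single host. The following is a rough sketch, assuming the `_cat/allocation` API is queried with `format=json` and returns `disk.percent` as a string. The 85 percent threshold and the function names are illustrative assumptions.

```python
import json
import urllib.request


def low_disk_nodes(rows, max_percent=85):
    """Return (node, percent) pairs at or above max_percent, fullest first.

    `rows` is the decoded JSON list from _cat/allocation?format=json.
    """
    flagged = []
    for row in rows:
        pct = row.get("disk.percent")
        if pct is None:
            # Rows without disk figures (e.g. unassigned shards) are skipped.
            continue
        if int(pct) >= max_percent:
            flagged.append((row["node"], int(pct)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


def fetch_allocation(base_url="http://localhost:9200"):
    """Fetch per-node disk usage; requires a reachable cluster."""
    url = f"{base_url}/_cat/allocation?format=json&h=node,disk.percent,disk.avail"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


if __name__ == "__main__":
    for node, pct in low_disk_nodes(fetch_allocation()):
        print(f"{node}: {pct}% disk used")
```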

I was running multiple reindex jobs on this cluster.

A_B · August 24, 2017, 12:03pm

Hi akshaymaniyar,

not sure how much help this will be...

Some observations:

Looks like you are running Elasticsearch version 5.5.1, not 5.5.0 (probably just a typo)
Recommended JVM version for Elasticsearch 5.x is 1.8.0_131 or later as far as I know

The issue might be something similar to Red Cluster State: failed to obtain in-memory shard lock · Issue #23199 · elastic/elasticsearch · GitHub

Recommended fix was

curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed'

Good luck!

-AB

akshaymaniyar · August 24, 2017, 12:10pm

Sorry, a few typos:
java version "1.8.0_05"
ES version :5.5.1

Tried this command multiple times (curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed'). It is not working.

It is not able to obtain a lock on the shard.

A_B · August 24, 2017, 12:27pm

Sorry, should have read the original post to the very end. You did say you had run that command already...

I don't really have any other suggestions...

How many nodes do you have in total? Do you use Kibana or some other monitoring tool? Any more info in Kibana > Monitoring > Overview > Shard Activity (if you have it)?

-AB

akshaymaniyar · August 24, 2017, 5:13pm

It seems the running reindex task was holding the shard lock. As soon as I cancelled the reindex task and fired the cluster reroute API, the cluster was green again.

Why this situation arose in the first place still needs to be found out.
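
The fix described above (cancel the reindex task, then retry the failed allocation) could be scripted roughly as follows. This is a sketch under assumptions: it cancels every reindex task it finds, which may be broader than what was done here, and the helper names are hypothetical. Cancellation may take a moment to take effect, so the reroute may need repeating.

```python
import json
import urllib.request

# Assumption: same host/port as the curl command used in this thread.
ES_URL = "http://localhost:9200"


def reindex_task_ids(tasks_response):
    """Collect task ids from a GET _tasks?actions=*reindex response."""
    ids = []
    for node in tasks_response.get("nodes", {}).values():
        ids.extend(node.get("tasks", {}).keys())
    return sorted(ids)


def cancel_url(task_id, base_url=ES_URL):
    """URL of the task cancel endpoint for one task id."""
    return f"{base_url}/_tasks/{task_id}/_cancel"


def retry_failed_url(base_url=ES_URL):
    """URL that asks the master to retry failed shard allocations."""
    return f"{base_url}/_cluster/reroute?retry_failed=true"


def _call(method, url):
    """Send a body-less request and decode the JSON response."""
    req = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def cancel_reindex_and_retry(base_url=ES_URL):
    """Cancel all running reindex tasks, then retry failed allocations."""
    tasks = _call("GET", f"{base_url}/_tasks?actions=*reindex&detailed=true")
    for task_id in reindex_task_ids(tasks):
        _call("POST", cancel_url(task_id, base_url))
    return _call("POST", retry_failed_url(base_url))


if __name__ == "__main__":
    # Needs a reachable cluster.
    print(cancel_reindex_and_retry())
```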

