Red status caused by stuck initializing_shards

elk2 · June 8, 2018, 3:46pm

I have a single index with a shard stuck in the initializing state, which leaves the cluster in red status. I tried to reassign the index to another node, without success, because the shard is not unassigned.
I have only primary shards and this is the status:

{
  "cluster_name" : "cluster",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 10,
  "number_of_data_nodes" : 7,
  "active_primary_shards" : 830,
  "active_shards" : 1083,
  "relocating_shards" : 0,
  "initializing_shards" : 1,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 99.90774907749078
}

How can I resume the problematic shard?
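
As a rough cross-check of the health output above, the counters can be summarized with a few lines of Python. This is only a sketch: the `summarize` helper is a hypothetical name, and it assumes the reported percentage is the active shard count divided by active plus initializing plus unassigned shards.

```python
import json

# Cluster health counters copied from the output above.
HEALTH = json.loads("""
{
  "status": "red",
  "active_primary_shards": 830,
  "active_shards": 1083,
  "relocating_shards": 0,
  "initializing_shards": 1,
  "unassigned_shards": 0,
  "active_shards_percent_as_number": 99.90774907749078
}
""")


def summarize(health):
    """Return how many shard copies are not active and the active percentage.

    Assumption: total = active + initializing + unassigned.
    """
    not_active = health["initializing_shards"] + health["unassigned_shards"]
    total = health["active_shards"] + not_active
    return {
        "not_active": not_active,
        "total": total,
        "percent_active": 100.0 * health["active_shards"] / total,
    }


summary = summarize(HEALTH)
print(summary)
```

Under that assumption the numbers are consistent with exactly one shard copy that is not active, which matches the single initializing shard.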

thiago · June 10, 2018, 12:33am

You can get a better understanding of why the shard is not being assigned with GET /_cluster/allocation/explain

elk2 · June 12, 2018, 10:17am

It doesn't work. I use Elasticsearch 2.4.2.

curl -X GET "localhost:9200/_cluster/allocation/explain" -H 'Content-Type: application/json' -d'
{
  "index": "myindex",
  "shard": 0,
  "primary": true
}
'
{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_expression","resource.id":"_cluster","index":"_cluster"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_expression","resource.id":"_cluster","index":"_cluster"},"status":404}

thiago · June 12, 2018, 11:04am

I asked only for the output of GET /_cluster/allocation/explain. You added a body to the request, which I didn't ask for. Please redo the request without a body.

thiago · June 12, 2018, 11:05am

Ah, OK. You use 2.x. That won't work anyway, because the allocation explain API was only added in 5.0.

thiago · June 12, 2018, 11:06am

Well, you need to check the logs to understand why shards are not assigned. It is going to be more difficult.

elk2 · June 12, 2018, 2:11pm

With the curl command I see an allocation failure:
curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason| grep INI
index_2018.01.21 2 p INITIALIZING ALLOCATION_FAILED
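
The same filtering can be done without grep by parsing the `_cat/shards` text. The sketch below assumes the column order requested above (`index,shard,prirep,state,unassigned.reason`); `parse_cat_shards` and `not_started` are hypothetical helper names, and the second sample row is made up for contrast.

```python
from typing import List, NamedTuple


class ShardRow(NamedTuple):
    index: str
    shard: int
    prirep: str
    state: str
    reason: str  # empty when the column is blank


def parse_cat_shards(text: str) -> List[ShardRow]:
    """Parse whitespace-separated _cat/shards rows, skipping malformed lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 4 or not parts[1].isdigit():
            continue  # not a shard row (blank line, progress meter, etc.)
        reason = parts[4] if len(parts) > 4 else ""
        rows.append(ShardRow(parts[0], int(parts[1]), parts[2], parts[3], reason))
    return rows


def not_started(rows: List[ShardRow]) -> List[ShardRow]:
    """Keep only shard copies that are not in the STARTED state."""
    return [row for row in rows if row.state != "STARTED"]


# First row is from the thread; the second row is a hypothetical healthy shard.
SAMPLE = """\
index_2018.01.21 2 p INITIALIZING ALLOCATION_FAILED
index_2018.01.21 1 p STARTED
"""

problem_rows = not_started(parse_cat_shards(SAMPLE))
for row in problem_rows:
    print(row.index, row.shard, row.prirep, row.state, row.reason)
```

Skipping lines whose second field is not a number also drops curl's progress meter if it ends up mixed into the captured output.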

If I go into the index translog folder, I have:

cd /var/elasticsearch/cluster/nodes/0/indices/index_2018.01.21/2/translog

20951 files

This is the cluster.log file on the node:

    [2018-06-12 16:17:33,044][WARN ][indices.cluster          ] [elk-node8] [[index_2018.01.21][2]] marking and sending shard failed due to [failed recovery]
    [index_2018.01.21][[index_2018.01.21][2]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to open reader on writer]; nested: IOException[Input/output error: NIOFSIndexInput(path="/var/elasticsearch/cluster/nodes/0/indices/index_2018.01.21/2/index/_kpv.cfs") [slice=_kpv_Lucene50_0.doc]]; nested: IOException[Input/output error];
            at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:250)
            at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
            at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    Caused by: [index_2018.01.21][[index_2018.01.21][2]] EngineCreationFailureException[failed to open reader on writer]; nested: IOException[Input/output error: NIOFSIndexInput(path="/var/elasticsearch/cluster/nodes/0/indices/index_2018.01.21/2/index/_kpv.cfs") [slice=_kpv_Lucene50_0.doc]]; nested: IOException[Input/output error];
            at org.elasticsearch.index.engine.InternalEngine.createSearcherManager(InternalEngine.java:292)
            at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:163)
            at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
            at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1513)
            at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1497)
            at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:970)
            at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:942)
            at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:241)
            ... 5 more
    Caused by: java.io.IOException: Input/output error: NIOFSIndexInput(path="/var/elasticsearch/cluster/nodes/0/indices/index_2018.01.21/2/index/_kpv.cfs") [slice=_kpv_Lucene50_0.doc]
            at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:189)
            at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:342)
            at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:54)
            at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)
            at org.apache.lucene.store.BufferedIndexInput.readInt(BufferedIndexInput.java:183)
            at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:194)
            at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:255)
            at org.apache.lucene.codecs.lucene50.Lucene50PostingsReader.<init>(Lucene50PostingsReader.java:86)
            at org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat.fieldsProducer(Lucene50PostingsFormat.java:443)
            at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:261)
            at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:341)
            at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:104)
            at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:65)
            at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
            at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:197)
            at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:99)
            at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:435)
            at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:100)
            at org.elasticsearch.index.engine.InternalEngine.createSearcherManager(InternalEngine.java:280)
            ... 12 more
    Caused by: java.io.IOException: Input/output error
            at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
            at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
            at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
            at sun.nio.ch.IOUtil.read(IOUtil.java:197)
            at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:741)
            at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:727)
            at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:179)
            ... 30 more
    [2018-06-12 16:17:33,301][WARN ][index.translog           ] [elk-node8] [index_2018.01.21][2] deleted previously created, but not yet committed, next generation [translog-4630.tlog]. This can happen due to a tragic exception when creating a new generation
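
For long traces like the one above, it can help to pull out just the innermost cause and the file it names. This is a minimal sketch using plain regular expressions; `affected_files` and `root_cause` are hypothetical helper names, and the sample text is an excerpt of the log lines above.

```python
import re

# Matches the path="..." attribute that Lucene prints for the failing input.
PATH_RE = re.compile(r'path="([^"]+)"')
# Matches each "Caused by:" line; the last one is the innermost cause.
CAUSE_RE = re.compile(r"^\s*Caused by: (.+)$", re.MULTILINE)


def affected_files(trace):
    """Return the distinct file paths mentioned in the trace, in order."""
    seen = []
    for path in PATH_RE.findall(trace):
        if path not in seen:
            seen.append(path)
    return seen


def root_cause(trace):
    """Return the text of the last "Caused by:" line, or None if absent."""
    causes = CAUSE_RE.findall(trace)
    return causes[-1].strip() if causes else None


# Excerpt of the stack trace from the thread.
TRACE = r'''
    Caused by: java.io.IOException: Input/output error: NIOFSIndexInput(path="/var/elasticsearch/cluster/nodes/0/indices/index_2018.01.21/2/index/_kpv.cfs") [slice=_kpv_Lucene50_0.doc]
            at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:189)
    Caused by: java.io.IOException: Input/output error
            at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
'''

print(root_cause(TRACE))
print(affected_files(TRACE))
```

Here the innermost cause is a plain `java.io.IOException: Input/output error` raised from a native read, and the only file named is the `_kpv.cfs` segment file of shard 2.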

thiago · June 12, 2018, 2:38pm

That means either disk corruption caused by hardware failure (you could check dmesg) or you simply ran out of disk space.
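
Both possibilities can be checked quickly on the affected node. The following is a rough sketch, not a complete diagnostic: `io_error_lines` and `free_fraction` are hypothetical helper names, the match on the text "I/O error" is a simplification of what the kernel may log, and the sample dmesg lines are illustrative rather than taken from this cluster.

```python
import re
import shutil

# Simplified assumption: kernel messages about failed reads contain "I/O error".
IO_ERROR_RE = re.compile(r"I/O error", re.IGNORECASE)


def io_error_lines(dmesg_text):
    """Return the lines of dmesg output that mention an I/O error."""
    return [line for line in dmesg_text.splitlines() if IO_ERROR_RE.search(line)]


def free_fraction(path):
    """Return the fraction of free space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


# Illustrative dmesg lines (hypothetical device names and sector numbers).
SAMPLE_DMESG = """\
[12345.678901] blk_update_request: I/O error, dev sdb, sector 123456
[12346.000000] EXT4-fs (sdb1): mounted filesystem with ordered data mode
"""

suspicious = io_error_lines(SAMPLE_DMESG)
print(suspicious)
print(f"free space on current filesystem: {free_fraction('.'):.1%}")
```

On the node itself, the text would come from running `dmesg`, and the path passed to `free_fraction` would be the Elasticsearch data directory (here `/var/elasticsearch/cluster`).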

elk2 · June 19, 2018, 9:38am

Can I recover the shard, or must I delete it to get the cluster back to green status?

thiago · June 19, 2018, 11:35am

The cluster is red because there are indices in a red state. Until this is solved, the cluster will stay red.

As for recovering the shard, it's hard to tell, because there are indications of hardware problems.

elk2 · June 21, 2018, 1:36pm

Can I try anything else to recover at least part of the data?

thiago · June 21, 2018, 2:20pm

The logs here show that shard 2 of index index_2018.01.21 is corrupted. But there might be others as well.

You could try what is described here: "Corrupted elastic index", but if there is an underlying hardware issue, it won't help much.
