Red status caused by stuck initializing_shards


(piter) #1

I have a single index with a shard stuck in the initializing state, which leaves the cluster in red status. I tried to reassign the shard to another node, without success, because it is not an unassigned shard.
I have only primary shards and this is the status:

"cluster_name" : "cluster",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 10,
  "number_of_data_nodes" : 7,
  "active_primary_shards" : 830,
  "active_shards" : 1083,
  "relocating_shards" : 0,
  "initializing_shards" : 1,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 99.90774907749078

How can I resume the problematic shard?


(Thiago Souza) #2

You can better understand why the shard is not being assigned with GET /_cluster/allocation/explain


(piter) #3

That doesn't work. I use Elasticsearch 2.4.2.

curl -X GET "localhost:9200/_cluster/allocation/explain" -H 'Content-Type: application/json' -d'
{
  "index": "myindex",
  "shard": 0,
  "primary": true
}
'
{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","resource.type":"index_expression","resource.id":"_cluster","index":"_cluster"}],"type":"index_not_found_exception","reason":"no such index","resource.type":"index_expression","resource.id":"_cluster","index":"_cluster"},"status":404}

(Thiago Souza) #4

I asked only for the output of GET /_cluster/allocation/explain. You added a body to the request, which I didn't ask for. Please redo the request without a body.


(Thiago Souza) #5

Ah, OK. You are on 2.x, so that won't work anyway: the allocation explain API only exists from 5.0 onward.


(Thiago Souza) #6

Well, you need to check the logs to understand why the shard is not being assigned. That is going to be more difficult.
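As a rough illustration of what checking the logs could look like, here is a minimal sketch. It assumes the node log is a plain text file whose path you pass in (the location depends on the installation), and the search pattern is only a guess at typical wording, not an official list. The function name is hypothetical.

```shell
#!/bin/sh
# Print log lines that mention a shard together with "failed" or "corrupt".
# $1: path to an Elasticsearch node log file (assumption: plain text log).
find_shard_failures() {
    grep -i -E 'shard.*(failed|corrupt)' "$1"
}
```

Call it as `find_shard_failures /path/to/your/node.log` on the node that holds the stuck shard, then read the stack trace that follows the first matching line.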


(piter) #7

With the curl command I see an allocation failure:
curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason| grep INI
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
19 85728 19 16301 0 0 38331 0 0:00:02 --:--:-- 0:00:02 38355
index_2018.01.21 2 p INITIALIZING ALLOCATION_FAILED
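The curl progress meter in the output above can be suppressed with -s, and quoting the URL keeps the shell from interpreting the `?`. A minimal sketch, assuming the same localhost:9200 endpoint used in this thread; the function names are only illustrative:

```shell
#!/bin/sh
# Keep only rows whose shard state is not STARTED.
# Expects `_cat/shards` rows on stdin with the columns
# index, shard, prirep, state, unassigned.reason (state is column 4).
filter_not_started() {
    awk '$4 != "STARTED"'
}

# Hypothetical helper: query the node at localhost:9200 (assumption taken
# from the thread) and list every shard that is not STARTED.
# -s silences curl's progress meter; the quotes protect '?' from the shell.
list_problem_shards() {
    curl -s -XGET 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' \
        | filter_not_started
}
```

Running `list_problem_shards` would then show INITIALIZING, RELOCATING and UNASSIGNED shards in one pass, rather than only the lines matching INI.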

If I go into the shard's translog folder, I find:

cd /var/elasticsearch/cluster/nodes/0/indices/index_2018.01.21/2/translog

20951 files

This is the cluster.log file on the node:

    [2018-06-12 16:17:33,044][WARN ][indices.cluster          ] [elk-node8] [[index_2018.01.21][2]] marking and sending shard failed due to [failed recovery]
    [index_2018.01.21][[index_2018.01.21][2]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to open reader on writer]; nested: IOException[Input/output error: NIOFSIndexInput(path="/var/elasticsearch/cluster/nodes/0/indices/index_2018.01.21/2/index/_kpv.cfs") [slice=_kpv_Lucene50_0.doc]]; nested: IOException[Input/output error];
            at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:250)
            at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:56)
            at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:129)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    Caused by: [index_2018.01.21][[index_2018.01.21][2]] EngineCreationFailureException[failed to open reader on writer]; nested: IOException[Input/output error: NIOFSIndexInput(path="/var/elasticsearch/cluster/nodes/0/indices/index_2018.01.21/2/index/_kpv.cfs") [slice=_kpv_Lucene50_0.doc]]; nested: IOException[Input/output error];
            at org.elasticsearch.index.engine.InternalEngine.createSearcherManager(InternalEngine.java:292)
            at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:163)
            at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
            at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1513)
            at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1497)
            at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:970)
            at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:942)
            at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:241)
            ... 5 more
    Caused by: java.io.IOException: Input/output error: NIOFSIndexInput(path="/var/elasticsearch/cluster/nodes/0/indices/index_2018.01.21/2/index/_kpv.cfs") [slice=_kpv_Lucene50_0.doc]
            at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:189)
            at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:342)
            at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:54)
            at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)
            at org.apache.lucene.store.BufferedIndexInput.readInt(BufferedIndexInput.java:183)
            at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:194)
            at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:255)
            at org.apache.lucene.codecs.lucene50.Lucene50PostingsReader.<init>(Lucene50PostingsReader.java:86)
            at org.apache.lucene.codecs.lucene50.Lucene50PostingsFormat.fieldsProducer(Lucene50PostingsFormat.java:443)
            at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:261)
            at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:341)
            at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:104)
            at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:65)
            at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
            at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:197)
            at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:99)
            at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:435)
            at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:100)
            at org.elasticsearch.index.engine.InternalEngine.createSearcherManager(InternalEngine.java:280)
            ... 12 more
    Caused by: java.io.IOException: Input/output error
            at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
            at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
            at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
            at sun.nio.ch.IOUtil.read(IOUtil.java:197)
            at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:741)
            at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:727)
            at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:179)
            ... 30 more
    [2018-06-12 16:17:33,301][WARN ][index.translog           ] [elk-node8] [index_2018.01.21][2] deleted previously created, but not yet committed, next generation [translog-4630.tlog]. This can happen due to a tragic exception when creating a new generation

(Thiago Souza) #8

That means either disk corruption caused by a hardware failure (you could check dmesg) or you simply ran out of disk space.
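Both possibilities can be checked from a shell on the affected node. The following is a minimal sketch, not a complete diagnostic: the error patterns are an assumed, non-exhaustive list of kernel messages that often accompany a failing disk, the function names are hypothetical, and the default path is the data directory that appears in the log above.

```shell
#!/bin/sh
# Keep only kernel log lines (on stdin) that hint at a failing disk.
# The pattern list is an illustrative assumption, not exhaustive.
filter_disk_errors() {
    grep -i -E 'i/o error|medium error|bad sector'
}

# Scan the kernel ring buffer for disk-related errors.
check_kernel_log() {
    dmesg | filter_disk_errors
}

# Show free space on the filesystem that holds the Elasticsearch data path.
# $1: data path (defaults to the path seen in this thread's log).
check_disk_space() {
    df -h "${1:-/var/elasticsearch}"
}
```

If `check_kernel_log` prints I/O errors for the device behind the data path, the hardware explanation is the more likely one. A filesystem at 100% use in `check_disk_space` points to the disk-space explanation instead.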


(piter) #9

Can I recover the shard, or must I delete it to get the cluster back to green status?


(Thiago Souza) #10

The cluster is red because there are indices in red state. Until this is solved, the cluster will stay red.

As for recovering the shard, it's hard to tell, because there are indications of hardware problems.


(piter) #11

Can I try something else to recover at least part of the data?


(Thiago Souza) #12

The logs here show that shard 2 of index index_2018.01.21 is corrupted, but there might be others as well.

You could try what is described in the topic "Corrupted elastic index", but if there is an underlying hardware issue it won't help much.

