Index/shard API Unresponsive

Hi,
We have an old cluster running version 1.3.2 with 6 data nodes and 2 masters. At the moment, the cluster is red with two shards in an initialising state and two unassigned:

{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 8,
  "number_of_data_nodes" : 6,
  "active_primary_shards" : 71,
  "active_shards" : 140,
  "relocating_shards" : 0,
  "initializing_shards" : 2,
  "unassigned_shards" : 2
}

However, I'm unable to determine which indices/shards have a problem because the _cat/indices and _cat/shards APIs hang indefinitely. Can anyone advise on why this might be happening and how I can tackle it?

Regards,
David
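
For readers hitting the same problem: when the cat APIs hang, the cluster health API can also report per-shard status via its `level=shards` parameter. The following is a minimal sketch, not a confirmed fix for this cluster: it assumes the health endpoint still responds, and the `base_url` default, the timeout, and the function names are illustrative assumptions.

```python
import json
from urllib.request import urlopen


def problem_shards(health):
    """Return sorted (index, shard, status) tuples for every non-green shard.

    `health` is the parsed JSON body of GET /_cluster/health?level=shards.
    """
    found = []
    for index_name, index in health.get("indices", {}).items():
        for shard_id, shard in index.get("shards", {}).items():
            status = shard.get("status")
            if status != "green":
                found.append((index_name, int(shard_id), status))
    return sorted(found)


def fetch_health(base_url="http://localhost:9200", timeout=10):
    """Fetch shard-level cluster health.

    base_url is an assumption; point it at any node of the cluster.
    The timeout keeps the call from hanging the way the cat APIs do.
    """
    url = base_url + "/_cluster/health?level=shards"
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `problem_shards(fetch_health())` would list the red and yellow shards by index name and shard number, provided the request returns at all.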

Can you attach the ES logs?

Hi,
Here's what we're getting this morning:

[2017-04-05 04:30:49,460][WARN ][index.merge.scheduler    ] [node06] [index-2017.04.05][2] failed to merge
java.io.IOException: Input/output error: NIOFSIndexInput(path="/apps/elasticsearch/disk3/elasticsearch/nodes/0/indices/index-2017.04.05/2/index/_q8_es090_0.doc")
        at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:186)
        at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:342)
        at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:54)
        at org.apache.lucene.store.DataInput.readVInt(DataInput.java:126)
        at org.apache.lucene.store.BufferedIndexInput.readVInt(BufferedIndexInput.java:221)
        at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.readVIntBlock(Lucene41PostingsReader.java:126)
        at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsAndPositionsEnum.refillDocs(Lucene41PostingsReader.java:696)
        at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$BlockDocsAndPositionsEnum.nextDoc(Lucene41PostingsReader.java:752)
        at org.apache.lucene.codecs.MappingMultiDocsAndPositionsEnum.nextDoc(MappingMultiDocsAndPositionsEnum.java:104)
        at org.apache.lucene.codecs.PostingsConsumer.merge(PostingsConsumer.java:109)
        at org.apache.lucene.codecs.TermsConsumer.merge(TermsConsumer.java:164)
        at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:72)
        at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:399)
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:112)
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4163)
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3759)
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
        at org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:106)
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
Caused by: java.io.IOException: Input/output error
        at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
        at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:699)
        at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:684)
        at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:176)
        ... 18 more
[2017-04-05 04:30:49,461][WARN ][index.engine.internal    ] [node06] [index-2017.04.05][2] failed engine [merge exception]
[2017-04-05 04:30:49,970][WARN ][cluster.action.shard     ] [node06] [index-2017.04.05][2] sending failed shard for [index-2017.04.05][2], node[K8i-aoFATluUWFunOB671w], [P], s[STARTED], indexUUID [5FNalKcQR2K5F9AeNOB0Ew], reason [engine failure, message [merge exception][MergeException[java.io.IOException: Input/output error: NIOFSIndexInput(path="/apps/elasticsearch/disk3/elasticsearch/nodes/0/indices/index-2017.04.05/2/index/_q8_es090_0.doc")]; nested: IOException[Input/output error: NIOFSIndexInput(path="/apps/elasticsearch/disk3/elasticsearch/nodes/0/indices/index-2017.04.05/2/index/_q8_es090_0.doc")]; nested: IOException[Input/output error]; ]]

Regards,
David
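
The log excerpt above already names the affected shard and the file that could not be read. As a rough sketch, lines in this format can be scanned to pull out the node, index, shard number and file path for each failed merge. The regular expressions are written against the excerpt only and are assumptions about the log layout, and `failed_merges` is a hypothetical helper name.

```python
import re

# Matches e.g. "[node06] [index-2017.04.05][2] failed to merge"
FAILED_MERGE = re.compile(
    r"\[(?P<node>[^\]]+)\] \[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\] failed to merge"
)
# Matches e.g. 'Input/output error: NIOFSIndexInput(path="/some/file")'
IO_PATH = re.compile(r'Input/output error: \w+\(path="(?P<path>[^"]+)"\)')


def failed_merges(lines):
    """Collect one record per "failed to merge" line.

    The first I/O error path seen after a failure line is attached to it;
    the path stays None if no such line follows.
    """
    results = []
    pending = None
    for line in lines:
        merge = FAILED_MERGE.search(line)
        if merge:
            pending = {
                "node": merge.group("node"),
                "index": merge.group("index"),
                "shard": int(merge.group("shard")),
                "path": None,
            }
            results.append(pending)
            continue
        io_error = IO_PATH.search(line)
        if io_error and pending is not None and pending["path"] is None:
            pending["path"] = io_error.group("path")
    return results
```

Run over the excerpt, this would report node06, index-2017.04.05, shard 2 and the `_q8_es090_0.doc` path on disk3, which is the disk to inspect first.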

On node06, can you check the disk usage? On Linux:

df -h

and

df -h /apps/elasticsearch/disk3/elasticsearch/nodes/0/indices/index-2017.04.05/2/index/_q8_es090_0.doc

and also provide the iostat / sar output (preferably sar) if you have it.

df -h:
/dev/mapper/vg-root                           16G  1.9G   14G  13% /
udev                                          16G  4.0K   16G   1% /dev
tmpfs                                        6.3G  300K  6.3G   1% /run
none                                         5.0M     0  5.0M   0% /run/lock
none                                          16G     0   16G   0% /run/shm
overflow                                     1.0M  356K  668K  35% /tmp
/dev/mapper/vgsdb-lvsdb                      3.6T  1.2T  2.3T  35% /elasticsearch/disk2
/dev/mapper/vgsdc-lvsdc                      3.6T  1.2T  2.3T  35% /elasticsearch/disk3
/dev/mapper/vgsdd-lvsdd                      3.6T  1.3T  2.2T  36% /elasticsearch/disk4
/dev/mapper/vg-lvsda                         886G  567M  840G   1% /elasticsearch/disk1
/dev/sda1                                    198M   41M  147M  22% /boot

The other file you refer to is not present. I don't have access to historic iostat / sar data at the moment...

Regards,
David

I was hoping it was a disk space issue, but your df output shows it is not. With the sar data I would be able to trace back through the historical info; without it, it is hard to say what was wrong with the I/O. Is disk3 an NFS mount? Can you attach the dmesg output for the timeframe when this problem occurred?

There was a disk space issue. The root volume filled up, though not completely, because ES could still write its log file. I'll look into the dmesg output...