Elasticsearch cluster health is fluctuating between yellow and red

Hi folks,

We are facing a strange problem on one of our Elasticsearch servers: the cluster health is constantly fluctuating between red and yellow. We need help identifying the root cause. Is it a corruption issue, a misconfiguration, or something else?

The ES version is 5.6. Both ES and the application run on the same server.

The cluster state is as follows:

{
  "cluster_name" : "test-cluster",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 4,
  "active_shards" : 4,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 6,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 40.0
}
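(For reference, output like the above comes from the cluster health API; on a default local install it can be fetched with something like the following. The host and port are assumptions, not taken from this thread.)

```shell
# Fetch overall cluster health; add level=indices to see which index is red
curl -s 'http://localhost:9200/_cluster/health?pretty'
curl -s 'http://localhost:9200/_cluster/health?level=indices&pretty'
```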

Please help.

Looks like it is a single node cluster, is that correct? What type of storage are you using?

Yes, it's a single-node cluster. We are using normal server HDDs.

What is in the Elasticsearch logs?

The cluster log shows the following error:

[2019-10-09T21:36:23,484][ERROR][o.e.i.e.InternalEngine$EngineMergeScheduler] [scbuilds4u-node-01] [builds4u_v09][2] failed to merge
java.lang.IllegalStateException: this writer hit an unrecoverable error; cannot complete merge

	at org.apache.lucene.index.IndexWriter.commitMerge(IndexWriter.java:3740) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4513) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3931) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]

The problematic shard is:

v09 3 p STARTED 2108269 1.1gb 10.170.61.12 node-01
v09 3 r UNASSIGNED
v09 4 p STARTED 2124252 1.3gb 10.170.61.12 node-01
v09 4 r UNASSIGNED
v09 2 p UNASSIGNED
v09 2 r UNASSIGNED
v09 1 p STARTED 2141168 1.1gb 10.170.61.12 node-01
v09 1 r UNASSIGNED
v09 0 p STARTED 2138709 1gb 10.170.61.12 node-01
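(A listing like the one above is typically produced by the cat shards API. The host, port, and exact index name are assumptions; index names vary slightly across this thread.)

```shell
# List shard allocation state for the index; ?v adds column headers
curl -s 'http://localhost:9200/_cat/shards/v09?v'
```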

How can we make it assigned again?

What does the cluster allocation explain API tell you about this missing primary shard? How much disk space do you have left?
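A request to that API for the missing primary looks something like the following (localhost is an assumption; the index name and shard number are taken from the shard listing above):

```shell
# Ask the cluster to explain why this primary shard copy is unassigned
curl -s -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty' \
  -H 'Content-Type: application/json' -d'
{
  "index": "4u_v09",
  "shard": 2,
  "primary": true
}'
```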

Disk space details:

[root@web01 4u]# du -sh /var/lib/elasticsearch/
5.8G /var/lib/elasticsearch/

[root@4u]# df -Ph /var/lib/elasticsearch/
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/cl-root 50G 20G 31G 40% /

Response of the cluster allocation explain API:

{
  "index" : "4u_v09",
  "shard" : 3,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2019-10-10T05:36:19.012Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "r5PCzH0_SUCOPGjP8KgcKg",
      "node_name" : "node-01",
      "transport_address" : "10.120.61.12:9300",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[4u_v09][3], node[r5PCzH0_SUCOPGjP8KgcKg], [P], s[STARTED], a[id=wcJbPOn6QSWrS2Pv5NkhyQ]]"
        }
      ]
    }
  ]
}
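(As the same_shard decider in this output indicates, replica copies can never be assigned on a single-node cluster, since a replica may not live on the same node as its primary. If no second node is planned, one option is to drop the replica count to 0 so the replicas stop showing as unassigned. This is a hedged sketch; host and port are assumptions.)

```shell
# On a single-node cluster, replicas can never be allocated; remove them
curl -s -XPUT 'http://localhost:9200/4u_v09/_settings' \
  -H 'Content-Type: application/json' -d'
{
  "index": { "number_of_replicas": 0 }
}'
```

Note this only clears the unassigned replicas; it does not fix the unassigned primary of shard 2.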

I think we need to see the full error message from the logs. You've only shared the top few lines for some reason. Without the whole error message we can only speculate about the problem.

Is there a way to attach a log file?

I tried to paste the entire log, but the character limit doesn't allow it. Is this helpful?

[2019-10-09T21:36:23,484][ERROR][o.e.i.e.InternalEngine$EngineMergeScheduler] [4u-node-01] [v09][2] failed to merge
java.lang.IllegalStateException: this writer hit an unrecoverable error; cannot complete merge
	at org.apache.lucene.index.IndexWriter.commitMerge(IndexWriter.java:3740) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4513) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3931) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:99) ~[elasticsearch-5.6.0.jar:5.6.0]
	at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:661) [lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
Caused by: java.lang.RuntimeException: java.io.EOFException: read past EOF: MMapIndexInput(path="/var/lib/elasticsearch/nodes/0/indices/eiHi9IlBRZGrUmkgFQVxDQ/2/index/_2ju6.cfs") [slice=_2ju6_Lucene54_0.dvd] [slice=var-binary]
	at org.apache.lucene.codecs.lucene54.Lucene54DocValuesProducer$6.get(Lucene54DocValuesProducer.java:740) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.codecs.lucene54.Lucene54DocValuesProducer$LongBinaryDocValues.get(Lucene54DocValuesProducer.java:1197) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.codecs.lucene54.Lucene54DocValuesProducer$7.lookupOrd(Lucene54DocValuesProducer.java:804) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.SortedDocValuesTermsEnum.next(SortedDocValuesTermsEnum.java:83) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.FilteredTermsEnum.next(FilteredTermsEnum.java:224) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.MultiTermsEnum.reset(MultiTermsEnum.java:113) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.MultiDocValues$OrdinalMap.<init>(MultiDocValues.java:552) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.MultiDocValues$OrdinalMap.build(MultiDocValues.java:511) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.codecs.DocValuesConsumer.mergeSortedSetField(DocValuesConsumer.java:808) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:221) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:153) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:167) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:111) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4356) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	... 4 more
Caused by: java.io.EOFException: read past EOF: MMapIndexInput(path="/var/lib/elasticsearch/nodes/0/indices/eiHi9IlBRZGrUmkgFQVxDQ/2/index/_2ju6.cfs") [slice=_2ju6_Lucene54_0.dvd] [slice=var-binary]
	at org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:98) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.codecs.lucene54.Lucene54DocValuesProducer$6.get(Lucene54DocValuesProducer.java:736) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.codecs.lucene54.Lucene54DocValuesProducer$LongBinaryDocValues.get(Lucene54DocValuesProducer.java:1197) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.codecs.lucene54.Lucene54DocValuesProducer$7.lookupOrd(Lucene54DocValuesProducer.java:804) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.SortedDocValuesTermsEnum.next(SortedDocValuesTermsEnum.java:83) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.FilteredTermsEnum.next(FilteredTermsEnum.java:224) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.MultiTermsEnum.reset(MultiTermsEnum.java:113) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.MultiDocValues$OrdinalMap.<init>(MultiDocValues.java:552) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.index.MultiDocValues$OrdinalMap.build(MultiDocValues.java:511) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]
	at org.apache.lucene.codecs.DocValuesConsumer.mergeSortedSetField(DocValuesConsumer.java:808) ~[lucene-core-6.6.0.jar:6.6.0 5c7a7b65d2aa7ce5ec96458315c661a18b320241 - ishan - 2017-05-30 07:29:46]

Yes, that's helpful, thanks. I reformatted your post to make it possible to read. This exception indicates that the shard is corrupt. I would recommend deleting the index and restoring it from a snapshot. Did you recently have a power outage or other sudden shutdown?
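A restore along those lines might look like the following. The repository and snapshot names here are placeholders, not taken from this thread; substitute the names of your own registered repository and snapshot.

```shell
# Delete the corrupt index, then restore it from a snapshot.
# 'my_backup' and 'snapshot_1' are hypothetical names.
curl -s -XDELETE 'http://localhost:9200/4u_v09'
curl -s -XPOST 'http://localhost:9200/_snapshot/my_backup/snapshot_1/_restore' \
  -H 'Content-Type: application/json' -d'
{
  "indices": "4u_v09"
}'
```

If no snapshot exists, the data in that shard cannot be recovered from the corrupt segment files, which is one more reason to take regular snapshots.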