Elastic cluster went to RED State

Hi Team,

I have been running my ELK stack in Docker and have activated the 30-day free trial license.
Suddenly my Elasticsearch cluster went to RED state. JVM heap and disk space look fine.

Please find the error messages below:
[2018-08-06T18:38:36,384][ERROR][o.e.x.w.i.s.ExecutableSearchInput] [H1J9vrM] failed to execute [search] input for watch [tfRB-EFoQlyK_9HtdksYKw_xpack_license_expiration], reason [all shards failed]
[2018-08-06T18:38:36,389][ERROR][o.e.x.w.i.s.ExecutableSearchInput] [H1J9vrM] failed to execute [search] input for watch [tfRB-EFoQlyK_9HtdksYKw_elasticsearch_cluster_status], reason [all shards failed]
[2018-08-06T18:38:36,392][ERROR][o.e.x.w.i.s.ExecutableSearchInput] [H1J9vrM] failed to execute [search] input for watch [tfRB-EFoQlyK_9HtdksYKw_logstash_version_mismatch], reason [all shards failed]
[2018-08-06T18:38:36,386][ERROR][o.e.x.w.i.s.ExecutableSearchInput] [H1J9vrM] failed to execute [search] input for watch [tfRB-EFoQlyK_9HtdksYKw_kibana_version_mismatch], reason [all shards failed]
[2018-08-06T18:38:36,403][ERROR][o.e.x.w.i.s.ExecutableSearchInput] [H1J9vrM] failed to execute [search] input for watch [tfRB-EFoQlyK_9HtdksYKw_elasticsearch_nodes], reason [all shards failed]
[2018-08-06T18:38:36,403][ERROR][o.e.x.w.i.s.ExecutableSearchInput] [H1J9vrM] failed to execute [search] input for watch [tfRB-EFoQlyK_9HtdksYKw_elasticsearch_version_mismatch], reason [all shards failed]
[2018-08-06T18:38:43,828][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [H1J9vrM] collector [cluster_stats] failed to collect data
[2018-08-06T18:38:43,844][ERROR][o.e.x.m.c.m.JobStatsCollector] [H1J9vrM] collector [job_stats] failed to collect data
[2018-08-06T18:38:52,351][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [H1J9vrM] collector [cluster_stats] failed to collect data
[2018-08-06T18:38:52,429][ERROR][o.e.x.m.c.m.JobStatsCollector] [H1J9vrM] collector [job_stats] failed to collect data
[2018-08-06T18:39:04,399][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [H1J9vrM] collector [cluster_stats] failed to collect data
[2018-08-06T18:39:04,414][ERROR][o.e.x.m.c.m.JobStatsCollector] [H1J9vrM] collector [job_stats] failed to collect data
[2018-08-06T18:39:13,818][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [H1J9vrM] collector [cluster_stats] failed to collect data
[2018-08-06T18:39:13,839][ERROR][o.e.x.m.c.m.JobStatsCollector] [H1J9vrM] collector [job_stats] failed to collect data
[2018-08-06T18:39:22,643][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [H1J9vrM] collector [cluster_stats] failed to collect data
[2018-08-06T18:39:22,665][ERROR][o.e.x.m.c.m.JobStatsCollector] [H1J9vrM] collector [job_stats] failed to collect data
[2018-08-06T18:39:34,622][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [H1J9vrM] collector [cluster_stats] failed to collect data
[2018-08-06T18:39:34,643][ERROR][o.e.x.m.c.m.JobStatsCollector] [H1J9vrM] collector [job_stats] failed to collect data
[2018-08-06T18:39:35,960][ERROR][o.e.x.w.i.s.ExecutableSearchInput] [H1J9vrM] failed to execute [search] input for watch [tfRB-EFoQlyK_9HtdksYKw_elasticsearch_version_mismatch], reason [all shards failed]
[2018-08-06T18:39:35,961][ERROR][o.e.x.w.i.s.ExecutableSearchInput] [H1J9vrM] failed to execute [search] input for watch [tfRB-EFoQlyK_9HtdksYKw_kibana_version_mismatch], reason [all shards failed]
[2018-08-06T18:39:35,963][ERROR][o.e.x.w.i.s.ExecutableSearchInput] [H1J9vrM] failed to execute [search] input for watch [tfRB-EFoQlyK_9HtdksYKw_elasticsearch_cluster_status], reason [all shards failed]
[2018-08-06T18:39:35,964][ERROR][o.e.x.w.i.s.ExecutableSearchInput] [H1J9vrM] failed to execute [search] input for watch [tfRB-EFoQlyK_9HtdksYKw_logstash_version_mismatch], reason [all shards failed]
[2018-08-06T18:39:35,967][ERROR][o.e.x.w.i.s.ExecutableSearchInput] [H1J9vrM] failed to execute [search] input for watch [tfRB-EFoQlyK_9HtdksYKw_elasticsearch_nodes], reason [all shards failed]
[2018-08-06T18:39:36,461][ERROR][o.e.x.w.i.s.ExecutableSearchInput] [H1J9vrM] failed to execute [search] input for watch [tfRB-EFoQlyK_9HtdksYKw_xpack_license_expiration], reason [all shards failed]
[2018-08-06T18:39:44,033][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [H1J9vrM] collector [cluster_stats] failed to collect data
[2018-08-06T18:39:44,114][ERROR][o.e.x.m.c.m.JobStatsCollector] [H1J9vrM] collector [job_stats] failed to collect data
[2018-08-06T18:39:54,170][ERROR][o.e.x.m.c.c.ClusterStatsCollector] [H1J9vrM] collector [cluster_stats] failed to collect data
[2018-08-06T18:39:54,183][ERROR][o.e.x.m.c.m.JobStatsCollector] [H1J9vrM] collector [job_stats] failed to collect data

Please find the monitoring status in the screenshot below.

Could you please help me fix this issue?

I don't think there is enough information to help you here. If you "dockerized" your ES instance, you should provide details on the volume mounts and whether the data is persistent or ephemeral.

First, research the causes of red cluster health: https://www.elastic.co/guide/en/elasticsearch/reference/current/_cluster_health.html

Then work through everything that could have gone wrong and you'll likely find the cause. My guess is it has something to do with the volume mount for the data disk, but that's just a guess.
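If it helps, these are the calls I usually start with (a minimal sketch, assuming Elasticsearch is reachable on localhost:9200):

curl -s 'http://localhost:9200/_cluster/health?pretty'
curl -s 'http://localhost:9200/_cat/indices?v&health=red'
curl -s 'http://localhost:9200/_cat/shards?v' | grep UNASSIGNED

The first shows overall cluster health, the second lists only the red indices, and the third lists the shards that are not allocated to any node.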

Thanks for the response.

I have 1 TB of mount space for storing Elasticsearch data, and only 37 GB is currently used.
I suspect X-Pack may be involved; I can see some errors related to it.

[2018-08-06T18:39:35,960][ERROR][o.e.x.w.i.s.ExecutableSearchInput] [H1J9vrM] failed to execute [search] input for watch [tfRB-EFoQlyK_9HtdksYKw_elasticsearch_version_mismatch], reason [all shards failed]
[2018-08-06T18:39:35,961][ERROR][o.e.x.w.i.s.ExecutableSearchInput] [H1J9vrM] failed to execute [search] input for watch [tfRB-EFoQlyK_9HtdksYKw_kibana_version_mismatch], reason [all shards failed]
[2018-08-06T18:39:35,963][ERROR][o.e.x.w.i.s.ExecutableSearchInput] [H1J9vrM] failed to execute [search] input for watch [tfRB-EFoQlyK_9HtdksYKw_elasticsearch_cluster_status], reason [all shards failed]
[2018-08-06T18:39:35,964][ERROR][o.e.x.w.i.s.ExecutableSearchInput] [H1J9vrM] failed to execute [search] input for watch [tfRB-EFoQlyK_9HtdksYKw_logstash_version_mismatch], reason [all shards failed]
[2018-08-06T18:39:35,967][ERROR][o.e.x.w.i.s.ExecutableSearchInput] [H1J9vrM] failed to execute [search] input for watch [tfRB-EFoQlyK_9HtdksYKw_elasticsearch_nodes], reason [all shards failed]
[2018-08-06T18:39:36,461][ERROR][o.e.x.w.i.s.ExecutableSearchInput] [H1J9vrM] failed to execute [search] input for watch [tfRB-EFoQlyK_9HtdksYKw_xpack_license_expiration], reason [all shards failed]

Hi @mikesparr,

Also, I can see a large number of "unassigned_shards" and very few "active_shards" in every index.
Is there any way to fix this without data loss?

Please find the per-index health status below:

{
"status":"red",
"number_of_shards":5,
"number_of_replicas":1,
"active_primary_shards":0,
"active_shards":0,
"relocating_shards":0,
"initializing_shards":0,
"unassigned_shards":10,
"shards":{
"0":{
"status":"red",
"primary_active":false,
"active_shards":0,
"relocating_shards":0,
"initializing_shards":0,
"unassigned_shards":2
},
"1":{
"status":"red",
"primary_active":false,
"active_shards":0,
"relocating_shards":0,
"initializing_shards":0,
"unassigned_shards":2
},
"2":{
"status":"red",
"primary_active":false,
"active_shards":0,
"relocating_shards":0,
"initializing_shards":0,
"unassigned_shards":2
},
"3":{
"status":"red",
"primary_active":false,
"active_shards":0,
"relocating_shards":0,
"initializing_shards":0,
"unassigned_shards":2
},
"4":{
"status":"red",
"primary_active":false,
"active_shards":0,
"relocating_shards":0,
"initializing_shards":0,
"unassigned_shards":2
}
}
}

"stores_fopr_cri_noncritical_status-2018.07.07":
{
"status":"red",
"number_of_shards":5,
"number_of_replicas":1,
"active_primary_shards":0,
"active_shards":0,
"relocating_shards":0,
"initializing_shards":0,
"unassigned_shards":10,
"shards":{
"0":{
"status":"red",
"primary_active":false,
"active_shards":0,
"relocating_shards":0,
"initializing_shards":0,
"unassigned_shards":2
},
"1":{
"status":"red",
"primary_active":false,
"active_shards":0,
"relocating_shards":0,
"initializing_shards":0,
"unassigned_shards":2
},
"2":{
"status":"red",
"primary_active":false,
"active_shards":0,
"relocating_shards":0,
"initializing_shards":0,
"unassigned_shards":2
},
"3":{
"status":"red",
"primary_active":false,
"active_shards":0,
"relocating_shards":0,
"initializing_shards":0,
"unassigned_shards":2
},
"4":{
"status":"red",
"primary_active":false,
"active_shards":0,
"relocating_shards":0,
"initializing_shards":0,
"unassigned_shards":2
}
}
}
"stores_fopr_cri_noncritical_status-2018.07.16":
{
"status":"yellow",
"number_of_shards":5,
"number_of_replicas":1,
"active_primary_shards":5,
"active_shards":5,
"relocating_shards":0,
"initializing_shards":0,
"unassigned_shards":5,
"shards":{
"0":{
"status":"yellow",
"primary_active":true,
"active_shards":1,
"relocating_shards":0,
"initializing_shards":0,
"unassigned_shards":1
},
"1":{
"status":"yellow",
"primary_active":true,
"active_shards":1,
"relocating_shards":0,
"initializing_shards":0,
"unassigned_shards":1
},
"2":{
"status":"yellow",
"primary_active":true,
"active_shards":1,
"relocating_shards":0,
"initializing_shards":0,
"unassigned_shards":1
},
"3":{
"status":"yellow",
"primary_active":true,
"active_shards":1,
"relocating_shards":0,
"initializing_shards":0,
"unassigned_shards":1
},
"4":{
"status":"yellow",
"primary_active":true,
"active_shards":1,
"relocating_shards":0,
"initializing_shards":0,
"unassigned_shards":1
}
}
}

Could you please help me out with this!

I wonder if you have a single shared persistent storage that all nodes are writing to, instead of each having its own volume claim. Since multiple nodes would all try to write to the same paths, if they shared a single persistent store you could get overwrites, likely resulting in data loss.

Your Kibana dashboard displayed "1" node, so if that is the case you might be okay. The files are probably stored in /var/lib/elasticsearch, so perhaps it's worth spinning up a new node with different discovery settings so it doesn't try to cluster with your other node. Then copy the files into its filesystem and see if they recover.
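Something along these lines might work for that standalone recovery node (a rough sketch only; I'm assuming the official 6.x image, a copy of the old data directory at /tmp/es-recovery on the host, and port 9201 to avoid clashing with the existing container):

docker run -d --name es-recovery \
  -p 9201:9200 \
  -e "discovery.type=single-node" \
  -e "cluster.name=recovery-cluster" \
  -v /tmp/es-recovery:/usr/share/elasticsearch/data \
  docker.elastic.co/elasticsearch/elasticsearch:6.3.2

The single-node discovery setting stops it from trying to join your existing cluster, and the volume mount points it at the copied data so you can check whether the indices come back.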

Other than that, the gurus at Elastic may have better ideas how to recover.

Your cluster is red because some of your indices are red.
Your indices are red because they are missing shards (primaries).

The most likely explanation for why this could happen is that your cluster knows that it's supposed to have data for those shards, but it simply doesn't have it any more. That can happen if you delete files, have a disk failure, or take nodes offline without ensuring you have sufficient replicas.

You can use the cluster allocation explain API (https://www.elastic.co/guide/en/elasticsearch/reference/6.3/cluster-allocation-explain.html) to try and get more info about those shards, but I suspect the answer is going to be that the cluster state knows about those indices, but none of the connected nodes have the data.
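For example (a minimal sketch; I've picked one of the index names from your output, but any red index and shard number works):

curl -XGET "http://localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
{
  "index": "stores_fopr_cri_noncritical_status-2018.07.07",
  "shard": 0,
  "primary": true
}'

The "unassigned_info" section of the response tells you why that shard is not allocated.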

I think you're going to need to track back to your "Suddenly my elastic cluster went to RED State". What triggered that? Did a node shut down, did a disk fail, did a critical directory get deleted?

@TimV, @mikesparr

The worst-case scenario: my single-node cluster currently has around 11k unassigned shards. :disappointed: So, as an alternative, I followed the steps below.

Step 1. Created a template (changed the default number of shards to one):

curl -XPUT "http://localhost:9200/_template/template_2" -H 'Content-Type: application/json' -d'
{
"index_patterns" : ["stores*"],
"settings" : {
"number_of_shards" : 1
}
}'

Step 2. Executed the _reindex API so the new template would apply. The target index was created with 1 shard and 1 replica, but its status is "RED" with zero documents. :weary:

red open stores_status-2018.08.08-1 Zb4_ciMxRmevIHmZFzn7dw 1 1

curl -XPOST "http://localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "stores_status-2018.08.08"
  },
  "dest": {
    "index": "stores_status-2018.08.08-1"
  }
}'

My Cluster Health:
{
"cluster_name" : "fopr-cluster",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 2456,
"active_shards" : 2456,
"relocating_shards" : 0,
"initializing_shards" : 4,
"unassigned_shards" : 11159,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 20,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 1100848,
"active_shards_percent_as_number" : 18.033629488214995
}

How do I proceed from here? :disappointed: Could someone please help me fix this issue?

Hi Team, @magnusbaeck, @mikesparr, @TimV

I used the /_cluster/settings routing allocation API and deleted all the unassigned shards to bring the node back to a normal (yellow) state. But I lost all my existing visualizations and dashboards. :disappointed:

curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d'
{
  "transient" : {
    "cluster.routing.allocation.enable" : "all"
  }
}'

So before moving to the production server, I would like to understand the proper Elasticsearch cluster setup. Could you please help me with this?

Please find our ELK server details as below,

OS: Red Hat 7
RAM: 62 GB
CPU: 16 cores
Space allocated for Elasticsearch data: 1 TB
Elasticsearch nodes: single node
Log retention period: 3 months
Indices created daily: 30 to 40

How many shards and replicas can I have in my single-node Elasticsearch cluster? Also, I'm currently running everything in Docker. Is it feasible to run another container to add one more Elasticsearch node on the same server?

You need to work out what happened to your node to get it into this state. This isn't something that happens randomly, and unless you find out what caused it, it is likely to happen again.
Something caused your cluster to lose shards. Since you've deleted them it's not going to be easy to get any more information about what was going on, but you might find something useful in the logs.

Single node clusters are never a good idea. You cannot have any replicas. If you get an index corruption then you will not have any way to recover, and if your single node goes down then your cluster is completely unavailable.
Our very strong recommendation is to have a minimum of 3 nodes in a cluster.
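On the Docker part of your question: you can run additional Elasticsearch containers on the same host and have them form a cluster, although separate hosts are safer. A rough sketch, assuming the official 6.3.2 image and a user-defined Docker network so the containers can resolve each other by name (the host also normally needs vm.max_map_count=262144):

docker network create esnet

# repeat for es02 and es03, changing --name, node.name, and the data volume
docker run -d --name es01 --network esnet -p 9200:9200 \
  -e "cluster.name=fopr-cluster" \
  -e "node.name=es01" \
  -e "discovery.zen.ping.unicast.hosts=es01,es02,es03" \
  -e "discovery.zen.minimum_master_nodes=2" \
  -v esdata01:/usr/share/elasticsearch/data \
  docker.elastic.co/elasticsearch/elasticsearch:6.3.2

Keep in mind that three containers on one server still share the same disk and the same physical machine, so this protects against index corruption (through replicas) but not against the server itself failing.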

To get more detail on what the problem was, repeat your steps until you reach the "unassigned shards" state and run the _cluster/allocation/explain?pretty API.
It tells you the reason for the unassigned shards.

It seems to me that you have far too many small indices and shards, which is very inefficient. Please read this blog post about shards and sharding practices. Creating 30-40 daily indices seems excessive given the data volumes you have.
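As an illustration (a sketch only; the template name is made up and I'm assuming your daily indices keep matching the stores* pattern): on a single node you could default new indices to one primary shard and zero replicas, since a replica can never be allocated on the same node as its primary anyway.

curl -XPUT "http://localhost:9200/_template/stores_single_node" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["stores*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}'

That alone would cut the shard count per daily index from 10 (5 primaries + 5 replicas) to 1.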
