Elasticsearch in K8s cannot restart because of recovery failure (multiple alias write indexes)

Hello,

I have an Elasticsearch cluster deployed in a Kubernetes (k8s) cluster:

"version" : {
  "number" : "6.7.1",
  "build_flavor" : "default",
  "build_type" : "docker",
  "build_hash" : "2f32220",
  "build_date" : "2019-04-02T15:59:27.961366Z",
  "build_snapshot" : false,
  "lucene_version" : "7.7.0",
  "minimum_wire_compatibility_version" : "5.6.0",
  "minimum_index_compatibility_version" : "5.0.0"
},

I had a problem that is probably the one solved in this topic: CircuitBreakingException: [parent] Data too large, data for [<transport_request>]. Because of it I performed a full restart of the Elasticsearch cluster: I deleted all of the pods and the k8s deployments recreated them.

However, the cluster now fails to come up because of java.lang.IllegalStateException: alias [mongo] has more than one write index [mongo-2020.01.03-000033,mongo-2019.08.29-000015]

Full log of the master node here: https://pastebin.com/raw/qYpkjn9B

It seems obvious to delete one of the write indexes; however, I don't know how to do that or what type of index it is. https://github.com/elastic/elasticsearch/blob/6.7/server/src/main/java/org/elasticsearch/cluster/metadata/MetaData.java suggests that it is some kind of metadata, but my cluster state says: https://pastebin.com/zipTQ6Tk

So it seems that I have a block on metadata reads and writes.

How can I fix it? I cannot delete the data, because this is a production cluster.

Thanks for the answer

Hi @MF57,

The cluster state dump has two nodes named "es-master-*". I wonder whether you have had a split-brain situation because of this. Is discovery.zen.minimum_master_nodes set correctly to 2 for this setup, or was it ever wrong?
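For reference, the value 2 follows from the usual quorum rule for discovery.zen.minimum_master_nodes: a strict majority of the master-eligible nodes, i.e. (master_eligible_nodes / 2) + 1 with integer division. A minimal sketch of that arithmetic; the function name is only illustrative and is not part of Elasticsearch:

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Return the quorum (strict majority) of master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    # Integer division, then add one: 2 -> 2, 3 -> 2, 5 -> 3.
    return master_eligible // 2 + 1


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} master-eligible node(s) -> minimum_master_nodes = {minimum_master_nodes(n)}")
```

With only two master-eligible nodes the quorum is 2, so both must be available to elect a master. Any lower setting would allow each node to elect itself, which is how a split brain can arise.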

It looks as though one version of the cluster state has the mongo alias with mongo-2020.01.03-000033 as its write index, while the other has the mongo alias with mongo-2019.08.29-000015 as its write index. Looking at the dates in the index names, maybe an old master has been resurrected and joined the cluster after the full restart?
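To make the condition behind the exception concrete, here is a rough sketch that scans alias metadata for aliases flagged as write index on more than one index. It assumes input shaped like the response of GET /_alias (index name, then "aliases", then alias name, then an optional "is_write_index" flag). The function name and the sample dictionary are hypothetical and hand-written from the index names in the error message; they are not a dump from this cluster.

```python
from collections import defaultdict


def find_conflicting_write_indexes(alias_response: dict) -> dict:
    """Map each alias to its write indexes, keeping only aliases with more than one."""
    writers = defaultdict(list)
    for index_name, body in alias_response.items():
        for alias_name, alias_cfg in body.get("aliases", {}).items():
            # Only count indexes where the flag is explicitly true.
            if alias_cfg.get("is_write_index") is True:
                writers[alias_name].append(index_name)
    return {alias: sorted(idx) for alias, idx in writers.items() if len(idx) > 1}


if __name__ == "__main__":
    # Hypothetical sample mirroring the two index names from the exception.
    sample = {
        "mongo-2020.01.03-000033": {"aliases": {"mongo": {"is_write_index": True}}},
        "mongo-2019.08.29-000015": {"aliases": {"mongo": {"is_write_index": True}}},
        "mongo-2019.08.01-000014": {"aliases": {"mongo": {"is_write_index": False}}},
    }
    print(find_conflicting_write_indexes(sample))
```

This only helps once the alias metadata can be read at all, for example from a cluster that has formed; it does not by itself repair the persisted cluster state.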

It would also be good to see the log file from the other master node, as well as the settings (in particular minimum_master_nodes).

Hi @HenningAndersen

Thank you for your answer. discovery.zen.minimum_master_nodes is currently set to 2; however, I don't know whether it was always set to 2 in the past.

Also, since the full restart was done by deleting the k8s pods, I don't see how an old master could have been resurrected, though I may be wrong.

I am providing the full logs and configuration of the cluster (I've hidden the cluster name, but it's the same everywhere):

GET _cluster/state https://pastebin.com/ktsDxKCy

GET /_nodes https://pastebin.com/utRrzb7U

GET _cluster/settings?include_defaults=true https://pastebin.com/yR3RWGuv

Nodes:
All of the elasticsearch.yml files are the same, but the nodes have different env values (given in the comments of each config file).

es-master-6f6bf6f789-jn4rd

Config: https://pastebin.com/3uNH3mRT
Logs: https://pastebin.com/TX5nF8xY

es-master-6f6bf6f789-zwwpk

Config: https://pastebin.com/mweMTEQm
Logs: https://pastebin.com/i6Vz1UBy

es-data-0

Config: https://pastebin.com/LFwsy7Ka
Logs: https://pastebin.com/NKX21BKX

es-data-1

Config: https://pastebin.com/vJGj1Biw
Logs: https://pastebin.com/WnVj0WkU

es-client-86655db574-xlbxv

Config: https://pastebin.com/7bCyeV26
Logs: https://pastebin.com/yzZhdNpB

Hi @MF57,

I believe we need to manually fix this to get it running again. We should be able to find the UUID of the offending index by enabling trace logging (either globally or for org.elasticsearch.gateway).

Feel free to PM me the resulting log files.
