Healthy cluster is completely hosed by one node failing

Help!

I am running a Kubernetes cluster with Elasticsearch on it and recently ran into this master_not_discovered_exception error.
I created the cluster using these steps:

  # chart.monitaur.net
  helm repo add elastic https://helm.elastic.co
  helm repo add kiwigrid https://kiwigrid.github.io
  helm repo update

  ## Bring up ES
  helm install \
    --name elasticsearch \
    elastic/elasticsearch \
    --namespace escluster \
    --version 7.3.0 \
    --set resources.requests.memory='8Gi' \
    --set resources.limits.memory='12Gi' \
    --set esJavaOpts='-Xmx4g -Xms4g' \
    --set replicas=5 \
    --set volumeClaimTemplate.storageClassName='openebs-cstor-elasticsearch' \
    --set volumeClaimTemplate.resources.requests.storage='500Gi'

Everything was running fine for over a month, then one day new data just stops appearing in one of my charts. However, all of the previous data is still searchable and retrievable in the Kibana interface.

To investigate, I log directly into the Kibana pod from the running node:

bash-4.2$ curl -s -v -X GET "elasticsearch-master:9200/_cat/health?v&pretty"
* About to connect() to elasticsearch-master port 9200 (#0)
*   Trying 10.98.144.19...
* Connected to elasticsearch-master (10.98.144.19) port 9200 (#0)
> GET /_cat/health?v&pretty HTTP/1.1
> User-Agent: curl/7.29.0
> Host: elasticsearch-master:9200
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< content-type: application/json; charset=UTF-8
< content-length: 228
< x-envoy-upstream-service-time: 90157
< date: Sat, 05 Oct 2019 20:06:36 GMT
< server: envoy
<
{
  "error" : {
    "root_cause" : [
      {
        "type" : "master_not_discovered_exception",
        "reason" : null
      }
    ],
    "type" : "master_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
}
* Connection #0 to host elasticsearch-master left intact
bash-4.2$ hostname
kibana-kibana-578f8

Looking further into the issue, I find that the disk has filled up on one of the nodes. I log into that node and delete a few extraneous things so that there is plenty of free space (+50G), then kubectl delete pod the pods that are in CrashLoopBackOff (roughly as sketched below). They are rescheduled and everything appears to be fine in my cluster, except that logging into any of the masters and curling anything gives me the same ominous master_not_discovered_exception as above.
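For reference, the cleanup looked roughly like the following; the pod name here is a placeholder rather than the exact one from my cluster:

  # see which pods are stuck in CrashLoopBackOff (namespace from the helm install above)
  kubectl -n escluster get pods -o wide

  # after freeing ~50G on the affected node, delete the crashing pods so the
  # StatefulSet reschedules them
  kubectl -n escluster delete pod elasticsearch-master-2

After that, from inside one of the master pods: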

curl -s -v -X GET localhost:9200/_cat/indices                                                                           Sat Oct  5 20:20:24 2019
* About to connect() to localhost port 9200 (#0)
*   Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 9200 (#0)
> GET /_cat/indices HTTP/1.1
> User-Agent: curl/7.29.0
> Host: localhost:9200
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< content-type: application/json; charset=UTF-8
< content-length: 151
<
{ [data not shown]
* Connection #0 to host localhost left intact
{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}

But just curling port 9200 with no path, the server does return the standard banner:

[elasticsearch@elasticsearch-master-1 ~]$ curl -X GET "localhost:9200/"
{
  "name" : "elasticsearch-master-1",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "U1DK8",
  "version" : {
    "number" : "7.3.0",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "de777fa",
    "build_date" : "2019-07-24T18:30:11.767338Z",
    "build_snapshot" : false,
    "lucene_version" : "8.1.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

So does curling port 9200 at elasticsearch-master, and by repeating it I can see from the "name" field that all of the nodes are answering behind the service.
For example, here is elasticsearch-master-3 responding to a request made from elasticsearch-master-1:

[elasticsearch@elasticsearch-master-1 ~]$ curl -X GET "elasticsearch-master:9200/"
{
  "name" : "elasticsearch-master-3",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "U1DK8",
  "version" : {
    "number" : "7.3.0",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "de777fa",
    "build_date" : "2019-07-24T18:30:11.767338Z",
    "build_snapshot" : false,
    "lucene_version" : "8.1.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

This is successful from all 5 masters, and I can access all of the old data in Kibana, though this might be cached. Yet I do not seem to be able to insert new data: no saves of new objects in Kibana, and manually curling new data into the cluster does not work either.
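For example, even a trivial index request like this one (the index name and document are purely illustrative, not my real data) never gets indexed:

  # try to index a throwaway document; with no elected master this write does not go through
  curl -X POST "localhost:9200/test-index/_doc?pretty" \
    -H 'Content-Type: application/json' \
    -d '{ "message": "hello" }'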

The Snapshot and Restore page in Kibana is fruitless as well; it times out to a blank frame in the middle, with the usual header and menu still on the left.

Attempting to PUT a backup repository manually results in the same master_not_discovered_exception:

[elasticsearch@elasticsearch-master-1 ~]$ cat /tmp/backup 
#!/bin/bash
curl -X PUT "localhost:9200/_snapshot/my_backup?pretty" -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": {
    "location": "/tmp/es_backup"
  }
}
'
[elasticsearch@elasticsearch-master-1 ~]$ bash /tmp/backup
{
  "error" : {
    "root_cause" : [
      {
        "type" : "master_not_discovered_exception",
        "reason" : null
      }
    ],
    "type" : "master_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
}

What can I do to make this cluster happy again? Or at least back up its data before I blow the whole thing away?

Further investigation in the logs reveals lines like:

"Caused by: java.nio.file.FileSystemException: /usr/share/elasticsearch/data/nodes/0/_state/manifest-3937325.st.tmp: Read-only file system",

Logging into the host where that master is running:

# echo test > /test1 && cat /test1 && ls -lh /test1 && rm /test1
test
-rw-r--r-- 1 root root 5 Oct  5 23:45 /test1

So that filesystem is writable and has plenty of free space.

Logging into the container itself (docker exec -it 34bc /bin/bash) is interesting:

[elasticsearch@elasticsearch-master-4 ~]$ touch /usr/share/elasticsearch/data/nodes/0/_state/manifest-3937325.st.tmp
[elasticsearch@elasticsearch-master-4 ~]$ echo one >> /usr/share/elasticsearch/data/nodes/0/_state/manifest-3937325.st.tmp
[elasticsearch@elasticsearch-master-4 ~]$ cat /usr/share/elasticsearch/data/nodes/0/_state/manifest-3937325.st.tmp
cat: /usr/share/elasticsearch/data/nodes/0/_state/manifest-3937325.st.tmp: No such file or directory

No errors, but anything I create in that _state directory just disappears? One directory up works fine:

[elasticsearch@elasticsearch-master-4 ~]$ echo one >> /usr/share/elasticsearch/data/nodes/0/manifest-3937325.st.tmp
[elasticsearch@elasticsearch-master-4 ~]$ ls /usr/share/elasticsearch/data/nodes/0/manifest-3937325.st.tmp
/usr/share/elasticsearch/data/nodes/0/manifest-3937325.st.tmp
[elasticsearch@elasticsearch-master-4 ~]$ cat /usr/share/elasticsearch/data/nodes/0/manifest-3937325.st.tmp
one

So the filesystem is most certainly not read-only, but that _state directory is funny...

[elasticsearch@elasticsearch-master-4 ~]$ ls -alh /usr/share/elasticsearch/data/nodes/0                        
total 20K
drwxrwsr-x 4 elasticsearch elasticsearch 4.0K Oct  5 23:50 .
drwxrwsr-x 3 elasticsearch elasticsearch 4.0K Aug 22 18:33 ..
drwxrwsr-x 2 elasticsearch elasticsearch 4.0K Oct  5 23:52 _state
drwxrwsr-x 9 elasticsearch elasticsearch 4.0K Sep  7 01:11 indices
-rw-rw-r-- 1 elasticsearch elasticsearch    4 Oct  5 23:50 manifest-3937325.st.tmp
-rw-rw-r-- 1 elasticsearch elasticsearch    0 Aug 22 18:33 node.lock

What happens if you do the same experiment wherever /usr/share/elasticsearch/data/nodes/0/_state/ exists on the host filesystem?
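Something along these lines, for instance; the namespace comes from your install above, and <pv-name> / <mountpoint> are placeholders for whatever kubectl and mount actually report:

  # find the PVC/PV backing that master's data volume
  kubectl -n escluster get pvc
  kubectl get pv <pv-name> -o yaml

  # then, on the node where that master pod is scheduled, locate the mount
  # and repeat the write test inside the _state directory
  mount | grep <pv-name>
  echo test > <mountpoint>/nodes/0/_state/test && cat <mountpoint>/nodes/0/_state/test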

Also, when you are getting a master_not_discovered_exception there isn't much useful information available via the APIs, because most APIs won't do much without a master. The logs are the place to look for details. If you need more help understanding the information in the logs, then please share more logs!
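For example, grabbing the full log from each of the five masters (plus the previous container's log for any pod that has restarted) would give us the most to go on; something like:

  for i in 0 1 2 3 4; do
    kubectl -n escluster logs elasticsearch-master-$i > elasticsearch-master-$i.log
    # --previous only works if the container has restarted, hence the 2>/dev/null
    kubectl -n escluster logs elasticsearch-master-$i --previous > elasticsearch-master-$i-previous.log 2>/dev/null
  done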

Well, I have already blown the entire cluster away; I wish I had saved all the logs.

But I could not find anywhere that the filesystem was actually read-only. The thing that gets me is that there were 4 more master pods on separate hosts, and none of their hosts' filesystems ever had any issues. So why was the entire cluster infected with read-only errors in the logs?

Elasticsearch has distributed exception handling, propagating exceptions from one node to another, so maybe that explains it? Without the logs we can only really speculate on what was actually going on, and therefore cannot say how we might avoid it in future.
