Help!
I am running a Kubernetes cluster with the Elastic stack on it and recently ran into this master_not_discovered_exception error.
I created the cluster using these steps:
# chart.monitaur.net
helm repo add elastic https://helm.elastic.co
helm repo add kiwigrid https://kiwigrid.github.io
helm repo update
## Bring up ES
helm install \
  --name elasticsearch \
  elastic/elasticsearch \
  --namespace escluster \
  --version 7.3.0 \
  --set resources.requests.memory='8Gi' \
  --set resources.limits.memory='12Gi' \
  --set esJavaOpts='-Xmx4g -Xms4g' \
  --set replicas=5 \
  --set volumeClaimTemplate.storageClassName='openebs-cstor-elasticsearch' \
  --set volumeClaimTemplate.resources.requests.storage='500Gi'
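For reference, the same overrides can be kept in a values file and passed with -f instead of the long list of --set flags (the filename here is arbitrary; the keys simply mirror the flags above):
# es-values.yaml is a scratch name; the keys mirror the --set flags above
cat > es-values.yaml <<'EOF'
replicas: 5
esJavaOpts: "-Xmx4g -Xms4g"
resources:
  requests:
    memory: "8Gi"
  limits:
    memory: "12Gi"
volumeClaimTemplate:
  storageClassName: "openebs-cstor-elasticsearch"
  resources:
    requests:
      storage: "500Gi"
EOF
helm install --name elasticsearch elastic/elasticsearch \
  --namespace escluster --version 7.3.0 -f es-values.yaml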
Everything ran fine for over a month; then one day data just stops appearing in one of my charts. However, all of the previous data is still searchable and retrievable in the Kibana interface.
To investigate, I log directly into the Kibana pod from the running node and curl the Elasticsearch service.
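Something like this gets me a shell in the pod (the namespace here is a guess; Kibana may well be deployed in a different one):
# open an interactive shell in the Kibana pod
kubectl -n escluster exec -it kibana-kibana-578f8 -- bash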
bash-4.2$ curl -s -v -X GET "elasticsearch-master:9200/_cat/health?v&pretty"
* About to connect() to elasticsearch-master port 9200 (#0)
* Trying 10.98.144.19...
* Connected to elasticsearch-master (10.98.144.19) port 9200 (#0)
> GET /_cat/health?v&pretty HTTP/1.1
> User-Agent: curl/7.29.0
> Host: elasticsearch-master:9200
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< content-type: application/json; charset=UTF-8
< content-length: 228
< x-envoy-upstream-service-time: 90157
< date: Sat, 05 Oct 2019 20:06:36 GMT
< server: envoy
<
{
"error" : {
"root_cause" : [
{
"type" : "master_not_discovered_exception",
"reason" : null
}
],
"type" : "master_not_discovered_exception",
"reason" : null
},
"status" : 503
}
* Connection #0 to host elasticsearch-master left intact
bash-4.2$ hostname
kibana-kibana-578f8
Looking further into the issue, I find that the disk has filled up on one of the nodes. I log into that node and delete a few extraneous things so that there is plenty of free space (+50G). I then kubectl delete pod the pods that are in CrashLoopBackOff.
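Roughly, that cleanup looked like this (the pod name is just a placeholder for whichever pods were crashing):
# list the pods stuck in CrashLoopBackOff, then delete them so they get rescheduled
kubectl -n escluster get pods | grep CrashLoopBackOff
kubectl -n escluster delete pod elasticsearch-master-2    # repeated for each crashing pod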
They are then rescheduled and everything appears to be fine in my cluster, except that logging into any of the masters and curling anything gives me the same ominous master_not_discovered_exception as above:
curl -s -v -X GET localhost:9200/_cat/indices Sat Oct 5 20:20:24 2019
* About to connect() to localhost port 9200 (#0)
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 9200 (#0)
> GET /_cat/indices HTTP/1.1
> User-Agent: curl/7.29.0
> Host: localhost:9200
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< content-type: application/json; charset=UTF-8
< content-length: 151
<
{ [data not shown]
* Connection #0 to host localhost left intact
{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}
but just curling port 9200 with no path, the server does return the standard response:
[elasticsearch@elasticsearch-master-1 ~]$ curl -X GET "localhost:9200/"
{
"name" : "elasticsearch-master-1",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "U1DK8",
"version" : {
"number" : "7.3.0",
"build_flavor" : "default",
"build_type" : "docker",
"build_hash" : "de777fa",
"build_date" : "2019-07-24T18:30:11.767338Z",
"build_snapshot" : false,
"lucene_version" : "8.1.0",
"minimum_wire_compatibility_version" : "6.8.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "You Know, for Search"
}
And so does curling port 9200 at elasticsearch-master; doing so repeatedly, I can see all of the nodes chiming in via the "name" field. For example, here is elasticsearch-master-3 responding to a request from elasticsearch-master-1:
[elasticsearch@elasticsearch-master-1 ~]$ curl -X GET "elasticsearch-master:9200/"
{
"name" : "elasticsearch-master-3",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "U1DK8",
"version" : {
"number" : "7.3.0",
"build_flavor" : "default",
"build_type" : "docker",
"build_hash" : "de777fa",
"build_date" : "2019-07-24T18:30:11.767338Z",
"build_snapshot" : false,
"lucene_version" : "8.1.0",
"minimum_wire_compatibility_version" : "6.8.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "You Know, for Search"
}
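A quick loop over the service shows the same thing (the loop count is arbitrary and the grep just pulls out the name field):
# each request can land on a different master; the "name" field shows which one answered
for i in $(seq 1 10); do
  curl -s "elasticsearch-master:9200/" | grep '"name"'
done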
This is successful from all 5 masters. And I can access all of the old data in Kibana, though this might be cached. Yet I do not seem to be able to insert new data: no saves of new objects in Kibana, and manually curling new data into the cluster does not work either.
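For example, an index request along these lines (the index and document are just a made-up test) fails the same way as the requests above:
# attempt to index a trivial document into a scratch index
curl -X POST "localhost:9200/test-index/_doc?pretty" -H 'Content-Type: application/json' -d'
{
  "test": "new document"
}
'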
Looking at the snapshot and restore page in Kibana is fruitless: it times out to a blank frame in the middle, with the same header and menu as usual on the left.
Attempting to PUT a backup repository in manually results in the same master_not_discovered_exception:
[elasticsearch@elasticsearch-master-1 ~]$ cat /tmp/backup
#!/bin/bash
curl -X PUT "localhost:9200/_snapshot/my_backup?pretty" -H 'Content-Type: application/json' -d'
{
"type": "fs",
"settings": {
"location": "/tmp/es_backup"
}
}
'
[elasticsearch@elasticsearch-master-1 ~]$ bash /tmp/backup
{
"error" : {
"root_cause" : [
{
"type" : "master_not_discovered_exception",
"reason" : null
}
],
"type" : "master_not_discovered_exception",
"reason" : null
},
"status" : 503
}
What can I do to make this cluster happy again? Or at least back up its data before I blow the whole thing away?