Cluster recovery inside K8s

Hi all! I've spent the last month or so playing with the Elastic Stack for my side project site. It's just a Flask/React app with virtually zero traffic, so nothing particularly fancy, but I like playing with new tech. So I'm deploying the Helm charts on a small k8s cluster.

Now, I don't particularly want to run three 4 GB servers to monitor my one-server Flask app. After all, if it fails, I really don't care that much. But my problem is that if I lose quorum, I have NO IDEA how to start the cluster back up. I've seen this with 3-node and 1-node clusters: if I lose quorum (and with all the experimentation with Helm charts and Terraform, I'm replacing instances all the time), the ES cluster never comes back up. I had assumed this was expected behavior, and was thinking about trying to run five really small servers to make it happen less often. But according to https://twitter.com/paulbecotte/status/1212422340736360449, a 1-node cluster should be able to recover from a reboot without intervention.

So just now I went in and deleted my ES pod. I am using the elastic/elasticsearch Helm chart @ 7.5.1, so it's deployed as a StatefulSet (deleting the pod does not delete the data). Kubernetes brings a new pod up immediately on the existing disk. The new node logs a bit and then stops and sits indefinitely (the same behavior I noticed before).
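For reference, this is roughly how I'm poking at it (the pod name comes from the chart defaults; adjust for your own values):

    # Delete the pod; the StatefulSet recreates it on the same PersistentVolumeClaim
    kubectl delete pod elasticsearch-master-0

    # Watch the replacement start; it reaches Running but never becomes Ready
    kubectl get pods -w

    # The PVC (and so the data) survives the pod deletion
    kubectl get pvc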

I am including most of the logs that look relevant (the big chunk I cut from the middle is just loading modules). Also, this is my side project, so you can see the full code I used to bring this up at https://gitlab.com/devblog/infrastructure/blob/master/argocharts/elastic/values.yaml#L1. It's basically the off-the-shelf Helm chart with some config copied from the "security" example. The previous times this happened, I fixed it by nuking the whole deployment: removing the persistent volume and the pod at the same time so it starts over from scratch. That would actually be fine in this case too, haha! But the only thing the docs say is to restore from a snapshot, and I really don't want to mess around with backing up this data just to make it easier to manually fix things when I break them. Ideally, it could fix itself :slight_smile:

So let me know if this is unexpected behavior, or if you have any idea what I did wrong to make my cluster unusually fragile.

{"type": "server", "timestamp": "2020-01-03T01:17:27,379Z", "level": "INFO", "component
": "o.e.n.Node", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0"
, "message": "version[7.5.1], pid[1], build[default/docker/3ae9ac9a93c95bd0cdc054951cf9
5d88e1e18d96/2019-12-16T22:57:37.835892Z], OS[Linux/4.4.0-169-generic/amd64], JVM[Adopt
OpenJDK/OpenJDK 64-Bit Server VM/13.0.1/13.0.1+9]" }
{"type": "server", "timestamp": "2020-01-03T01:17:27,379Z", "level": "INFO", "component
": "o.e.n.Node", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0"
, "message": "JVM home [/usr/share/elasticsearch/jdk]" }
{"type": "server", "timestamp": "2020-01-03T01:17:27,380Z", "level": "INFO", "component
": "o.e.n.Node", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0"
, "message": "JVM arguments [-Des.networkaddress.cache.ttl=60, -Des.networkaddress.cach
e.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encodi
ng=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -
Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.n
etty.allocator.numDirectArenas=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.j
mx=true, -Djava.locale.providers=COMPAT, -Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, -XX:C
MSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -Djava.io.tmpdir=
/tmp/elasticsearch-14555452503184840303, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpP
ath=data, -XX:ErrorFile=logs/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=lo
gs/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -Des.cgroups.hierarchy.override=/
, -Xmx350M, -Xms350M, -XX:MaxDirectMemorySize=183500800, -Des.path.home=/usr/share/elas
ticsearch, -Des.path.conf=/usr/share/elasticsearch/config, -Des.distribution.flavor=def
ault, -Des.distribution.type=docker, -Des.bundled_jdk=true]" }

... activating modules...

{"type": "server", "timestamp": "2020-01-03T01:17:52,465Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "starting ..." }
{"type": "server", "timestamp": "2020-01-03T01:17:52,879Z", "level": "INFO", "component": "o.e.t.TransportService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "publish_address {10.42.6.72:9300}, bound_addresses {0.0.0.0:9300}" }
{"type": "server", "timestamp": "2020-01-03T01:17:54,084Z", "level": "INFO", "component": "o.e.b.BootstrapChecks", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "bound or publishing to a non-loopback address, enforcing bootstrap checks" }
{"type": "server", "timestamp": "2020-01-03T01:17:54,088Z", "level": "INFO", "component": "o.e.c.c.Coordinator", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "cluster UUID [YWzI-AORQ7W4-dmByOTWMg]" }
{"type": "server", "timestamp": "2020-01-03T01:17:54,471Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "elected-as-master ([1] nodes joined)[{elasticsearch-master-0}{ajuGxzySQwWXjzM7kkHIPQ}{45DL3ympT2ec548D14yB7A}{10.42.6.72}{10.42.6.72:9300}{dilm}{ml.machine_memory=786432000, xpack.installed=true, ml.max_open_jobs=20} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 2, version: 174, delta: master node changed {previous [], current [{elasticsearch-master-0}{ajuGxzySQwWXjzM7kkHIPQ}{45DL3ympT2ec548D14yB7A}{10.42.6.72}{10.42.6.72:9300}{dilm}{ml.machine_memory=786432000, xpack.installed=true, ml.max_open_jobs=20}]}" }
{"type": "server", "timestamp": "2020-01-03T01:17:55,262Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "master node changed {previous [], current [{elasticsearch-master-0}{ajuGxzySQwWXjzM7kkHIPQ}{45DL3ympT2ec548D14yB7A}{10.42.6.72}{10.42.6.72:9300}{dilm}{ml.machine_memory=786432000, xpack.installed=true, ml.max_open_jobs=20}]}, term: 2, version: 174, reason: Publication{term=2, version=174}" }
{"type": "server", "timestamp": "2020-01-03T01:17:55,547Z", "level": "INFO", "component": "o.e.h.AbstractHttpServerTransport", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "publish_address {10.42.6.72:9200}, bound_addresses {0.0.0.0:9200}", "cluster.uuid": "YWzI-AORQ7W4-dmByOTWMg", "node.id": "ajuGxzySQwWXjzM
7kkHIPQ"  }
{"type": "server", "timestamp": "2020-01-03T01:17:55,548Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "started", "cluster.uuid": "YWzI-AORQ7W4-dmByOTWMg", "node.id": "ajuGxzySQ
wWXjzM7kkHIPQ"  }
{"type": "server", "timestamp": "2020-01-03T01:17:57,086Z", "level": "INFO", "component": "o.e.l.LicenseService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "license [13c6dc7b-b760-4a4f-ab39-ede339a3bbd6] mode [basic] - valid", "cluster.uuid": "YWzI-AORQ7W4-dmByOTWMg", "node.id": "ajuGxzySQwWXjzM7kkHIPQ"  }
{"type": "server", "timestamp": "2020-01-03T01:17:57,088Z", "level": "INFO", "component": "o.e.x.s.s.SecurityStatusChangeListener", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "Active license is now [BASIC]; Security is enabled", "cluster.uuid": "YWzI-AORQ7W4-dmByOTWMg", "node.id": "ajuGxzySQwWXjzM7kkHIPQ"}
{"type": "server", "timestamp": "2020-01-03T01:17:57,158Z", "level": "INFO", "component": "o.e.g.GatewayService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "recovered [11] indices into cluster_state", "cluster.uuid": "YWzI-AORQ7W4-dmByOTWMg", "node.id": "ajuGxzySQwWXjzM7kkHIPQ"  }
{"type": "server", "timestamp": "2020-01-03T01:18:06,979Z", "level": "INFO", "component": "o.e.c.r.a.AllocationService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[metricbeat-7.5.1-2019.12.31-000001][0]]]).", "cluster.uuid": "YWzI-AORQ7W4-dmByOTWMg", "node.id": "ajuGxzySQwWXjzM7kkHIPQ"  }
{"type": "server", "timestamp": "2020-01-03T01:30:00,117Z", "level": "INFO", "component": "o.e.x.s.SnapshotRetentionTask", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "starting SLM retention snapshot cleanup task", "cluster.uuid": "YWzI-AORQ7W4-dmByOTWMg", "node.id": ajuGxzySQwWXjzM7kkHIPQ"  }

This one-node cluster looks to have started up successfully. It's normal that it doesn't log anything more once it's finished starting up like this. Can you clarify what exactly you mean by "the ES cluster never comes back up"?

The other services (Metricbeat, Kibana, etc.) are all down at the moment, saying they can't contact the Elasticsearch cluster, and I get Kubernetes events saying the pod hasn't passed its health checks (though it is not getting cycled/recreated).

Readiness probe failed: Waiting for elasticsearch cluster to become cluster to be ready (request params: "wait_for_status=green&timeout=1s" ) Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )
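That message comes from the pod's readiness probe; I'm seeing it via something like the following (assuming the default pod name in the current namespace):

    # Recent events for the pod, including the readiness probe failures
    kubectl describe pod elasticsearch-master-0

    # Or just the warning events in the namespace, newest last
    kubectl get events --field-selector type=Warning --sort-by=.lastTimestamp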

If it's expected that the log output should stop there... maybe the issue is just the health check being too aggressive. It renders as:

  readinessProbe:
    exec:
      command:
        - sh
        - '-c'
        - >
          #!/usr/bin/env bash -e

          # If the node is starting up wait for the cluster to be ready (request params: 'wait_for_status=green&timeout=1s' )

          # Once it has started only check that the node itself is responding

          START_FILE=/tmp/.es_start_file


          http () {
              local path="${1}"
              if [ -n "${ELASTIC_USERNAME}" ] && [ -n "${ELASTIC_PASSWORD}" ]; then
                BASIC_AUTH="-u ${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}"
              else
                BASIC_AUTH=''
              fi
              curl -XGET -s -k --fail ${BASIC_AUTH} https://127.0.0.1:9200${path}
          }


          if [ -f "${START_FILE}" ]; then
              echo 'Elasticsearch is already running, lets check the node is healthy and there are master nodes available'
              http "/_cluster/health?timeout=0s"
          else
              echo 'Waiting for elasticsearch cluster to become cluster to be ready (request params: "wait_for_status=green&timeout=1s" )'
              if http "/_cluster/health?wait_for_status=green&timeout=1s" ; then
                  touch ${START_FILE}
                  exit 0
              else
                  echo 'Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )'
                  exit 1
              fi
          fi

Yes I think this health check is too aggressive. It is waiting for green health (all shards assigned) but your cluster health is yellow (all primaries assigned, but some replicas are unassigned), and since there's only one node there is nowhere to assign the replicas and the cluster will never become green. It works the first time on an empty cluster because an empty cluster has green health, but as soon as you create some indices with replicas it will fail.
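You can confirm this from outside the probe with something along these lines, using the same credentials the probe script uses (adjust the host if you're not exec'ing into the pod):

    # Cluster health: expect "status":"yellow" and a non-zero unassigned_shards count
    curl -sk -u "${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}" \
      "https://127.0.0.1:9200/_cluster/health?pretty"

    # Ask why a shard is unassigned; with a single node the answer for each
    # replica is that there is no second node to allocate it to
    curl -sk -u "${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}" \
      "https://127.0.0.1:9200/_cluster/allocation/explain?pretty"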

One solution is not to create indices with replicas, but I also think the readiness check can be weakened. I think GET /_cluster/health?timeout=0s is a better idea since it checks that there is an elected master node, but ECK uses simply GET /.
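If you do want to go the no-replicas route, a rough sketch of that (the template name and the catch-all pattern are just examples, so scope them to what you actually index):

    # Drop replicas on existing indices so a one-node cluster can reach green
    curl -sk -u "${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}" -XPUT \
      -H 'Content-Type: application/json' \
      "https://127.0.0.1:9200/_all/_settings" \
      -d '{"index": {"number_of_replicas": 0}}'

    # Default new indices to zero replicas via a (legacy) index template
    curl -sk -u "${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}" -XPUT \
      -H 'Content-Type: application/json' \
      "https://127.0.0.1:9200/_template/zero-replicas" \
      -d '{"index_patterns": ["*"], "settings": {"number_of_replicas": 0}}'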


Indeed! Relaxing that health check was enough to let the stack recover. Thinking back to last time: even with three nodes, the first pod needs to pass the check before Kubernetes brings up the second, and one node by itself will always be yellow.

For anybody finding this from Google, you can set

clusterHealthCheckParams: "wait_for_status=yellow&timeout=1s"

in the elastic Helm chart to work around this problem.
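If you're installing the chart directly with Helm rather than through Argo CD like I am, that looks something like this (the release name is just an example):

    # Add the Elastic Helm repo if you haven't already
    helm repo add elastic https://helm.elastic.co

    # Relax the readiness check to accept yellow health
    helm upgrade --install elasticsearch elastic/elasticsearch \
      --version 7.5.1 \
      --set clusterHealthCheckParams="wait_for_status=yellow&timeout=1s"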

I'm not sure this is OK for a cluster with more than one node if the nodes start up one at a time and wait for the check to pass. With two or more nodes you need more than half of them (so more than one) to be running to elect a master, and the cluster health might be red until all the data nodes are running.
