Cluster recovery inside K8s

Hi all! I've spent the last month or so playing with the Elastic Stack for my side-project site. It's just a Flask/React app with virtually zero traffic, so nothing particularly fancy, but I like playing with new tech. So I'm deploying Helm charts on a small k8s cluster.

Now, I don't particularly want to run three 4 GB servers to monitor my one-server Flask app. After all, if it fails, I really don't care that much. But my problem is that if I lose quorum, I have NO IDEA how to start the cluster back up. I've seen this with 3-node and 1-node clusters: if I lose quorum (and with all the experimentation with Helm charts and Terraform, I'm replacing instances all the time), the ES cluster never comes back up. I had assumed this was expected behavior, and was thinking about running five really small servers to make it happen less often. But from what I've read, a 1-node cluster should be able to recover from a reboot without intervention.

So just now I went in and deleted my ES pod. I'm using the elastic/elasticsearch Helm chart @ 7.5.1, so it's deployed as a StatefulSet (deleting the pod does not delete the data). K8s brings a new pod up immediately on the existing disk. The new node logs a bit, then stops and sits there indefinitely (the same behavior I noticed before).

I'm including most of the logs that look relevant (the big chunk in the middle is just loading modules). Also, this is my side project; you can see the full code I used to bring this up at . It's basically off-the-shelf Helm with some config copied from the "security" example. The previous times this happened, I fixed it by nuking the whole deployment: removing the persistent volume and the pod at the same time so it starts over from scratch. That would actually be fine in this case too, haha! But the only thing the docs say is to restore from a snapshot, and I really don't want to mess around with backing up this data just to make it easier to manually fix it when I break it. Ideally, it would fix itself :slight_smile:

So let me know if this is unexpected behavior, or if you have any idea what I did wrong to make my cluster unusually fragile.

{"type": "server", "timestamp": "2020-01-03T01:17:27,379Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "version[7.5.1], pid[1], build[default/docker/3ae9ac9a93c95bd0cdc054951cf95d88e1e18d96/2019-12-16T22:57:37.835892Z], OS[Linux/4.4.0-169-generic/amd64], JVM[AdoptOpenJDK/OpenJDK 64-Bit Server VM/13.0.1/13.0.1+9]" }
{"type": "server", "timestamp": "2020-01-03T01:17:27,379Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "JVM home [/usr/share/elasticsearch/jdk]" }
{"type": "server", "timestamp": "2020-01-03T01:17:27,380Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "JVM arguments [-Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.numDirectArenas=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.locale.providers=COMPAT, -Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -Djava.io.tmpdir=/tmp/elasticsearch-14555452503184840303, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=data, -XX:ErrorFile=logs/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -Des.cgroups.hierarchy.override=/, -Xmx350M, -Xms350M, -XX:MaxDirectMemorySize=183500800, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/usr/share/elasticsearch/config, -Des.distribution.flavor=default, -Des.distribution.type=docker, -Des.bundled_jdk=true]" }

... activating modules...

{"type": "server", "timestamp": "2020-01-03T01:17:52,465Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "starting ..." }
{"type": "server", "timestamp": "2020-01-03T01:17:52,879Z", "level": "INFO", "component": "o.e.t.TransportService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "publish_address {}, bound_addresses {}" }
{"type": "server", "timestamp": "2020-01-03T01:17:54,084Z", "level": "INFO", "component": "o.e.b.BootstrapChecks", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "bound or publishing to a non-loopback address, enforcing bootstrap checks" }
{"type": "server", "timestamp": "2020-01-03T01:17:54,088Z", "level": "INFO", "component": "o.e.c.c.Coordinator", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "cluster UUID [YWzI-AORQ7W4-dmByOTWMg]" }
{"type": "server", "timestamp": "2020-01-03T01:17:54,471Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "elected-as-master ([1] nodes joined)[{elasticsearch-master-0}{ajuGxzySQwWXjzM7kkHIPQ}{45DL3ympT2ec548D14yB7A}{}{}{dilm}{ml.machine_memory=786432000, xpack.installed=true, ml.max_open_jobs=20} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 2, version: 174, delta: master node changed {previous [], current [{elasticsearch-master-0}{ajuGxzySQwWXjzM7kkHIPQ}{45DL3ympT2ec548D14yB7A}{}{}{dilm}{ml.machine_memory=786432000, xpack.installed=true, ml.max_open_jobs=20}]}" }
{"type": "server", "timestamp": "2020-01-03T01:17:55,262Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "master node changed {previous [], current [{elasticsearch-master-0}{ajuGxzySQwWXjzM7kkHIPQ}{45DL3ympT2ec548D14yB7A}{}{}{dilm}{ml.machine_memory=786432000, xpack.installed=true, ml.max_open_jobs=20}]}, term: 2, version: 174, reason: Publication{term=2, version=174}" }
{"type": "server", "timestamp": "2020-01-03T01:17:55,547Z", "level": "INFO", "component": "o.e.h.AbstractHttpServerTransport", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "publish_address {}, bound_addresses {}", "cluster.uuid": "YWzI-AORQ7W4-dmByOTWMg", "node.id": "ajuGxzySQwWXjzM7kkHIPQ" }
{"type": "server", "timestamp": "2020-01-03T01:17:55,548Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "started", "cluster.uuid": "YWzI-AORQ7W4-dmByOTWMg", "node.id": "ajuGxzySQwWXjzM7kkHIPQ" }
{"type": "server", "timestamp": "2020-01-03T01:17:57,086Z", "level": "INFO", "component": "o.e.l.LicenseService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "license [13c6dc7b-b760-4a4f-ab39-ede339a3bbd6] mode [basic] - valid", "cluster.uuid": "YWzI-AORQ7W4-dmByOTWMg", "node.id": "ajuGxzySQwWXjzM7kkHIPQ" }
{"type": "server", "timestamp": "2020-01-03T01:17:57,088Z", "level": "INFO", "component": "o.e.x.s.s.SecurityStatusChangeListener", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "Active license is now [BASIC]; Security is enabled", "cluster.uuid": "YWzI-AORQ7W4-dmByOTWMg", "node.id": "ajuGxzySQwWXjzM7kkHIPQ" }
{"type": "server", "timestamp": "2020-01-03T01:17:57,158Z", "level": "INFO", "component": "o.e.g.GatewayService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "recovered [11] indices into cluster_state", "cluster.uuid": "YWzI-AORQ7W4-dmByOTWMg", "node.id": "ajuGxzySQwWXjzM7kkHIPQ" }
{"type": "server", "timestamp": "2020-01-03T01:18:06,979Z", "level": "INFO", "component": "o.e.c.r.a.AllocationService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[metricbeat-7.5.1-2019.12.31-000001][0]]]).", "cluster.uuid": "YWzI-AORQ7W4-dmByOTWMg", "node.id": "ajuGxzySQwWXjzM7kkHIPQ" }
{"type": "server", "timestamp": "2020-01-03T01:30:00,117Z", "level": "INFO", "component": "o.e.x.s.SnapshotRetentionTask", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "starting SLM retention snapshot cleanup task", "cluster.uuid": "YWzI-AORQ7W4-dmByOTWMg", "node.id": "ajuGxzySQwWXjzM7kkHIPQ" }

This one-node cluster looks to have started up successfully. It's normal that it doesn't log anything more once it's finished starting up like this. Can you clarify what exactly you mean by "the ES cluster never comes back up"?

The other services (Metricbeat, Kibana, etc.) are all down at the moment, saying they can't contact the Elasticsearch cluster, and I get Kubernetes events saying the server hasn't passed its health checks (though it is not getting cycled/recreated).

Readiness probe failed: Waiting for elasticsearch cluster to become cluster to be ready (request params: "wait_for_status=green&timeout=1s" ) Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )

If it's expected that the log output should stop there... then maybe the issue is just the health check being too aggressive. The readiness probe renders as:

        - sh
        - '-c'
        - >
          #!/usr/bin/env bash -e

          # If the node is starting up wait for the cluster to be ready (request params: 'wait_for_status=green&timeout=1s' )

          # Once it has started only check that the node itself is responding

          START_FILE=/tmp/.es_start_file

          http () {
              local path="${1}"
              if [ -n "${ELASTIC_USERNAME}" ] && [ -n "${ELASTIC_PASSWORD}" ]; then
                  BASIC_AUTH="-u ${ELASTIC_USERNAME}:${ELASTIC_PASSWORD}"
              else
                  BASIC_AUTH=''
              fi
              curl -XGET -s -k --fail ${BASIC_AUTH}${path}
          }

          if [ -f "${START_FILE}" ]; then
              echo 'Elasticsearch is already running, lets check the node is healthy and there are master nodes available'
              http "/_cluster/health?timeout=0s"
          else
              echo 'Waiting for elasticsearch cluster to become cluster to be ready (request params: "wait_for_status=green&timeout=1s" )'
              if http "/_cluster/health?wait_for_status=green&timeout=1s" ; then
                  touch ${START_FILE}
                  exit 0
              else
                  echo 'Cluster is not yet ready (request params: "wait_for_status=green&timeout=1s" )'
                  exit 1
              fi
          fi
Yes I think this health check is too aggressive. It is waiting for green health (all shards assigned) but your cluster health is yellow (all primaries assigned, but some replicas are unassigned), and since there's only one node there is nowhere to assign the replicas and the cluster will never become green. It works the first time on an empty cluster because an empty cluster has green health, but as soon as you create some indices with replicas it will fail.

One solution is not to create indices with replicas, but I also think the readiness check can be weakened. I think GET /_cluster/health?timeout=0s is a better idea since it checks that there is an elected master node, but ECK uses simply GET /.
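If you take the no-replicas route, one way to do it is an index template that defaults new indices to zero replicas. This is a hedged sketch, not from the thread: the template name `zero-replicas`, the host/port, and the credentials are all assumptions you would adjust for your own deployment (it uses the legacy `_template` API, which is the one available in 7.5.x).

```shell
# Sketch: default every new index to zero replicas so a one-node cluster
# can reach green. Names, host, and credentials below are placeholders.
TEMPLATE='{"index_patterns": ["*"], "order": 0, "settings": {"number_of_replicas": 0}}'

# Sanity-check the JSON locally before sending it anywhere
echo "$TEMPLATE" | python3 -m json.tool > /dev/null && echo "template JSON is valid"

# Then apply it (commented out; requires a running cluster):
# curl -XPUT -u "$ELASTIC_USERNAME:$ELASTIC_PASSWORD" -k \
#   "https://localhost:9200/_template/zero-replicas" \
#   -H 'Content-Type: application/json' -d "$TEMPLATE"
```

Note this only affects indices created after the template exists; existing indices would need their `number_of_replicas` setting updated separately.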


Indeed! Relaxing that health check was enough to let the stack recover. Thinking back to the last time, even with three nodes, the first one needs to pass the check before k8s brings up the second, and one node will always be yellow by itself.

For anybody finding this from Google: you can set

clusterHealthCheckParams: "wait_for_status=yellow&timeout=1s"

in the elastic Helm chart to work around this problem.
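For context, that value goes in the chart's values file. A minimal sketch (the other values shown are just the single-node settings assumed in this thread, not required):

```yaml
# values.yaml for the elastic/elasticsearch chart (a sketch; adjust
# replicas and minimumMasterNodes for your own deployment)
replicas: 1
minimumMasterNodes: 1

# Wait for yellow instead of green, so a single node (which can never
# assign replica shards) still passes the readiness probe
clusterHealthCheckParams: "wait_for_status=yellow&timeout=1s"
```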

I'm not sure this is ok for a cluster with more than one node if the nodes start up one-at-a-time and wait for the check to pass. With two or more nodes you need more than half of them (so >1) to be running to elect a master and the cluster health might be red until all the data nodes are running.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.