ES 7.3 - restarting data node doesn't rejoin cluster

We have the following cluster:

3 x ES 7.3 on CentOS 7.8 VMs, "master eligible" + coordinating nodes
15 x ES 7.3 on CentOS 7.8 physical nodes, data-only
1 x ES 7.3 on CentOS 7.8 VM, running on the Kibana node purely to keep those lookups on localhost

The data hosts have two "drives" each - an OS drive and an XFS-formatted 3 TB NVMe mounted at /var/lib/elasticsearch.

We can install OS patches and reboot the coordinating/"eligible" nodes all day, no problem.

After a reboot, though, a physical node fails to rejoin the cluster, and the only way we've found to get it back in is to:

  • stop Elasticsearch on that host (on reboot it fails to start so it's just a formality)
  • delete everything in /var/lib/elasticsearch
  • restart Elasticsearch

At that point the node happily rejoins the cluster, the indices rebalance, and everything else goes as you'd expect.
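For concreteness, the recovery steps above amount to something like the following. This is a sketch against a throwaway directory rather than the live data path; on a real host it's of course wrapped in a `systemctl stop`/`start` of the service, and the layout under `DATA_DIR` here is just a stand-in.

```shell
# Stand-in demo: wipe everything under a data path while keeping the
# directory itself (it's a mount point, so ownership/permissions matter).
# DATA_DIR is a scratch directory, not the live /var/lib/elasticsearch.
DATA_DIR=$(mktemp -d)
mkdir -p "$DATA_DIR/nodes/0/indices"
touch "$DATA_DIR/nodes/0/node.lock"

# Delete the contents, not the mount point itself
find "$DATA_DIR" -mindepth 1 -delete

ls -A "$DATA_DIR"   # prints nothing: the node comes back as brand-new
```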

The only way for me to get the actual error is to intentionally fail one of our physical nodes. I can do that, but before I do, I'm curious whether this sounds familiar to anyone as a known-but-niche bug specific to ES 7.3.0 on CentOS 7.8 on XFS (or to setups where /var/lib/elasticsearch is on a separate drive from the OS).

Across the (literally) hundreds of installs I've done and the dozens of small clusters (<20 nodes) I've managed, I've never seen this, and I haven't been able to replicate it. It doesn't matter which physical node goes down; if any one in that cluster reboots, we end up having to remove all the data on that node and wait for everything to rebalance.

Any guidance on where to start troubleshooting this would be awesome. We do have a 7.3 -> 7.9/7.10 upgrade planned for the end of December, but I'm really, REALLY hesitant to do that until we have a better idea of what's going on.

We're using the Basic licence and TLS with certs from our internal CA. Each host has a routed IP for management and a non-routed IP that ES uses (to keep that traffic confined to that network, even though it's encrypted between hosts). I can post a somewhat redacted config if necessary, but it's the same config we used on Ubuntu 18.04 LTS on the same hardware with no issues.

Thanks everyone!

Welcome to our community! :smiley:

What do the logs from the node that has restarted, and the current master node at that time show?

Thanks, Mark! We've chatted quite a bit, I'm kmwilcox_ on Twitter :wink:

Here's the log from the data node that wouldn't rejoin -- this set was repeated each time I'd do a

systemctl restart elasticsearch

and I did verify that after "native controller stopped", no Elasticsearch processes were running.

    [2020-11-05T19:02:38,482][INFO ][o.e.n.Node               ] [siemdata15] stopping ...
    [2020-11-05T19:02:38,534][INFO ][o.e.x.w.WatcherService   ] [siemdata15] stopping watch service, reason [shutdown initiated]
    [2020-11-05T19:02:38,689][INFO ][o.e.c.c.Coordinator      ] [siemdata15] master node [{geardachadh}{sI-w37AFTASGHuh4-9GVxA}{2wS9KYSfQg-cTg_78sLMJA}{192.168.100.60}{192.168.100.60:9300}{im}{ml.machine_memory=16637517824, ml.max_open_jobs=20, xpack.installed=true}] failed, restarting discovery
    org.elasticsearch.transport.NodeDisconnectedException: [geardachadh][192.168.100.60:9300][disconnected] disconnected
    [2020-11-05T19:02:38,697][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [siemdata15] [controller/6546] [Main.cc@150] Ml controller exiting
    [2020-11-05T19:02:38,700][INFO ][o.e.x.m.p.NativeController] [siemdata15] Native controller process has stopped - no new native processes can be started
    [2020-11-05T19:02:38,761][WARN ][o.e.i.s.RetentionLeaseSyncAction] [siemdata15] [itis-windows-desktops-2020.09.24][1] retention lease background sync failed
    org.elasticsearch.transport.SendRequestTransportException: [siemdata15][192.168.100.46:9300][indices:admin/seq_no/retention_lease_background_sync[p]]
    [2020-11-05T19:02:38,948][WARN ][o.e.i.s.RetentionLeaseSyncAction] [siemdata15] [itis-windows-servers-duo-events-2020.12][3] retention lease background sync failed
    org.elasticsearch.transport.SendRequestTransportException: [siemdata15][192.168.100.46:9300][indices:admin/seq_no/retention_lease_background_sync[p]]
    [2020-11-05T19:02:39,070][WARN ][o.e.i.s.RetentionLeaseSyncAction] [siemdata15] [itis-windows-desktops-2020.09.18][4] retention lease background sync failed
    org.elasticsearch.transport.SendRequestTransportException: [siemdata15][192.168.100.46:9300][indices:admin/seq_no/retention_lease_background_sync[p]]
    [2020-11-05T19:02:42,451][INFO ][o.e.n.Node               ] [siemdata15] stopped
    [2020-11-05T19:02:42,451][INFO ][o.e.n.Node               ] [siemdata15] closing ...
    [2020-11-05T19:02:42,493][INFO ][o.e.n.Node               ] [siemdata15] closed
    [2020-11-05T19:06:20,252][INFO ][o.e.e.NodeEnvironment    ] [siemdata15] using [1] data paths, mounts [[/var/lib/elasticsearch (/dev/nvme0n1)]], net usable_space [1.3tb], net total_space [2.9tb], types [xfs]
    [2020-11-05T19:06:20,256][INFO ][o.e.e.NodeEnvironment    ] [siemdata15] heap size [29.8gb], compressed ordinary object pointers [true]
    [2020-11-05T19:06:29,102][INFO ][o.e.n.Node               ] [siemdata15] node name [siemdata15], node ID [428f6b6dRAup9vP5p6vtkg], cluster name [siem]
    [2020-11-05T19:06:29,105][INFO ][o.e.n.Node               ] [siemdata15] version[7.3.0], pid[1322], build[default/rpm/de777fa/2019-07-24T18:30:11.767338Z], OS[Linux/3.10.0-1127.19.1.el7.x86_64/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/12.0.1/12.0.1+12]
    [2020-11-05T19:06:29,105][INFO ][o.e.n.Node               ] [siemdata15] JVM home [/usr/share/elasticsearch/jdk]
    [2020-11-05T19:06:29,106][INFO ][o.e.n.Node               ] [siemdata15] JVM arguments [-Xms30g, -Xmx30g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.io.tmpdir=/tmp/elasticsearch-8327095790812764947, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/var/lib/elasticsearch, -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -Djava.locale.providers=COMPAT, -Dio.netty.allocator.type=pooled, -XX:MaxDirectMemorySize=16106127360, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/etc/elasticsearch, -Des.distribution.flavor=default, -Des.distribution.type=rpm, -Des.bundled_jdk=true]
    [2020-11-05T19:06:30,727][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [aggs-matrix-stats]
    [2020-11-05T19:06:30,727][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [analysis-common]
    [2020-11-05T19:06:30,727][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [data-frame]
    [2020-11-05T19:06:30,727][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [flattened]
    [2020-11-05T19:06:30,727][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [ingest-common]
    [2020-11-05T19:06:30,727][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [ingest-geoip]
    [2020-11-05T19:06:30,728][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [ingest-user-agent]
    [2020-11-05T19:06:30,728][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [lang-expression]
    [2020-11-05T19:06:30,728][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [lang-mustache]
    [2020-11-05T19:06:30,728][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [lang-painless]
    [2020-11-05T19:06:30,728][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [mapper-extras]
    [2020-11-05T19:06:30,728][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [parent-join]
    [2020-11-05T19:06:30,728][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [percolator]
    [2020-11-05T19:06:30,729][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [rank-eval]
    [2020-11-05T19:06:30,729][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [reindex]
    [2020-11-05T19:06:30,729][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [repository-url]
    [2020-11-05T19:06:30,729][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [transport-netty4]
    [2020-11-05T19:06:30,729][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [vectors]
    [2020-11-05T19:06:30,729][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [x-pack-ccr]
    [2020-11-05T19:06:30,729][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [x-pack-core]
    [2020-11-05T19:06:30,729][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [x-pack-deprecation]
    [2020-11-05T19:06:30,729][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [x-pack-graph]
    [2020-11-05T19:06:30,729][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [x-pack-ilm]
    [2020-11-05T19:06:30,730][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [x-pack-logstash]
    [2020-11-05T19:06:30,730][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [x-pack-ml]
    [2020-11-05T19:06:30,730][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [x-pack-monitoring]
    [2020-11-05T19:06:30,730][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [x-pack-rollup]
    [2020-11-05T19:06:30,730][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [x-pack-security]
    [2020-11-05T19:06:30,730][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [x-pack-sql]
    [2020-11-05T19:06:30,730][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [x-pack-voting-only-node]
    [2020-11-05T19:06:30,730][INFO ][o.e.p.PluginsService     ] [siemdata15] loaded module [x-pack-watcher]
    [2020-11-05T19:06:30,731][INFO ][o.e.p.PluginsService     ] [siemdata15] no plugins loaded
    [2020-11-05T19:06:33,557][INFO ][o.e.x.s.a.s.FileRolesStore] [siemdata15] parsed [0] roles from file [/etc/elasticsearch/roles.yml]
    [2020-11-05T19:06:33,997][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [siemdata15] [controller/5706] [Main.cc@110] controller (64 bit): Version 7.3.0 (Build ff2f774f78ce63) Copyright (c) 2019 Elasticsearch BV
    [2020-11-05T19:06:34,300][DEBUG][o.e.a.ActionModule       ] [siemdata15] Using REST wrapper from plugin org.elasticsearch.xpack.security.Security
    [2020-11-05T19:06:34,562][ERROR][o.e.g.GatewayMetaState   ] [siemdata15] failed to read or upgrade local state, exiting...
    java.io.IOException: failed to find metadata for existing index ois-duo-admin-2020.02.14 [location: BVqRvM9zSK6KLtHuA6HB_A, generation: 13]
    [2020-11-05T19:06:34,569][ERROR][o.e.b.Bootstrap          ] [siemdata15] Exception
    org.elasticsearch.ElasticsearchException: failed to bind service
            at org.elasticsearch.node.Node.<init>(Node.java:617) ~[elasticsearch-7.3.0.jar:7.3.0]
    Caused by: java.io.IOException: failed to find metadata for existing index ois-duo-admin-2020.02.14 [location: BVqRvM9zSK6KLtHuA6HB_A, generation: 13]
            at org.elasticsearch.gateway.MetaStateService.loadFullState(MetaStateService.java:99) ~[elasticsearch-7.3.0.jar:7.3.0]
    [siemdata15] uncaught exception in thread [main]
    org.elasticsearch.bootstrap.StartupException: ElasticsearchException[failed to bind service]; nested: IOException[failed to find metadata for existing index ois-duo-admin-2020.02.14 [location: BVqRvM9zSK6KLtHuA6HB_A, generation: 13]];
            at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:163) ~[elasticsearch-7.3.0.jar:7.3.0]
    [2020-11-05T19:06:34,576][INFO ][o.e.x.m.p.NativeController] [siemdata15] Native controller process has stopped - no new native processes can be started

I've pulled out some of the "at org.elasticsearch. ... " lines because they're repetitive, but left the first line of each error. Let me know if you need the full Java trace.

And here is the log from the "master" at the same time:

    {"type": "server", "timestamp": "2020-11-05T19:02:42,320+0000", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "siem", "node.name": "geardachadh", "cluster.uuid": "9ZHB7pI8S6id2d6ia5nzRQ", "node.id": "sI-w37AFTASGHuh4-9GVxA",  "message": "node-left[{siemdata15}{428f6b6dRAup9vP5p6vtkg}{5l1rSieTRT-8Clv_Nz7L9Q}{192.168.100.46}{192.168.100.46:9300}{di}{ml.machine_memory=67011698688, ml.max_open_jobs=20, xpack.installed=true} disconnected], term: 8804, version: 1932323, reason: removed {{siemdata15}{428f6b6dRAup9vP5p6vtkg}{5l1rSieTRT-8Clv_Nz7L9Q}{192.168.100.46}{192.168.100.46:9300}{di}{ml.machine_memory=67011698688, ml.max_open_jobs=20, xpack.installed=true},}"  }
    {"type": "server", "timestamp": "2020-11-05T19:02:50,462+0000", "level": "DEBUG", "component": "o.e.a.a.i.r.TransportRecoveryAction", "cluster.name": "siem", "node.name": "geardachadh", "cluster.uuid": "9ZHB7pI8S6id2d6ia5nzRQ", "node.id": "sI-w37AFTASGHuh4-9GVxA",  "message": "failed to execute [indices:monitor/recovery] on node [428f6b6dRAup9vP5p6vtkg]" ,
    {"type": "server", "timestamp": "2020-11-05T19:02:50,545+0000", "level": "DEBUG", "component": "o.e.a.a.i.s.TransportIndicesStatsAction", "cluster.name": "siem", "node.name": "geardachadh", "cluster.uuid": "9ZHB7pI8S6id2d6ia5nzRQ", "node.id": "sI-w37AFTASGHuh4-9GVxA",  "message": "failed to execute [indices:monitor/stats] on node [428f6b6dRAup9vP5p6vtkg]" ,
    {"type": "server", "timestamp": "2020-11-05T19:02:50,867+0000", "level": "DEBUG", "component": "o.e.a.a.c.s.TransportClusterStatsAction", "cluster.name": "siem", "node.name": "geardachadh", "cluster.uuid": "9ZHB7pI8S6id2d6ia5nzRQ", "node.id": "sI-w37AFTASGHuh4-9GVxA",  "message": "failed to execute on node [428f6b6dRAup9vP5p6vtkg]" ,
    {"type": "server", "timestamp": "2020-11-05T19:02:52,055+0000", "level": "DEBUG", "component": "o.e.a.a.c.n.s.TransportNodesStatsAction", "cluster.name": "siem", "node.name": "geardachadh", "cluster.uuid": "9ZHB7pI8S6id2d6ia5nzRQ", "node.id": "sI-w37AFTASGHuh4-9GVxA",  "message": "failed to execute on node [428f6b6dRAup9vP5p6vtkg]" ,
    {"type": "server", "timestamp": "2020-11-05T19:02:52,161+0000", "level": "DEBUG", "component": "o.e.a.a.i.s.TransportIndicesStatsAction", "cluster.name": "siem", "node.name": "geardachadh", "cluster.uuid": "9ZHB7pI8S6id2d6ia5nzRQ", "node.id": "sI-w37AFTASGHuh4-9GVxA",  "message": "failed to execute [indices:monitor/stats] on node [428f6b6dRAup9vP5p6vtkg]" ,
    {"type": "server", "timestamp": "2020-11-05T19:02:52,695+0000", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "siem", "node.name": "geardachadh", "cluster.uuid": "9ZHB7pI8S6id2d6ia5nzRQ", "node.id": "sI-w37AFTASGHuh4-9GVxA",  "message": "removed {{siemdata15}{428f6b6dRAup9vP5p6vtkg}{5l1rSieTRT-8Clv_Nz7L9Q}{192.168.100.46}{192.168.100.46:9300}{di}{ml.machine_memory=67011698688, ml.max_open_jobs=20, xpack.installed=true},}, term: 8804, version: 1932323, reason: Publication{term=8804, version=1932323}"  }
    {"type": "server", "timestamp": "2020-11-05T19:03:15,454+0000", "level": "WARN", "component": "o.e.c.NodeConnectionsService", "cluster.name": "siem", "node.name": "geardachadh", "cluster.uuid": "9ZHB7pI8S6id2d6ia5nzRQ", "node.id": "sI-w37AFTASGHuh4-9GVxA",  "message": "failed to connect to {siemdata15}{428f6b6dRAup9vP5p6vtkg}{5l1rSieTRT-8Clv_Nz7L9Q}{192.168.100.46}{192.168.100.46:9300}{di}{ml.machine_memory=67011698688, ml.max_open_jobs=20, xpack.installed=true} (tried [1] times)" ,

The next thing in that system's log that references either siemdata15 (the node that restarted) or its node id is at 19:28, after I'd cleaned out /var/lib/elasticsearch and effectively brought it in as a new node.
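In case anyone wants to do the same check, finding every mention of a departed node in the master's JSON-formatted server log is just a grep for its node id. A hypothetical example against a fabricated log file (the node id is the one from our logs; the log lines below are made-up stand-ins, not real output):

```shell
# Illustrative only: count events in a master's JSON log that mention a
# departed node's id. The log contents here are fabricated stand-ins.
NODE_ID="428f6b6dRAup9vP5p6vtkg"
LOG=$(mktemp)
cat > "$LOG" <<EOF
{"component": "o.e.c.s.MasterService", "message": "node-left[{siemdata15}{$NODE_ID} disconnected]"}
{"component": "o.e.c.s.ClusterApplierService", "message": "removed {{siemdata15}{$NODE_ID}}"}
{"component": "o.e.c.s.MasterService", "message": "node-join[{siemdata03}{anotherNodeId}]"}
EOF
grep -c "$NODE_ID" "$LOG"   # → 2
```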


This looks like a known bug that relates to closed indices and was fixed in 7.4.1. There is a workaround, but an upgrade is the right way to address this.


Brilliant! Thanks, David!! I had tried searching for similar behaviour before finally posting here, but I was searching for ES failing to restart, for ES + XFS bugs, ES + NVMe bugs, and so on. Looks like I got to live up to my nick =)

That did the trick -- I stopped Elasticsearch, tried a restart, got the error, stopped the service, removed the manifest

$DATA_PATH/nodes/0/_state/manifest-<DIGITS>.st

and the service restarted error-free. Looking forward more and more to that 7.10 upgrade in a month...
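For anyone who lands here with the same error: the fix boils down to removing just that manifest file while the service is stopped. Here's a minimal sketch of the file side of it, run against a throwaway directory rather than a live data path (on a real node you'd `systemctl stop elasticsearch` first, point DATA_PATH at the real mount, and your node directory may not be `nodes/0`):

```shell
# Demonstrated against a scratch directory standing in for the real
# data path; on a live node, stop the service first and point
# DATA_PATH at the actual /var/lib/elasticsearch.
DATA_PATH=$(mktemp -d)
mkdir -p "$DATA_PATH/nodes/0/_state"
touch "$DATA_PATH/nodes/0/_state/manifest-42.st"   # stand-in manifest
touch "$DATA_PATH/nodes/0/_state/node-7.st"        # other state files stay

# The workaround: remove only the manifest, leaving index data intact
rm "$DATA_PATH"/nodes/0/_state/manifest-*.st

ls "$DATA_PATH/nodes/0/_state"
```

Because the index data stays on disk, the node keeps its shards when it comes back, so there's no rebalance storm afterwards.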

Thank you, again!!
