ES fails to restart after reboot

version 7.17.1 running on ubuntu -- started from systemctl

When ever the server is rebooted ES fails to restart properly. Subsequent manual restart works just fine.

[2023-07-16T01:37:33,109][INFO ][o.e.p.PluginsService     ] [secesprd06] no plugins loaded
[2023-07-16T01:37:33,223][INFO ][o.e.e.NodeEnvironment    ] [secesprd06] using [1] data paths, mounts [[/data (/dev/mapper/vgdata-lvdata)]], net usable_space [710.8gb], net total_space [1.4tb], types [xfs]
[2023-07-16T01:37:33,224][INFO ][o.e.e.NodeEnvironment    ] [secesprd06] heap size [30gb], compressed ordinary object pointers [true]
[2023-07-16T01:37:34,192][INFO ][o.e.n.Node               ] [secesprd06] node name [secesprd06], node ID [XcpMS9QmT06yxJ0WX9avlQ], cluster name [security], roles [data_cold]
[2023-07-16T01:37:44,551][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [secesprd06] [controller/2881] [Main.cc@122] controller (64 bit): Version 7.17.1 (Build 6d8a28b39bf223) Copyright (c) 2022 Elasticsearch BV
[2023-07-16T01:37:45,240][INFO ][o.e.x.s.a.Realms         ] [secesprd06] license mode is [trial], currently licensed security realms are [reserved/reserved,file/default_file,native/default_native]
[2023-07-16T01:37:45,253][INFO ][o.e.x.s.a.s.FileRolesStore] [secesprd06] parsed [0] roles from file [/etc/elasticsearch/roles.yml]
[2023-07-16T01:37:47,513][INFO ][o.e.i.g.ConfigDatabases  ] [secesprd06] initialized default databases [[GeoLite2-Country.mmdb, GeoLite2-City.mmdb, GeoLite2-ASN.mmdb]], config databases [[]] and watching [/etc/elasticsearch/ingest-geoip] for changes
[2023-07-16T01:37:47,578][INFO ][o.e.i.g.DatabaseNodeService] [secesprd06] initialized database registry, using geoip-databases directory [/tmp/geoip-databases/XcpMS9QmT06yxJ0WX9avlQ]
[2023-07-16T01:37:49,492][INFO ][o.e.t.NettyAllocator     ] [secesprd06] creating NettyAllocator with the following configs: [name=elasticsearch_configured, chunk_size=1mb, suggested_max_allocation_size=1mb, factors={es.unsafe.use_netty_default_chunk_and_page_size=false, g1gc_enabled=true, g1gc_region_size=16mb}]
[2023-07-16T01:37:49,605][INFO ][o.e.i.r.RecoverySettings ] [secesprd06] using rate limit [125mb] with [default=125mb, read=0b, write=0b, max=0b]
[2023-07-16T01:37:49,703][INFO ][o.e.d.DiscoveryModule    ] [secesprd06] using discovery type [zen] and seed hosts providers [settings]
[2023-07-16T01:37:50,595][INFO ][o.e.g.DanglingIndicesState] [secesprd06] gateway.auto_import_dangling_indices is disabled, dangling indices will not be automatically detected or imported and must be managed manually
[2023-07-16T01:37:51,419][INFO ][o.e.n.Node               ] [secesprd06] initialized
[2023-07-16T01:37:51,420][INFO ][o.e.n.Node               ] [secesprd06] starting ...
[2023-07-16T01:37:51,528][INFO ][o.e.x.s.c.f.PersistentCache] [secesprd06] persistent cache index loaded
[2023-07-16T01:37:51,530][INFO ][o.e.x.d.l.DeprecationIndexingComponent] [secesprd06] deprecation component started
[2023-07-16T01:37:51,890][INFO ][o.e.t.TransportService   ] [secesprd06] publish_address {130.216.236.249:9300}, bound_addresses {[::]:9300}
[2023-07-16T01:37:56,050][INFO ][o.e.b.BootstrapChecks    ] [secesprd06] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2023-07-16T01:37:56,053][INFO ][o.e.c.c.Coordinator      ] [secesprd06] cluster UUID [CjDhDttPTLafKvn-MYx3vA]
[2023-07-16T01:37:56,872][INFO ][o.e.n.Node               ] [secesprd06] stopping ...
[2023-07-16T01:37:56,893][INFO ][o.e.x.w.WatcherService   ] [secesprd06] stopping watch service, reason [shutdown initiated]
[2023-07-16T01:37:56,894][INFO ][o.e.x.w.WatcherLifeCycleService] [secesprd06] watcher has stopped and shutdown
[2023-07-16T01:37:56,909][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [secesprd06] [controller/2881] [Main.cc@174] ML controller exiting
[2023-07-16T01:37:56,910][INFO ][o.e.x.m.p.NativeController] [secesprd06] Native controller process has stopped - no new native processes can be started
[2023-07-16T01:37:57,244][INFO ][o.e.h.AbstractHttpServerTransport] [secesprd06] publish_address {130.216.236.249:9200}, bound_addresses {[::]:9200}
[2023-07-16T01:37:57,245][INFO ][o.e.n.Node               ] [secesprd06] started
[2023-07-16T01:37:57,266][INFO ][o.e.n.Node               ] [secesprd06] stopped
[2023-07-16T01:37:57,266][INFO ][o.e.n.Node               ] [secesprd06] closing ...
[2023-07-16T01:37:57,332][INFO ][o.e.n.Node               ] [secesprd06] closed

Any idea what is causing this?

Other nodes in the cluster which have the same configuration do not have this problem.

I suspect it's the situation described by "TIP: Startup timeouts with older systemd versions" in these docs, except that the startup timeout is only 75s in 7.17.1 (see this PR). If so, either upgrade your systemd, or Elasticsearch, or else adjust the startup timeout yourself.

Thanks David

As usual it seems that I did not give enough details : (

In this case I really want to know why ES is taking so long to start up after a reboot. When I manually restart ES it starts fine in under a minute. An identical VM running ES reboots and ES starts in the just fine.

I suspect some issue with VM but without some diagnostics ...

I see. That's a trickier one to answer. Maybe all the nodes are taking just under 75s and this one is just marginally slower? In any case the default timeout is 900s now because we don't really consider minutes-long startup times to be particularly unusual.

Once the node is up does it have performance in line with other nodes? If not, that's worth investigating, and should be easier to investigate too because you're looking at something more like a steady-state situation. Capturing profiling data at startup is probably possible, but I'm not sure how I'd approach it.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.