Elastic cluster is going down after 2-3 hours

If I calculate correctly, that is an average shard size of around 130MB, which is very small when it comes to Elasticsearch. If your data has a long retention period I would recommend that you switch to monthly indices instead of daily, which should bring the average size up a bit.

If your retention period is too short to warrant monthly indices I suspect you have a lot of index patterns, and in that case I would probably advise you to look into combining indices.
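If you want to double-check the numbers yourself, the _cat APIs will show shard and index sizes directly. A minimal sketch, assuming the default localhost:9200 endpoint:

```
# List shard sizes, largest first, to confirm the average really is ~130MB
curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,store&s=store:desc'

# Per-index totals, useful for spotting small indices worth combining
curl -s 'localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size&s=store.size:desc'
```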

Sure @Christian_Dahlqvist

I will switch to monthly indices and try that.

I recall running some tests a long time back where I tried to put as much data on a single node as possible. If I remember correctly there was a higher proportion of memory overhead for small shards compared to larger ones. If this is still the case (a lot has however changed in Elasticsearch since I ran that test), having fewer, larger shards may decrease heap pressure and improve stability.

If you have indices that are no longer written to, you can also try to merge these into monthly indices through reindexing.

I agree with @Christian_Dahlqvist on moving to monthly indices, but I still do not see how the 6-node cluster across 3 physical nodes benefits you over a 3-node cluster. What's the logic here? Are there 2 separate IO paths that the 2 different instances can utilise?

Elara / Kore / Europa were "dm", i.e. data nodes and master-eligible

Amalthea / Kale / Callisto were "d", i.e. just data nodes

So I hope you have Elara / Kore / Europa on different physical nodes, otherwise your cluster cannot survive a node failure.

How can I merge existing indices into monthly indices?
And as far as I know, an index might store around 10GB for the application under test. How do I avoid the memory pressure if I use monthly indices?

You can use the reindex API.
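For example, something along these lines (the index names are just placeholders, adjust them to your own naming):

```
# Merge one month of daily indices into a single monthly index
curl -s -X POST 'localhost:9200/_reindex?wait_for_completion=false' \
  -H 'Content-Type: application/json' -d '
{
  "source": { "index": "blabla-2025-11-*" },
  "dest":   { "index": "blabla-2025-11" }
}'
# This returns a task id; check progress with GET _tasks/<task_id>
```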

You are still querying the same data set and have memory pressure from querying all the daily indices. I suspect a single 10GB shard may require less memory overhead than 30 corresponding daily shards of 333MB each. Note that 10GB is not large when it comes to Elasticsearch, as the recommended shard size for log and metrics use cases is generally between 30GB and 50GB.

@RainTown

We will drill down to 1 node per machine if required to eliminate the issue. But we are looking for a smooth way so that we do not lose data by doing so. Do you have any solution for this? I would like to convert it to a 3-node cluster with 1 node per machine.

OK @Christian_Dahlqvist

It's actually quite easy: just eject all the data from a node, wait until the shards all relocate elsewhere, then shut it down. You will have a green, 5-node cluster. It might take a long time, but it's not a complex procedure. Rinse and repeat 2 more times. But you need to be a bit careful that primary and replica shards don't end up on the same (physical) host. This may be a problem already, unless you have configured otherwise.
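A sketch of the settings involved, using one of your node names as an example:

```
# Drain one instance before shutting it down (node name is illustrative)
curl -s -X PUT 'localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.routing.allocation.exclude._name": "Amalthea"
  }
}'
# Wait until GET _cat/shards shows nothing left on that node, then stop it.

# While two instances share a machine, this setting keeps primary and
# replica copies off the same physical host:
curl -s -X PUT 'localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.routing.allocation.same_shard.host": true
  }
}'
```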

Personally, from what you have shared, I'd prioritize the monthly indices first. And that's less easy tbh. Because of my very first response:

If all the blabla-2025-11-* indices were effectively frozen, you could just reindex them into, say, blabla-2025-11, wait for that to complete, and eventually just delete all the blabla-2025-11-* indices. But if during that reindex you are still getting new documents into those existing indices, in sort of random date order, then that's a bit harder. And it seems you do have that issue!

So is reindexing the suggested approach, or should I follow some other process?

Good question. Yes, I would still do it. You can block writes to the blabla-2025-11-* indices too. But you know your data flows: why are logs, if they are logs, from 26 Nov being ingested on 10 Dec? Doesn't that indicate an issue somewhere else?
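For the write block, something like the following should work (index pattern as in the example above; the _block API exists in recent versions, on older ones use the index setting instead):

```
# Block writes on the old daily indices before reindexing them
curl -s -X PUT 'localhost:9200/blabla-2025-11-*/_block/write'

# Alternative via index settings
curl -s -X PUT 'localhost:9200/blabla-2025-11-*/_settings' \
  -H 'Content-Type: application/json' -d '{ "index.blocks.write": true }'
```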

EDIT: if you start 6 months back and do July 2025 first, I hope this is less of an issue. But without downtime it's going to be hard to do piecemeal if data and indices are just appearing in semi-random order.

All this discussion about sizing & sharding & memory usage and so on is good but it’s not really answering the OP’s question. Why is this node going down?

This sequence of messages indicates a normal process exit: the node is shutting down gracefully because it’s being asked to shut down by some external influence sending it a signal (usually a SIGTERM but possibly something else, depending on how exactly the process is being managed).

It doesn’t matter how much you change any of the aspects of this cluster in the ways discussed above, it will always exit if asked to do so as we are seeing here.


You are absolutely right. Maybe if set up differently, meant in a generic way, it won't be gracefully shut down every 2-3 hours. But indeed, the cart (what to do) got a bit ahead of the horse (what is actually wrong in your specific case).

The question was asked but rather skipped.

@sathish12

Your system is reported as CentOS 8. You said you used the .tar.gz file, so how do you start/stop the 2x Elasticsearch instances per host? Did you integrate with systemd? You need to look at the system logs and try to find out why Elasticsearch was being shut down, as per David. It's maybe a memory issue, but that's just a guess, and it's for sure better to actually know, as it could be something entirely unrelated to anything on this thread.
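Whether or not systemd is involved, the journal and kernel log are a reasonable place to start. A rough sketch, assuming a standard CentOS 8 setup:

```
# Any mention of the elasticsearch processes being stopped or killed today
journalctl --since "today" | grep -iE 'elasticsearch|killed process|out of memory'

# Kernel messages (OOM killer, hardware issues) with readable timestamps
dmesg -T | grep -iE 'killed process|out of memory'
```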

@DavidTurner
Could you let me know how to check this?

I did not find any ERROR or WARN logs before the Native controller issue

@RainTown

I am running the Elasticsearch instances in the background using '&' for each node

@RainTown @DavidTurner @Christian_Dahlqvist

I am not sure. But yesterday I changed to a 7GB heap for each node and observed that the cluster stayed up a few hours longer, until about 6 hours, and then went down
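For reference, with a tar.gz install one way to set the heap per instance is via the environment at start time (paths and file names here are only placeholders):

```
# Start one instance with a 7GB heap, daemonized, writing a pid file
ES_JAVA_OPTS="-Xms7g -Xmx7g" ./bin/elasticsearch -d -p node1.pid
```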

Could you check this?

grep -i kill /var/log/messages*
host kernel: Out of Memory: Killed process 2592 (xxx).

Maybe the OOM killer killed the process?

I am seeing OOM logs from around 20 days back. They are from Nov 21, not recent ones.

OK, but even that is a bit suggestive? Can you post the exact message matched? Also note some of the files in /var/log/ might be compressed.
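For the rotated, compressed files something like this covers both (pattern widened a little):

```
# zgrep also reads the rotated .gz files
zgrep -iE 'killed process|out of memory|oom' /var/log/messages*
```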

Also, can you share the elasticsearch.yml files for both instances. Are the elastic processes running as the same (local) user? If running as the "elasticsearch" user, then please share the output of

ps -uelasticsearch --cols 9999 -opid,ppid,stime,etime,rss,command

replace elasticsearch by whichever user or users you are using.

What about stdout and stderr for these background processes, where do they go? Can you share the full command you use to start the instances please?
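For comparison, a typical way to background a tar.gz instance while keeping its stdout/stderr is shown below; the paths and file names are only placeholders:

```
# Hypothetical start command, adjust per instance
cd /opt/elasticsearch-node1
nohup ./bin/elasticsearch > logs/stdout.log 2>&1 &
echo $! > node1.pid
```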

If you start with less heap, then there is more "general" memory available for all processes, so that might mean it takes longer to reach the point where it crashes than if started with a larger heap. But there are 101 other factors too. It is a bit suggestive of course.

When you write "then went down", can you be more specific? Does one instance stop? Is there a cascade and all the nodes stop? Something else?
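When it happens again, a quick snapshot from each host would help answer that. A minimal sketch, assuming the default HTTP port on at least one surviving instance:

```
# Which instances are still in the cluster?
curl -s 'localhost:9200/_cat/nodes?v&h=name,ip,node.role,heap.percent,uptime'

# Which elasticsearch processes are still alive on this host?
ps -ef | grep '[e]lasticsearch'
```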
