Elastic cluster is going down after 2-3 hours

If I calculate correctly, that is an average shard size of around 130MB, which is very small when it comes to Elasticsearch. If your data has a long retention period I would recommend that you switch to monthly indices instead of daily, which should bring the average size up a bit.

If your retention period is too short to warrant monthly indices I suspect you have a lot of index patterns, and in that case I would probably advise you to look into combining indices.
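If you want to double-check the numbers yourself, the _cat APIs will show shard and index sizes directly. A minimal sketch, assuming the default localhost:9200 endpoint:

```
# List shard sizes, largest first, to confirm the average really is ~130MB
curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,store&s=store:desc'

# Per-index totals, useful for spotting small indices worth combining
curl -s 'localhost:9200/_cat/indices?v&h=index,pri,rep,docs.count,store.size&s=store.size:desc'
```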

Sure @Christian_Dahlqvist

I will switch to monthly indices and try that.

I recall running some tests a long time back where I tried to put as much data on a single node as possible. If I remember correctly there was a higher proportion of memory overhead for small shards compared to larger ones. If this is still the case (a lot has however changed in Elasticsearch since I ran that test), having fewer, larger shards may decrease heap pressure and improve stability.

If you have indices that are no longer written to, you can also try to merge these into monthly indices through reindexing.

I agree with @Christian_Dahlqvist on moving to monthly indices, but I still do not see how the 6-node cluster across 3 physical nodes benefits you over a 3-node cluster. What's the logic here? Are there 2 separate IO paths that the 2 different instances can utilise?

Elara / Kore / Europa were "dm", i.e. data nodes and master-eligible

Amalthea / Kale / Callisto were "d", i.e. just data nodes

So I hope you have Elara / Kore / Europa on different physical nodes, otherwise your cluster cannot survive a node failure.

How can I merge existing indices into monthly indices?
And as far as I know, an index might store around 10GB for the application under test. How do I avoid the memory pressure if I use monthly indices?

You can use the reindex API.
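For example, something along these lines (the index names are just placeholders, adjust them to your own naming):

```
# Merge one month of daily indices into a single monthly index
curl -s -X POST 'localhost:9200/_reindex?wait_for_completion=false' \
  -H 'Content-Type: application/json' -d '
{
  "source": { "index": "blabla-2025-11-*" },
  "dest":   { "index": "blabla-2025-11" }
}'
# This returns a task id; check progress with GET _tasks/<task_id>
```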

You are still querying the same data set and have memory pressure from querying all the daily indices. I suspect a single 10GB shard may require less memory overhead than 30 corresponding daily shards of 333MB each. Note that 10GB is not large when it comes to Elasticsearch, as the recommended shard size for log and metrics use cases is generally between 30GB and 50GB.

@RainTown

We will drill down to 1 node per machine if required to eliminate the issue. But we are looking for a smooth way so that we do not lose data by doing so. Do you have any solution for this? I would like to convert it to a 3-node cluster with 1 node per machine.

OK @Christian_Dahlqvist

It's actually quite easy: just eject all the data from a node, wait until the shards all relocate elsewhere, then shut it down. You will have a green, 5-node cluster. It might take a long time, but it's not a complex procedure. Rinse and repeat 2 more times. But you need to be a bit careful that primary and replica shards don't end up on the same (physical) host. This may be a problem already, unless you have configured otherwise.
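A sketch of the settings involved, using one of your node names as an example:

```
# Drain one instance before shutting it down (node name is illustrative)
curl -s -X PUT 'localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.routing.allocation.exclude._name": "Amalthea"
  }
}'
# Wait until GET _cat/shards shows nothing left on that node, then stop it.

# While two instances share a machine, this setting keeps primary and
# replica copies off the same physical host:
curl -s -X PUT 'localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.routing.allocation.same_shard.host": true
  }
}'
```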

Personally, from what you have shared, I'd prioritize the monthly indices first. And that's less easy tbh. Because of my very first response:

If all the blabla-2025-11-* indices were effectively frozen, you could just reindex them into, say, blabla-2025-11, wait for that to complete, and eventually just delete all the blabla-2025-11-* indices. But if during that reindex you are still getting new documents into those existing indices, in sort of random date order, then that's a bit harder. And it seems you do have that issue!

So is reindexing the suggested approach, or should I follow some other process?

Good question. Yes, I would still do it. You can block writes to the blabla-2025-11-* indices too. But you know your data flows: why are logs, if they are logs, from 26 Nov being ingested on 10 Dec? Doesn't that indicate an issue somewhere else?
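For the write block, something like the following should work (index pattern as in the example above; the _block API exists in recent versions, on older ones use the index setting instead):

```
# Block writes on the old daily indices before reindexing them
curl -s -X PUT 'localhost:9200/blabla-2025-11-*/_block/write'

# Alternative via index settings
curl -s -X PUT 'localhost:9200/blabla-2025-11-*/_settings' \
  -H 'Content-Type: application/json' -d '{ "index.blocks.write": true }'
```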

EDIT: if you start 6 months back and do July 2025 first, I hope this is less of an issue. But without downtime it's going to be hard to do piecemeal if data and indices are just appearing in semi-random order.

All this discussion about sizing & sharding & memory usage and so on is good but it’s not really answering the OP’s question. Why is this node going down?

This sequence of messages indicates a normal process exit: the node is shutting down gracefully because it’s being asked to shut down by some external influence sending it a signal (usually a SIGTERM but possibly something else, depending on how exactly the process is being managed).

It doesn’t matter how much you change any of the aspects of this cluster in the ways discussed above, it will always exit if asked to do so as we are seeing here.


You are absolutely right. Maybe if set up differently, meant in a generic way, it won't be gracefully shut down every 2-3 hours. But indeed, the cart (what to do) got a bit ahead of the horse (what is actually wrong in your specific case).

The question was asked but rather skipped.

@sathish12

Your system is reported as CentOS 8. You said you used the .tar.gz file, so how do you start/stop the 2x Elasticsearch instances per host? Did you integrate with systemd? You need to look at the system logs and try to find out why Elasticsearch was being shut down, as per David. It's maybe a memory issue, but that's just a guess, and it's for sure better to actually know, as it could be something entirely unrelated to anything on this thread.
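Whether or not systemd is involved, the journal and kernel log are a reasonable place to start. A rough sketch, assuming a standard CentOS 8 setup:

```
# Any mention of the elasticsearch processes being stopped or killed today
journalctl --since "today" | grep -iE 'elasticsearch|killed process|out of memory'

# Kernel messages (OOM killer, hardware issues) with readable timestamps
dmesg -T | grep -iE 'killed process|out of memory'
```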

@DavidTurner
Could you let me know how to check this?

I did not find any ERROR or WARN logs before the Native controller issue

@RainTown

I am running the Elasticsearch instances in the background using '&' for each node

@RainTown @DavidTurner @Christian_Dahlqvist

I am not sure. But yesterday I changed to a 7GB heap for each node and observed that the cluster stayed up a few hours longer, until about 6 hours, and then went down
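For reference, with a tar.gz install one way to set the heap per instance is via the environment at start time (paths and file names here are only placeholders):

```
# Start one instance with a 7GB heap, daemonized, writing a pid file
ES_JAVA_OPTS="-Xms7g -Xmx7g" ./bin/elasticsearch -d -p node1.pid
```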

Could you check this?

grep -i kill /var/log/messages*
host kernel: Out of Memory: Killed process 2592 (xxx).

Maybe the OOM killer killed the process?

I am seeing OOM logs from around 20 days back. They are from Nov 21, not recent ones.

OK, but even that is a bit suggestive? Can you post the exact message matched? Also note some of the files in /var/log/ might be compressed.
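For the rotated, compressed files something like this covers both (pattern widened a little):

```
# zgrep also reads the rotated .gz files
zgrep -iE 'killed process|out of memory|oom' /var/log/messages*
```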

Also, can you share the elasticsearch.yml files for both instances. Are the elastic processes running as the same (local) user? If running as the "elasticsearch" user, then please share the output of

ps -uelasticsearch --cols 9999 -opid,ppid,stime,etime,rss,command

replace elasticsearch by whichever user or users you are using.

What about stdout and stderr for these background processes, where do they go? Can you share the full command you use to start the instances please?
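For comparison, a typical way to background a tar.gz instance while keeping its stdout/stderr is shown below; the paths and file names are only placeholders:

```
# Hypothetical start command, adjust per instance
cd /opt/elasticsearch-node1
nohup ./bin/elasticsearch > logs/stdout.log 2>&1 &
echo $! > node1.pid
```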

If you start with less heap, then there is more "general" memory available for all processes, so that might mean it takes longer to reach the point where it crashes than if started with a larger heap. But there are 101 other factors too. It is a bit suggestive of course.

When you write "then went down", can you be more specific? Does one instance stop? Is there a cascade and all the nodes stop? Something else?
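When it happens again, a quick snapshot from each host would help answer that. A minimal sketch, assuming the default HTTP port on at least one surviving instance:

```
# Which instances are still in the cluster?
curl -s 'localhost:9200/_cat/nodes?v&h=name,ip,node.role,heap.percent,uptime'

# Which elasticsearch processes are still alive on this host?
ps -ef | grep '[e]lasticsearch'
```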
