If I calculate correctly, that is an average shard size of around 130MB, which is very small when it comes to Elasticsearch. If your data has a long retention period I would recommend that you switch to monthly indices instead of daily, which should bring the average size up a bit.
If your retention period is too short to warrant monthly indices, I suspect you have a lot of index patterns, and in that case I would probably advise you to look into combining indices.
I recall running some tests a long time back where I tried to put as much data on a single node as possible. If I remember correctly, there was a higher proportion of memory overhead for small shards compared to larger ones. If this is still the case (a lot has, however, changed in Elasticsearch since I ran that test…), having fewer, larger shards may decrease heap pressure and improve stability.
If you have indices that are no longer written to, you can also try to merge these into monthly indices through reindexing.
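To verify those shard sizes on your own cluster, a quick hedged check with the cat shards API (the column list here is just one convenient combination):

```
# List shards with their on-disk size, largest first.
curl "localhost:9200/_cat/shards?v&h=index,shard,prirep,store,node&s=store:desc"
```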
I agree with @Christian_Dahlqvist on moving to monthly indices, but I still do not see how the 6-node cluster across 3 physical nodes benefits you over a 3-node cluster. What's the logic here? Are there 2 separate IO paths that the 2 different instances can utilise?
Elara / Kore / Europa were "dm", i.e. data nodes and master-eligible
Amalthea / Kale / Callisto were "d", i.e. just data nodes
So I hope you have Elara / Kore / Europa on different physical nodes, otherwise your cluster cannot survive a node failure.
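A quick way to double-check the roles and placement is the cat nodes API; the master column shows * next to the elected master:

```
# Node name, role letters (d, m, ...), elected-master marker and IP per node.
curl "localhost:9200/_cat/nodes?v&h=name,node.role,master,ip"
```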
How can I merge existing indices into monthly indices?
And as far as I know the index might store around 10GB for the application under test. How do I avoid the memory pressure if I use monthly indices?
You are still querying the same data set and have memory pressure from querying all the daily indices. I suspect a single 10GB shard may require less memory overhead than 30 corresponding daily shards of 333MB each. Note that 10GB is not large when it comes to Elasticsearch, as the recommended shard size for log and metrics use cases is generally between 30GB and 50GB.
We will drop down to 1 node per machine if required to eliminate the issue, but we are looking for a smooth way so that we do not lose data by doing so. Do you have any solution for this? I would like to convert it to a 3-node cluster by running 1 node on each machine.
It's actually quite easy: just eject all the data from a node, wait until the shards all relocate elsewhere, then shut it down. You will have a green, 5-node cluster. It might take a long time, but it's not a complex procedure. Rinse and repeat 2 more times. But you need to be a bit careful that primary and replica shards don't end up on the same (physical) host. This may be a problem already, unless you have configured otherwise.
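A minimal sketch of the "eject" step, assuming allocation filtering by node name (the node name below is hypothetical):

```
# Drain one instance by excluding it from shard allocation.
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.exclude._name": "amalthea-2"
  }
}'

# Wait until the node holds no shards, then shut it down.
curl -s "localhost:9200/_cat/shards" | grep amalthea-2
```

For the primary/replica concern, shard allocation awareness on a custom node attribute (a node.attr.* value per physical machine, referenced via cluster.routing.allocation.awareness.attributes) is the usual way to keep both copies of a shard off the same physical host.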
Personally, from what you have shared, I'd prioritize the monthly indices first. And that's less easy, tbh, because of my very first response:
If all the blabla-2025-11-* indices were effectively frozen, you can just reindex them into, say, blabla-2025-11, wait for that to complete, and eventually delete all the blabla-2025-11-* indices. But if during that reindex you are still getting new documents into those existing indices, in sort of random date order, then that's a bit harder. And it seems you do have that issue!
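For the frozen case, a hedged sketch of that reindex (verify document counts before deleting the daily indices):

```
# Reindex all November daily indices into a single monthly index.
curl -X POST "localhost:9200/_reindex?wait_for_completion=false" -H 'Content-Type: application/json' -d'
{
  "source": { "index": "blabla-2025-11-*" },
  "dest":   { "index": "blabla-2025-11" }
}'

# The call returns a task id; poll it with the tasks API until it completes.
curl "localhost:9200/_tasks/<task_id>"
```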
Good question. Yes, I would still do it. You can block writes to the blabla-2025-11-* indices too (sketch below). But you know your data flows: why are logs, if they are logs, from 26 Nov being ingested on 10 Dec? Doesn't that indicate an issue somewhere else?
EDIT: if you start 6 months back, i.e. do July 2025 first, then I hope this is less of an issue. But without downtime it's going to be hard to do piecemeal if data and indices are just appearing in semi-random order.
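The write block mentioned above is just an index setting; a sketch, assuming you are happy to reject late documents for that month while the reindex runs:

```
# Make the November daily indices read-only for writes before reindexing them.
curl -X PUT "localhost:9200/blabla-2025-11-*/_settings" -H 'Content-Type: application/json' -d'
{ "index.blocks.write": true }'
```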
All this discussion about sizing & sharding & memory usage and so on is good, but it's not really answering the OP's question. Why is this node going down?
This sequence of messages indicates a normal process exit: the node is shutting down gracefully because it's being asked to shut down by some external influence sending it a signal (usually a SIGTERM but possibly something else, depending on how exactly the process is being managed).
It doesn't matter how much you change any of the aspects of this cluster in the ways discussed above; it will always exit if asked to do so, as we are seeing here.
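If the system logs don't show what is sending the signal, one option, assuming auditd is available (the key name es_kill is arbitrary), is to audit kill syscalls and see which process targeted the Elasticsearch PID:

```
# Record every kill() syscall under an arbitrary key.
auditctl -a always,exit -F arch=b64 -S kill -k es_kill

# After the next shutdown, look up who sent the signal and to which PID.
ausearch -k es_kill --start recent
```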
You are absolutely right. Maybe, if set up differently (meant in a generic way), it won't be gracefully shut down every 2-3 hours. But indeed, the cart (what to do) got a bit ahead of the horse (what is actually wrong in your specific case).
Your system is reported as CentOS 8. You said you used the .tar.gz file, so how do you start/stop the 2x Elasticsearch instances per host? Did you integrate with systemd? You need to look at the system logs and try to find out why Elasticsearch was being shut down, as per David. It may be a memory issue, but that's just a guess, and it's for sure better to actually know, as it could be something entirely unrelated to anything on the thread.
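A few hedged places to look on CentOS 8 (paths are the usual defaults; adjust to your setup):

```
# Kernel log: did the OOM killer terminate a Java process?
dmesg -T | grep -iE 'out of memory|killed process'

# System journal around the time an instance stopped.
journalctl --since "6 hours ago" | grep -iE 'elasticsearch|oom|sigterm'

# Rotated /var/log/messages files may be compressed; zgrep reads both.
zgrep -iE 'out of memory|killed process' /var/log/messages*
```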
I am not sure. But yesterday I changed to a 7GB heap for each node and I observed that the cluster stayed up a few hours longer, until about 6 hours, and then went down.
OK, but even that is a bit suggestive? Can you post the exact message matched? Also note some of the files in /var/log/ might be compressed.
Also, can you share the elasticsearch.yml files for both instances? Are the elastic processes running as the same (local) user? If running as the "elasticsearch" user, then please share the output of
replace elasticsearch by whichever user or users you are using.
What about stdout and stderr for these background processes: where do they go? Can you share the full command you use to start the instances, please?
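For reference, with the tar.gz distribution the usual patterns look something like this (paths are examples); where stdout/stderr end up depends entirely on how the process was launched:

```
# Daemonized via -d; most output goes to the configured log directory.
ES_PATH_CONF=/opt/es/instance1/config ./bin/elasticsearch -d -p /opt/es/instance1/es.pid

# Backgrounded with nohup; anything written to stdout/stderr lands in the redirect target.
nohup ./bin/elasticsearch > /opt/es/instance1/stdout.log 2>&1 &
```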
If you start with less heap, then there is more "general" memory available for all processes, so that might mean it takes longer to reach the point where it crashes than if started with a larger heap. But there are 101 other factors too. It is a bit suggestive of course.
When you write "then went down", can you be more specific? Does one instance stop? Is there a cascade and all the nodes stop? Something else?