I have an Elasticsearch cluster with 3 master nodes, 1 data node, 1 ingest node, and 1 coordinating node. I also have 1 Kibana instance and 1 Fleet Server. Data ingestion is performed through approximately 12 Elastic Agents managed by Fleet. In the Fleet settings in Kibana, the Fleet Server host points to the Fleet Server, and the output is configured to send data to the coordinating node.
Currently, the cluster has around 800 shards, although disk usage is low (about 50 GB on the data node), so this does not appear to be a storage capacity issue. The average ingestion volume is around 60,000 documents per minute across all agents.
The issue is that after 2–3 days of continuous operation, the coordinating node crashes (typically associated with memory pressure), causing instability in the cluster. I am trying to determine whether the root cause is related to the ingestion architecture (all traffic going through the coordinating node), the high number of shards, or a combination of both factors.
Unfortunately there's not really a way to answer this from the information provided. We have clusters running for months under much heavier load without seeing any problems like this. You will need to look at the heap dump to work out what consumed all the memory.
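Until a heap dump is available, it may help to record heap usage per node over the 2–3 days leading up to a crash, to see whether the coordinating node's heap climbs steadily or jumps suddenly. The sketch below is only an illustration: it assumes the cluster is reachable over plain HTTP without authentication (a secured cluster would need credentials and TLS), and the function names, node names, and 75% threshold are hypothetical. It relies on the `_cat/nodes` API with the `name`, `node.role`, and `heap.percent` columns.

```python
import json
import urllib.request


def fetch_heap_rows(base_url: str) -> list[dict]:
    """Fetch name, roles and heap usage for every node via _cat/nodes."""
    url = f"{base_url}/_cat/nodes?h=name,node.role,heap.percent&format=json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def nodes_over(rows: list[dict], threshold: int) -> list[str]:
    """Return names of nodes whose heap usage is at or above threshold (%)."""
    # The _cat APIs return every value as a string, so convert before comparing.
    return [row["name"] for row in rows if int(row["heap.percent"]) >= threshold]


# Hypothetical sample shaped like a _cat/nodes JSON response,
# used here so the example runs without a live cluster.
sample = [
    {"name": "coord-1", "node.role": "-", "heap.percent": "87"},
    {"name": "data-1", "node.role": "d", "heap.percent": "41"},
]
print(nodes_over(sample, 75))  # prints ['coord-1']
```

Against a live cluster, `nodes_over(fetch_heap_rows("http://localhost:9200"), 75)` could be run on a schedule and the results logged. This does not replace the heap dump, which is still what shows what is actually holding the memory.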
This setup is definitely not resilient and seems overly complex for your needs. You would be better served with 3 nodes that just do everything.
I’ve been reviewing my cluster configuration in more detail, and I noticed something that might be relevant. Currently, my cluster has around 800 shards in total, while the actual data volume is relatively small (around 50 GB on the data node). This results in very small shard sizes.
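As a rough check on how small the shards are, the average shard size can be worked out from the two figures quoted above. This is a sketch under stated assumptions: that all of the roughly 800 shards are assigned to the single data node, and that 1 GB = 1024 MB. Actual shard sizes will vary a lot around the average.

```python
# Rough average shard size from the figures quoted in this thread.
# Assumption: all ~800 shards sit on the single data node (~50 GB used).
TOTAL_DATA_GB = 50
SHARD_COUNT = 800


def average_shard_size_mb(total_gb: float, shard_count: int) -> float:
    """Return the mean shard size in MB (1 GB = 1024 MB)."""
    return total_gb * 1024 / shard_count


avg_mb = average_shard_size_mb(TOTAL_DATA_GB, SHARD_COUNT)
print(f"Average shard size: {avg_mb:.0f} MB")  # prints "Average shard size: 64 MB"
```

An average of about 64 MB per shard is well below the 10 GB to 50 GB range usually suggested in Elastic's shard sizing guidance.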
From what I’ve been reading, this seems far from recommended shard sizing guidelines, and I’m starting to suspect that the high shard count could be putting additional pressure on heap memory—especially on the coordinating node, which is handling all ingestion traffic.
Would it be reasonable to consider the number of shards as a primary root cause of the memory issues I’m experiencing on the coordinating node?
Also, as context: I’m relatively new to Elasticsearch, and this is actually my first real-world cluster deployment after graduating, so I’m still building a solid understanding of best practices. I’d really appreciate any guidance on whether I should prioritize reducing shard count versus redesigning the node roles.
It's not ideal and is worth fixing, but it wouldn't explain the symptoms you described in the OP. If it were going to fail because of this, it would do so immediately, not after 2–3 days.