I have an Elasticsearch cluster with 3 master nodes, 1 data node, 1 ingest node, and 1 coordinating node. I also have 1 Kibana instance and 1 Fleet Server. Data ingestion is performed through approximately 12 Elastic Agents managed by Fleet. In the Fleet settings in Kibana, the Fleet Server host points to the Fleet Server, and the output is configured to send data to the coordinating node.
Currently, the cluster has around 800 shards, although disk usage is low (about 50 GB on the data node), so this does not appear to be a storage capacity issue. The average ingestion volume is around 60,000 documents per minute across all agents.
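To put those numbers in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes the 50 GB is spread evenly across all 800 shards, which is probably not exactly true (some shards may be empty or unassigned).

```python
# Figures taken from the description above.
total_shards = 800
disk_used_gb = 50
docs_per_minute = 60_000

# Assumption: disk usage is spread evenly across all shards.
avg_shard_size_mb = disk_used_gb * 1024 / total_shards
docs_per_second = docs_per_minute / 60

print(f"average shard size: {avg_shard_size_mb:.0f} MB")  # average shard size: 64 MB
print(f"ingest rate: {docs_per_second:.0f} docs/s")       # ingest rate: 1000 docs/s
```

An average of roughly 64 MB per shard is far below the 10 GB to 50 GB range that Elastic's sizing guidance generally aims for, so the shard count looks high relative to the data volume.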
The issue is that after 2–3 days of continuous operation, the coordinating node crashes (typically associated with memory pressure), causing instability in the cluster. I am trying to determine whether the root cause is related to the ingestion architecture (all traffic going through the coordinating node), the high number of shards, or a combination of both factors.
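One way I could try to narrow this down is to track JVM heap usage per node over those 2–3 days and see whether only the coordinating node climbs. Below is a minimal sketch using the `_nodes/stats/jvm` API. The helper names and the base URL are hypothetical, and it assumes an endpoint reachable without authentication, which will not be the case on a secured cluster.

```python
import json
from urllib.request import urlopen


def fetch_jvm_stats(base_url: str) -> dict:
    """Fetch JVM stats for all nodes (GET /_nodes/stats/jvm).

    base_url is a placeholder; no authentication or TLS handling is done here.
    """
    with urlopen(f"{base_url}/_nodes/stats/jvm") as resp:
        return json.load(resp)


def heap_by_node(stats: dict) -> list[tuple[str, list[str], int]]:
    """Return (name, roles, heap_used_percent) per node, highest heap first.

    A coordinating-only node shows up with an empty roles list.
    """
    rows = [
        (node["name"], node.get("roles", []), node["jvm"]["mem"]["heap_used_percent"])
        for node in stats["nodes"].values()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Example usage (hypothetical address):
#   for name, roles, heap in heap_by_node(fetch_jvm_stats("http://localhost:9200")):
#       print(f"{name:20} roles={roles} heap={heap}%")
```

Running this on a schedule and logging the output would show whether heap on the coordinating node grows steadily while the other nodes stay flat, which would point more toward the ingestion path than toward the shard count.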
