Elasticsearch coordinating node OOM/crash under sustained ingest (60k docs/min) with high shard count (~800)

I have an Elasticsearch cluster with 3 master nodes (6 GB RAM each), 2 data nodes (16 GB RAM each), 1 ingest node (8 GB RAM), 1 coordinating node (16 GB RAM), and 1 transform node (6 GB RAM). Additionally, I have 1 Kibana instance (8 GB RAM) and 1 Fleet Server (6 GB RAM). Data is ingested by approximately 12 Elastic Agents managed by Fleet. In the Fleet settings in Kibana, the Fleet Server host points to the Fleet Server, and the output is configured to send data to the coordinating node, so all agent traffic passes through that single node.

Currently, the cluster has around 800 shards, although disk usage is low (about 50 GB total across the data nodes), so this does not appear to be a storage capacity issue. The average ingestion volume is around 60,000 documents per minute across all agents.
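To put those numbers in perspective, here is a quick back-of-the-envelope calculation. It uses only the figures above and assumes that the 800 shards and the 50 GB both count every shard copy (primaries and replicas) and that shards are spread evenly over the two data nodes; neither assumption has been verified against the cluster.

```python
# Figures taken from the description above.
total_shards = 800
total_disk_gb = 50
docs_per_minute = 60_000
data_nodes = 2

# Assumption: the 50 GB covers all 800 shard copies.
avg_shard_size_mb = total_disk_gb * 1024 / total_shards

# Assumption: shards are balanced evenly across the data nodes.
shards_per_data_node = total_shards / data_nodes

docs_per_second = docs_per_minute / 60

print(f"average shard size:   {avg_shard_size_mb:.0f} MB")
print(f"shards per data node: {shards_per_data_node:.0f}")
print(f"ingest rate:          {docs_per_second:.0f} docs/s")
```

So the shards are very small on average (tens of MB each), while the ingest rate works out to roughly 1,000 documents per second.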

The issue is that after 2–3 days of continuous operation, the coordinating node crashes, apparently from memory pressure, which destabilizes the cluster. I am trying to determine whether the root cause is the ingestion architecture (all traffic going through the coordinating node), the high number of shards, or a combination of both.
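For reference, this is a minimal sketch of how heap usage, parent circuit breaker trips, and write thread pool rejections could be sampled per node over the 2–3 day window. It reads the `jvm`, `breaker`, and `thread_pool` sections of the node stats API (`GET _nodes/stats/...`). The base URL is a placeholder, the helper names `fetch_node_stats` and `summarize` are my own, and it assumes an endpoint reachable without authentication; a secured cluster would need credentials and TLS handling added.

```python
import json
import urllib.request


def fetch_node_stats(base_url):
    """Fetch the jvm, breaker and thread_pool sections of the node stats API.

    base_url is a placeholder such as "http://localhost:9200";
    authentication and TLS are not handled in this sketch.
    """
    url = f"{base_url}/_nodes/stats/jvm,breaker,thread_pool"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def summarize(stats):
    """Reduce a node stats response to one small record per node."""
    rows = []
    for node in stats["nodes"].values():
        parent_breaker = node["breakers"]["parent"]
        write_pool = node["thread_pool"]["write"]
        rows.append({
            "name": node["name"],
            "heap_used_percent": node["jvm"]["mem"]["heap_used_percent"],
            "parent_breaker_tripped": parent_breaker["tripped"],
            "write_queue": write_pool["queue"],
            "write_rejected": write_pool["rejected"],
        })
    # Highest heap usage first, so the node under pressure is at the top.
    return sorted(rows, key=lambda r: r["heap_used_percent"], reverse=True)
```

Running `summarize(fetch_node_stats(...))` on a schedule and logging the result would show whether heap on the coordinating node climbs steadily before the crash, and whether breaker trips or write rejections appear first.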

Any guidance on how to properly diagnose this issue or recommendations on prioritizing between shard optimization and ingest architecture changes would be greatly appreciated.

Closed as duplicate of

Please have patience when you are new and posting