Hello, I'm hoping someone can give me more insight into my issue:
I have an ES cluster with 4 data-only nodes (4 cores/32 GB RAM each). Under heavy aggregation load (mostly large Kibana dashboards with multiple complex visualizations over long time ranges), heap usage crosses 95% and nodes crash.
I have the chance to scale either by adding a 5th identical data node, or by doubling the RAM on the 4 existing nodes (to 4 cores/64 GB each).
We use doc_values extensively and enforce fairly strict limits on fielddata cache size via config, so I don't believe this is a fielddata issue.
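For reference, this is roughly what we have in elasticsearch.yml — the percentages below are illustrative placeholders, not our exact values:

```
# Cap the fielddata cache so it cannot grow unbounded (placeholder value)
indices.fielddata.cache.size: 20%

# Circuit breaker that rejects fielddata loads before they exhaust the heap (placeholder value)
indices.breaker.fielddata.limit: 40%
```

So fielddata should be bounded well below the heap ceiling, which is why I suspect the pressure is coming from the aggregations themselves rather than the cache.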
I'm not sure where to look next, or which scaling strategy would be more effective for this use case, and I'd greatly appreciate any advice.