4 data nodes struggling with multiple simultaneous heavy aggregations. How to scale?

Hello, hoping someone can give me some more insight on my issue:

I have an ES cluster with 4 data-only nodes (4 cores/32GB RAM each). Under heavy aggregation load (mostly large Kibana dashboards with multiple complex visualizations over longer time frames), the heap crosses 95% used and one or more nodes crash.

I have the chance to scale either by adding a 5th identical data node or by doubling the RAM on the 4 existing nodes (to 4 cores/64GB RAM each).

We use doc_values extensively and have pretty strict limits (by config) on fielddata cache size, so I don't believe it is a fielddata issue.

I'm not sure where to look next or which scaling strategy will be most effective for this use case, and I would greatly appreciate any advice on either.

Thanks!

This will be a case of scaling horizontally more than anything.

In this case it may make sense to scale up to 64GB of RAM before scaling out as it will give you more available heap space compared to adding an additional node.
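The heap comparison behind that suggestion can be sketched with some rough arithmetic. This assumes the heap is set to 50% of system RAM and kept under the compressed-oops threshold of roughly 32GB; the 31GB cap and the function names are illustrative assumptions, not values taken from this cluster.

```python
# Rough heap arithmetic for the two options. All numbers are assumptions.
COMPRESSED_OOPS_CAP_GB = 31  # assumed safe ceiling just under ~32GB


def heap_per_node_gb(ram_gb, cap_gb=COMPRESSED_OOPS_CAP_GB):
    """Heap at 50% of RAM, capped below the compressed-oops threshold."""
    return min(ram_gb / 2, cap_gb)


def cluster_heap_gb(nodes, ram_gb):
    """Total heap across identical nodes."""
    return nodes * heap_per_node_gb(ram_gb)


current = cluster_heap_gb(4, 32)    # 4 nodes x 16GB heap
scale_out = cluster_heap_gb(5, 32)  # 5 nodes x 16GB heap
scale_up = cluster_heap_gb(4, 64)   # 4 nodes x 31GB heap (capped)

print(current, scale_out, scale_up)
```

Under these assumptions, adding a node leaves per-node heap at 16GB, while doubling RAM raises it to about 31GB per node. That difference matters most if a single heavy request is what pushes one node over its limit.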

So you think that increasing the ability to distribute the queries across nodes by 20% won't alleviate heap pressure with the same level of effectiveness?

@warkolm just above you said basically the exact opposite, so I'm just trying to get all my ducks in a row, so to speak.

Is that 32GB heap or total RAM?
If it's the former then you can't scale vertically any more; if it's the latter then what Christian said can apply.

Thanks! Can you explain a bit why you think adding the 5th node (20% greater distribution of queries) would be better than doubling the available resources on each existing node?

Total. As per the docs, I'm running each node with a ~16GB heap (50% of total system RAM).

I misunderstood your original post.

Thank you for taking the time to respond and clarify. Does that mean you agree with @Christian_Dahlqvist's assessment that scaling vertically in this case is likely to make the most sense?

Do your existing 4 CPU cores and disks have room? If so, and it makes business sense to just increase the memory, then yes.
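One way to check for that headroom is to summarize the output of the node stats API (`GET /_nodes/stats`). The sketch below works on an already-parsed response. The field paths used (`jvm.mem.heap_used_percent`, `os.cpu.percent`, `fs.total.*`) are assumed to be present and may differ between Elasticsearch versions; the sample payload and its values are hypothetical.

```python
def node_headroom(stats):
    """Summarize heap, CPU, and disk headroom per node from a parsed
    GET /_nodes/stats response (a dict). Missing fields yield None."""
    summary = {}
    for node in stats.get("nodes", {}).values():
        fs_total = node.get("fs", {}).get("total", {})
        total = fs_total.get("total_in_bytes")
        avail = fs_total.get("available_in_bytes")
        disk_free = None
        if total and avail is not None:
            disk_free = round(100 * avail / total, 1)
        summary[node.get("name", "unknown")] = {
            "heap_used_pct": node.get("jvm", {}).get("mem", {}).get("heap_used_percent"),
            "cpu_pct": node.get("os", {}).get("cpu", {}).get("percent"),
            "disk_free_pct": disk_free,
        }
    return summary


# Hypothetical sample payload, trimmed to the fields read above.
sample = {
    "nodes": {
        "abc123": {
            "name": "data-1",
            "jvm": {"mem": {"heap_used_percent": 93}},
            "os": {"cpu": {"percent": 35}},
            "fs": {"total": {"total_in_bytes": 1000, "available_in_bytes": 400}},
        }
    }
}

headroom = node_headroom(sample)
print(headroom)
```

If CPU and disk show room while heap is the only metric near its limit, that would be consistent with adding memory first.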