We have several large Elasticsearch v1.7 clusters. Please don't ask why - we are in the process of upgrading to v7, but until then we need some hints to make this one stable.
Recently one of the clusters started crashing several times per day and we see some strange heap behavior.
Other clusters work as expected with a similar write rate.
2 query nodes
50 data nodes
Every node has a 30G heap (50% of the VM's RAM) and uses ConcMarkSweepGC
Write rate is ~200k docs/min with a small number of queries.
There are ~1500 indices with up to 16 shards.
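For scale, the numbers above imply a very large shard count per node. A back-of-the-envelope sketch (the replica count and the exact shards-per-index are assumptions, since the post only says "up to 16 shards"):

```python
# Rough shard math for the cluster described above.
# Assumptions (not stated in the post): every index has 16 primary
# shards and 1 replica; the real numbers may be lower.
indices = 1500
shards_per_index = 16
replicas = 1
data_nodes = 50

total_shards = indices * shards_per_index * (1 + replicas)
shards_per_node = total_shards / data_nodes

print(total_shards)     # 48000
print(shards_per_node)  # 960.0
```

Each shard is a full Lucene index with its own heap overhead, and the 1.x cluster state enumerates every shard, so counts in this range put pressure on both data-node heaps and the master.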
We have tried many things and nothing has helped so far (different merge policies and index refresh settings). At some point, GC could not free memory and the heap started growing.
As an additional observation, the _cat API is very slow and it's mostly impossible to get even _cat/nodes (on the other clusters it works fine).
As already mentioned, version 1.7 is very old. I have not used it in years, so I do not remember it very well, apart from the fact that it was hard to control heap usage and that propagating the cluster state was slow and inefficient, which made running large clusters with lots of shards harder than it is nowadays. I do have a few comments though:
You should always have 3 master-eligible nodes, as this is required for a highly available cluster. You also need to make sure that discovery.zen.minimum_master_nodes is set to 2 in order to avoid split-brain scenarios and data loss.
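In elasticsearch.yml that would look roughly like this on each of the 3 master-eligible nodes (a sketch for 1.x zen discovery; the quorum formula is floor(master_eligible / 2) + 1):

```yaml
# elasticsearch.yml on each dedicated master-eligible node (1.x)
node.master: true
node.data: false
# with 3 master-eligible nodes: floor(3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2
```

Note this setting must be updated whenever the number of master-eligible nodes changes.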
How many shards do you have in the cluster? Are these time-based? How often do you create and/or delete indices in the cluster? Are you using dynamic mappings?
If you have time-based indices and some are no longer written to it makes sense to forcemerge these down to a single segment in order to reduce heap usage.
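On 1.x the API for this is still called _optimize (it was renamed to _forcemerge in 2.x). A sketch with a placeholder index name, to be run only against indices that are no longer written to, since merged segments are undone by further indexing:

```shell
# Merge a read-only time-based index down to one segment per shard.
# "logs-2015.06.01" is a placeholder; substitute your own old indices.
curl -XPOST 'http://localhost:9200/logs-2015.06.01/_optimize?max_num_segments=1'
```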
I believe the _cat APIs often involve the master node, which could very well be overloaded given the size of the cluster and the shard count.
A 30G heap sounds reasonable, but make sure you are still using compressed object pointers - the JVM silently disables them somewhere just below 32G, which wastes a large part of the heap.
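One way to check (assuming the same JVM binary the nodes run on is on the PATH):

```shell
# Prints whether the JVM keeps compressed oops at a 30G max heap.
# Expect "bool UseCompressedOops := true"; at -Xmx32g and above it
# flips to false, roughly halving the number of objects that fit.
java -Xmx30g -XX:+PrintFlagsFinal -version 2>/dev/null | grep -i usecompressedoops
```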
We are in the process of migrating the cluster to the latest version, but it takes time and we need a stable cluster to do it. As mentioned, there are 6 similar clusters that behave as expected, and we can't figure out what happened with this one recently.
At some point, GC could not free memory, and the heap started growing.
@warkolm We know that we need to upgrade, but this is what we have now and we need to stabilize it first. We keep adding nodes, but it's still unstable and we see the same heap behavior.
What would be your suggestion for troubleshooting this issue to understand what is causing such GC behavior even on query nodes?
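A few low-risk starting points that work on 1.7 (sketches against a hypothetical localhost node; these per-node APIs bypass the slow _cat endpoints):

```shell
# See what the busy threads are actually doing during a heap spike:
curl 'http://localhost:9200/_nodes/hot_threads'

# Heap usage, GC counts, and fielddata/segment memory per node:
curl 'http://localhost:9200/_nodes/stats/jvm,indices?pretty'

# For CMS, enable GC logging via ES_JAVA_OPTS so long pauses and
# failed concurrent collections show up with timestamps:
#   -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
#   -Xloggc:/var/log/elasticsearch/gc.log
```

If fielddata memory in the stats output is large and keeps growing, that points at aggregations/sorting on fields without doc_values, which matches the symptom described below.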
We encountered similar memory issues on version 1.7 when running aggregations on fields that did not have doc_values enabled. In earlier versions, field data is loaded onto the heap by default, which can cause exactly this kind of memory problem.
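On 1.x, doc_values has to be enabled explicitly in the mapping and only works for not_analyzed string fields (and numeric/date fields). A sketch with hypothetical index, type, and field names; it only applies to newly indexed data, so existing indices would need a reindex:

```shell
# Hypothetical names throughout. Fields aggregated or sorted on
# should be not_analyzed with doc_values enabled, so the values
# live on disk instead of in on-heap fielddata.
curl -XPUT 'http://localhost:9200/myindex' -d '{
  "mappings": {
    "mytype": {
      "properties": {
        "status": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        }
      }
    }
  }
}'
```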