Elasticsearch cluster is crashing often

We have several large v1.7 clusters. Please don't ask why :frowning: - we are in the process of upgrading to v7, but we need some hints to keep this one stable in the meantime.
Recently one of the clusters started crashing several times per day, and we see some strange heap behavior.
Other clusters work as expected with a similar write rate.

We have:

  • 2 masters
  • 2 query nodes
  • 50 data nodes

Every node has a 30G heap (50% of the VM's RAM) and uses ConcMarkSweepGC.
The write rate is ~200k docs/min with a small number of queries.
There are ~1500 indices with up to 16 shards each.

We have tried many things and nothing has helped so far (different merge policies and index refresh settings). At some point, GC can no longer free memory and the heap starts growing.

As an additional observation, the _cat API is very slow, and it's mostly impossible to get even _cat/nodes to respond (on the other clusters it works fine).

Thanks in advance

1.7 is well past EOL.

My only advice for such an old version would be to add more nodes until you are able to upgrade.

As already mentioned, version 1.7 is very old. I have not used it in years, so I do not remember it very well, apart from the fact that heap usage was hard to control and that propagating the cluster state was slow and inefficient, which made running large clusters with lots of shards harder than it is nowadays. I do have a few comments, though:

You should always have 3 master-eligible nodes, as this is required for a highly available cluster. You also need to make sure that discovery.zen.minimum_master_nodes is set to 2 in order to avoid split-brain scenarios and data loss.
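As a sketch of what that looks like on a 1.x cluster (assuming a node reachable on localhost:9200; adjust the host to your environment): with 3 master-eligible nodes, the quorum is (3 / 2) + 1 = 2.

```shell
# Option 1: set it in elasticsearch.yml on every master-eligible node:
#   discovery.zen.minimum_master_nodes: 2
# Option 2: on 1.x this setting is dynamic, so it can also be applied at runtime:
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "persistent": { "discovery.zen.minimum_master_nodes": 2 }
}'
```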

How many shards do you have in the cluster? Are these time-based? How often do you create and/or delete indices in the cluster? Are you using dynamic mappings?

If you have time-based indices that are no longer written to, it makes sense to force-merge these down to a single segment in order to reduce heap usage.
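For reference, on 1.x the force-merge API is called _optimize (it was only renamed to _forcemerge in later versions). A sketch, using a hypothetical read-only time-based index name:

```shell
# Merge a no-longer-written index down to a single segment.
# "logs-2019.01" is a placeholder; substitute your own index name.
curl -XPOST 'http://localhost:9200/logs-2019.01/_optimize?max_num_segments=1'
```

Only run this against indices that will not receive further writes, as merging is I/O-heavy and a subsequent write creates new segments anyway.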

I believe these APIs often involve the master nodes, which could very well be overloaded given the size of the cluster and the shard count.

This sounds good, but make sure you are using compressed pointers.

We are in the process of migrating the cluster to the latest version, but it takes time, and we need a stable cluster to do it. As mentioned, there are 6 similar clusters that behave as expected, and we can't figure out what happened to this one recently.

At some point, GC could not free memory, and the heap starts growing.

On a healthy cluster it looks like this:

@Christian_Dahlqvist there are ~20k shards, with ~450 shards per node.
We have 2 dedicated masters and an additional master-eligible node.

Are these time-based?

Some of them are immutable time-based indices, and some are mutable, with documents that can be updated over time.

How often do you create and/or delete indices in the cluster?

Not very often. This is a multi-tenant configuration with 13 indices per tenant, and they are created/deleted when we onboard/remove a tenant.

Are you using dynamic mappings?

Yes. Some large indices use dynamic mappings.

What is strange is that even the query nodes show a similar issue with the heap.

Just to reiterate what Christian said: this is a very old version with known issues around the efficient handling of large shard counts and mapping sizes.

Your best bet is to add more nodes until it stabilises, and then upgrade.

@warkolm We know that we need to upgrade, but this is what we have now, and we need to stabilize it first. We keep adding nodes, but it's still unstable and we see the same heap behavior.
What would be your suggestion for troubleshooting this issue to understand what is causing such GC behavior, even on the query nodes?

We encountered memory issues when using version 1.7 while running aggregations on fields without doc_values enabled; on earlier versions, fielddata for such fields is loaded onto the heap, which could cause memory issues.
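To illustrate the point above (a sketch with hypothetical index and field names, assuming a node on localhost:9200): on 1.x, doc_values must be enabled per field at index-creation time, and only on not_analyzed string fields, so that aggregations read from disk instead of on-heap fielddata.

```shell
# Create an index whose "status" field stores doc_values, so aggregating
# on it does not load fielddata onto the heap. Names are placeholders.
curl -XPUT 'http://localhost:9200/tenant-data' -d '{
  "mappings": {
    "doc": {
      "properties": {
        "status": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        }
      }
    }
  }
}'
```

Existing indices would need to be reindexed for this to take effect, since doc_values cannot be switched on for an already-indexed field.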

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.