Upper limit on cluster state

I have spent a good amount of time reading blogs and documentation, both on elastic.co and other websites, to understand the scalability limits of Elasticsearch.

It looks like the cluster state is one of the critical data structures that needs to be monitored. Since each node keeps a copy of the cluster state, its size needs to be limited (number of indices, shards, fields per index, etc.). I was wondering if there is an upper limit on the size of the cluster state that we should be monitoring.

We have over 3k indices in the cluster, but the cluster state is only about 20MB, which seems pretty reasonable given that only the delta is forwarded to other nodes within the cluster in a compressed format. Any pointers would be really useful. Thank you.

I'm not sure if there is a hard upper limit, but I can share a terrible experience I had while trying to troubleshoot someone's 1.7 cluster three weeks ago. Though it had many problems including severe memory pressure, the biggest problem seemed to be that the cluster state wasn't getting propagated in a timely manner.

curl -XGET 'http://localhost:9200/_cluster/pending_tasks?pretty'
{
  "tasks" : [ {
    "insert_order" : 14670,
    "priority" : "HIGH",
    "source" : "update-mapping [terrible_index][result] / node [3TMP9qc7S-SXjr4leMpyKQ], order [1]",
    "executing" : true,
    "time_in_queue_millis" : 270763,
    "time_in_queue" : "4.5m"
  }, {
    "insert_order" : 14671,
    "priority" : "HIGH",
    "source" : "update-mapping [terrible_index][result] / node [3TMP9qc7S-SXjr4leMpyKQ], order [2]",
    "executing" : false,
    "time_in_queue_millis" : 269696,
    "time_in_queue" : "4.4m"
  }, ... ]
}

In https://www.elastic.co/guide/en/elasticsearch/guide/current/_pending_tasks.html, I found this tidbit:

Since a cluster can have only one master, only one node can ever process cluster-level metadata changes. For 99.9999% of the time, this is never a problem. The queue of metadata changes remains essentially zero.

In some _rare_ clusters, the number of metadata changes occurs faster than the master can process them. 

Imagine my dismay when the queue I was looking at took 45 minutes to empty.

That index turned out to be pretty small at just 1100 documents. But the mapping JSON was 4 million lines long and growing. I deleted the index and everything has been green since. This cluster still has several hundred indices and ~3K shards, but I haven't seen any pending cluster tasks anymore.

So my pointer would be to check _cluster/pending_tasks periodically to ensure your cluster isn't in the 0.0001%. :wink:
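A minimal sketch of such a check, assuming a cluster reachable at localhost:9200 and an arbitrary alert threshold (both are assumptions to adjust for your environment):

```shell
#!/bin/sh
# Sketch: alert when the pending cluster-task queue grows.
# ES_HOST and THRESHOLD are assumptions, not fixed values.
ES_HOST="${ES_HOST:-http://localhost:9200}"
THRESHOLD=5

# Each queued task carries exactly one "insert_order" key, so counting
# occurrences of that key counts the tasks in the queue.
count_pending() {
  grep -o '"insert_order"' | wc -l
}

pending=$(curl -s "$ES_HOST/_cluster/pending_tasks" | count_pending)
if [ "$pending" -gt "$THRESHOLD" ]; then
  echo "ALERT: $pending pending cluster tasks (threshold $THRESHOLD)"
fi
```

Dropping something like this into cron every minute would have caught that 45-minute queue long before it got that bad.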

Excessive shard and index counts, field counts, mapping changes, and sparse data are all things that can blow out the cluster state.

I don't know if there is a specific size to worry about though. It was mainly the older 1.x/2.x series that had problems with large states.

@loren Thank you so much for sharing your experience. I will keep an eye on the pending tasks and add an alert for it.

4 million lines sounds pretty crazy. I believe newer versions send only a diff of the cluster state, in compressed form, instead of the full state. See more on it here. BTW, can you answer a few more questions so that I can understand your scenario a bit better and compare it with mine?

  • How many nodes did you have in the cluster?
  • What is/was the hardware configuration?
  • Do you remember the cluster state size?
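For anyone else wanting to track this, here is a rough sketch for checking the serialized cluster state size; the host is an assumption, and the number you get is the uncompressed JSON over the wire, so treat it as an upper bound rather than the on-heap footprint:

```shell
#!/bin/sh
# Sketch: report the serialized cluster state size in MB.
# ES_HOST is an assumption; adjust for your environment.
ES_HOST="${ES_HOST:-http://localhost:9200}"

# Convert a byte count on stdin to a human-readable MB figure.
bytes_to_mb() {
  awk '{ printf "%.1fMB\n", $1 / (1024 * 1024) }'
}

curl -s "$ES_HOST/_cluster/state" | wc -c | bytes_to_mb
```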

@warkolm Thank you, Mark. Does this mean that the new version (we use v5.4) handles cluster state better because of the optimizations? Are the factors you mentioned (shard + index count, field count) still causes for concern if the cluster state is reasonably small (maybe 100MB or so)? Our cluster has 5-10 data nodes with 16 cores and 30GB of RAM. It also has 3 dedicated master nodes with 4 cores and 7GB of RAM.

How many shards on your cluster?

About 24k as of now.

@animageofmine The cluster had 14 data nodes (r4.xlarge) with 16GB heaps. About 400 indices over 3000 shards. I don't recall the cluster state size specifically, but I remember seeing 14GB+ in the old generations. Everything was on-heap. The nodes were constantly trying to garbage collect but not getting anywhere, so CPU stayed at 100%. Meanwhile, that 4 million line mapping would get updated, sometimes repeatedly in a short period, and the cluster task list would just grow and grow as it tried to propagate the cluster state. Every setting you're not supposed to touch had been customized quite a bit, so it was difficult to tease apart the symptoms.

That's wwwwwwaaaaaaaaaaaaayyyyyyyyyyyyyy too many and will be causing you problems, irrespective of the cluster state size.

@loren Thank you so much for the details. Definitely helps me out with the design.

@warkolm That sounds like a red flag. Would you mind elaborating why would it cause problems? What else should I be looking at other than cluster state to determine the scalability limits? BTW, the cluster seems to be doing alright.

Each shard is a Lucene instance and requires a certain amount of heap to remain active. Elasticsearch will try to remove indices that aren't being used from active memory, but what happens if someone decides to run a query that ranges over all your data?

Essentially, you are wasting heap.

You should look to use _shrink to reduce the count of older indices that don't need many shards.
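For reference, a sketch of the _shrink flow described in the docs; the index and node names here are hypothetical, and the target primary count must divide the source count evenly:

```shell
#!/bin/sh
# Sketch of the _shrink flow; index and node names are hypothetical.
ES_HOST="${ES_HOST:-http://localhost:9200}"

# The target primary count must be a factor of the source count;
# tiny helper to sanity-check that before shrinking.
divides_evenly() {
  [ $(( $1 % $2 )) -eq 0 ] && echo yes || echo no
}

# 1) Relocate a copy of every shard to one node and block writes.
curl -s -XPUT "$ES_HOST/old_index/_settings" -d '{
  "index.routing.allocation.require._name": "shrink_node_name",
  "index.blocks.write": true
}'

# 2) Shrink into a new index with fewer primaries.
curl -s -XPOST "$ES_HOST/old_index/_shrink/old_index_shrunk" -d '{
  "settings": { "index.number_of_shards": 1 }
}'
```

The shrunk index still has to be swapped into whatever aliases or queries pointed at the original, so this is only the resizing step, not the whole migration.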

Does this mean that heap size is a good indicator of pressure on the cluster? We have a GC tool that cleans up indices that haven't been accessed in the last N days, because we don't need to persist stale data.

What happens if we have too many small shards? Would increasing the heap size help in this case? I am trying to determine if there are any viable options that would allow us to continue using Elasticsearch for our use case (too many small indices since it is controlled by our customers). One of our plans is to build multiple small clusters, but that was more to reduce the blast radius.
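To put numbers on the "too many small shards" question, here is a rough sketch that lists undersized shards via _cat/shards; the host and the 50MB threshold are both assumptions:

```shell
#!/bin/sh
# Sketch: list shards smaller than ~50MB.
# ES_HOST and the 50MB threshold are assumptions, not recommendations.
ES_HOST="${ES_HOST:-http://localhost:9200}"

# Columns requested: index name, shard number, primary/replica, store bytes.
curl -s "$ES_HOST/_cat/shards?h=index,shard,prirep,store&bytes=b" |
  awk '$4 != "" && $4 + 0 < 50 * 1024 * 1024 { print $1, $2, $3, $4 }'
```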

Have a look at:

You can increase heap, but you are still wasting some.

If you have the need for lots of little indices, maybe rethink things:

  • is the need driven by time-based data?
  • can that be fulfilled with weekly/monthly/yearly indices for super small datasets?
  • if it is time-based, why not use _shrink? You have nothing to lose.

Having multiple clusters makes a lot of sense; it's something we see users end up with after a massive, monolithic cluster becomes a lot of work to maintain. That's the whole idea behind Elastic Cloud Enterprise actually :slight_smile:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.