We are using Elasticsearch version 8.13. We have continuous heavy aggregation queries, and they are making the whole cluster unstable.

We have a heavy aggregation load that trips circuit breakers quite often. I want to handle this beforehand, as the frequent circuit breakers are making our whole cluster unstable, and many times they take down the node holding the .security index, which leaves the whole cluster unable to authenticate. Is there any strategy or solution where anyone has predicted beforehand which queries may lead to a circuit breaker, and then rejected such queries based on the current cluster condition?

"Unstable cluster" here means a growing number of unassigned shards and nodes leaving and rejoining the cluster due to failed GC and hprof dumps (among many other repercussions of circuit breakers tripping on many nodes).

What is the full output of the cluster stats API?

What is the hardware specification of the cluster?

Hello @christian, attaching my cluster stats API output:

{
  "_nodes": {
    "total": 337,
    "successful": 337,
    "failed": 0
  },
  "cluster_name": "DC01",
  "cluster_uuid": "TxcEnmxHTEiKXr7SWgdvug",
  "timestamp": 1748579374009,
  "status": "yellow",
  "indices": {
    "count": 755,
    "shards": {
      "total": 29567,
      "primaries": 10121,
      "replication": 1.9213516450943582,
      "index": {
        "shards": {
          "min": 1,
          "max": 45,
          "avg": 39.16158940397351
        },
        "primaries": {
          "min": 1,
          "max": 15,
          "avg": 13.405298013245034
        },
        "replication": {
          "min": 0,
          "max": 2,
          "avg": 1.8184547461368603
        }
      }
    },
    "docs": {
      "count": 861331304181,
      "deleted": 2644087
    },
    "store": {
      "size_in_bytes": 1539496814254091,
      "total_data_set_size_in_bytes": 1539496814254091,
      "reserved_in_bytes": 0
    },
    "fielddata": {
      "memory_size_in_bytes": 1518415044496,
      "evictions": 0,
      "global_ordinals": {
        "build_time_in_millis": 278984857
      }
    },
    "query_cache": {
      "memory_size_in_bytes": 60264068695,
      "total_count": 10449242006,
      "hit_count": 1411351636,
      "miss_count": 9037890370,
      "cache_size": 13428322,
      "cache_count": 15178684,
      "evictions": 1750362
    },
    "completion": {
      "size_in_bytes": 0
    },
    "segments": {
      "count": 1283657,
      "memory_in_bytes": 0,
      "terms_memory_in_bytes": 0,
      "stored_fields_memory_in_bytes": 0,
      "term_vectors_memory_in_bytes": 0,
      "norms_memory_in_bytes": 0,
      "points_memory_in_bytes": 0,
      "doc_values_memory_in_bytes": 0,
      "index_writer_memory_in_bytes": 481948384,
      "version_map_memory_in_bytes": 36471297,
      "fixed_bit_set_memory_in_bytes": 72990016,
      "max_unsafe_auto_id_timestamp": 1748578940966,
      "file_sizes": {}
    },
    "mappings": {
      "total_field_count": 241973,
      "total_deduplicated_field_count": 13089,
      "total_deduplicated_mapping_size_in_bytes": 82765,
      "field_types": [
        {
          "name": "alias",
          "count": 76,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "binary",
          "count": 5,
          "index_count": 5,
          "script_count": 0
        },
        {
          "name": "boolean",
          "count": 175,
          "index_count": 40,
          "script_count": 0
        },
        {
          "name": "byte",
          "count": 1,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "constant_keyword",
          "count": 6,
          "index_count": 2,
          "script_count": 0
        },
        {
          "name": "date",
          "count": 5032,
          "index_count": 713,
          "script_count": 0
        },
        {
          "name": "date_nanos",
          "count": 1,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "date_range",
          "count": 7,
          "index_count": 7,
          "script_count": 0
        },
        {
          "name": "double",
          "count": 2684,
          "index_count": 677,
          "script_count": 0
        },
        {
          "name": "double_range",
          "count": 1,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "flattened",
          "count": 39,
          "index_count": 6,
          "script_count": 0
        },
        {
          "name": "float",
          "count": 16179,
          "index_count": 687,
          "script_count": 0
        },
        {
          "name": "float_range",
          "count": 1,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "geo_point",
          "count": 1363,
          "index_count": 673,
          "script_count": 0
        },
        {
          "name": "geo_shape",
          "count": 1,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "half_float",
          "count": 78,
          "index_count": 22,
          "script_count": 0
        },
        {
          "name": "integer",
          "count": 173,
          "index_count": 19,
          "script_count": 0
        },
        {
          "name": "integer_range",
          "count": 1,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "ip",
          "count": 4068,
          "index_count": 675,
          "script_count": 0
        },
        {
          "name": "ip_range",
          "count": 1,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "keyword",
          "count": 83256,
          "index_count": 718,
          "script_count": 0
        },
        {
          "name": "long",
          "count": 40821,
          "index_count": 706,
          "script_count": 0
        },
        {
          "name": "long_range",
          "count": 1,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "match_only_text",
          "count": 270,
          "index_count": 3,
          "script_count": 0
        },
        {
          "name": "nested",
          "count": 92,
          "index_count": 21,
          "script_count": 0
        },
        {
          "name": "object",
          "count": 8867,
          "index_count": 712,
          "script_count": 0
        },
        {
          "name": "scaled_float",
          "count": 24,
          "index_count": 6,
          "script_count": 0
        },
        {
          "name": "shape",
          "count": 1,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "short",
          "count": 22,
          "index_count": 8,
          "script_count": 0
        },
        {
          "name": "text",
          "count": 78651,
          "index_count": 691,
          "script_count": 0
        },
        {
          "name": "version",
          "count": 10,
          "index_count": 10,
          "script_count": 0
        },
        {
          "name": "wildcard",
          "count": 66,
          "index_count": 3,
          "script_count": 0
        }
      ],
      "runtime_field_types": []
    },
    "analysis": {
      "char_filter_types": [],
      "tokenizer_types": [],
      "filter_types": [],
      "analyzer_types": [],
      "built_in_char_filters": [],
      "built_in_tokenizers": [],
      "built_in_filters": [],
      "built_in_analyzers": []
    },
    "versions": [
      {
        "version": "8.9.0",
        "index_count": 755,
        "primary_shard_count": 10121,
        "total_primary_bytes": 525537020857082
      }
    ],
    "search": {
      "total": 305339,
      "queries": {
        "bool": 304775,
        "prefix": 137998,
        "match": 7916,
        "range": 253596,
        "nested": 3,
        "wildcard": 1,
        "match_phrase": 203755,
        "terms": 209400,
        "match_phrase_prefix": 22,
        "match_all": 59278,
        "exists": 45442,
        "term": 63008,
        "simple_query_string": 4923
      },
      "sections": {
        "highlight": 54210,
        "runtime_mappings": 217,
        "stored_fields": 67222,
        "query": 304786,
        "script_fields": 67222,
        "pit": 148,
        "_source": 8084,
        "terminate_after": 27,
        "docvalue_fields": 60726,
        "fields": 651,
        "collapse": 2406,
        "aggs": 215320
      }
    }
  },
  "nodes": {
    "count": {
      "total": 337,
      "coordinating_only": 32,
      "data": 0,
      "data_cold": 0,
      "data_content": 300,
      "data_frozen": 0,
      "data_hot": 150,
      "data_warm": 150,
      "index": 0,
      "ingest": 0,
      "master": 5,
      "ml": 0,
      "remote_cluster_client": 0,
      "search": 0,
      "transform": 0,
      "voting_only": 0
    },
    "versions": [
      "8.9.0"
    ],
    "os": {
      "available_processors": 12252,
      "allocated_processors": 12252,
      "names": [
        {
          "name": "Linux",
          "count": 337
        }
      ],
      "pretty_names": [
        {
          "pretty_name": "Ubuntu 22.04.2 LTS",
          "count": 337
        }
      ],
      "architectures": [
        {
          "arch": "amd64",
          "count": 337
        }
      ],
      "mem": {
        "total_in_bytes": 22480602079232,
        "adjusted_total_in_bytes": 22480602079232,
        "free_in_bytes": 2112391725056,
        "used_in_bytes": 20368210354176,
        "free_percent": 9,
        "used_percent": 91
      }
    },
    "process": {
      "cpu": {
        "percent": 304
      },
      "open_file_descriptors": {
        "min": 7997,
        "max": 12615,
        "avg": 11178
      }
    },
    "jvm": {
      "max_uptime_in_millis": 16464766193,
      "versions": [
        {
          "version": "20.0.2",
          "vm_name": "OpenJDK 64-Bit Server VM",
          "vm_version": "20.0.2+9-78",
          "vm_vendor": "Oracle Corporation",
          "bundled_jdk": true,
          "using_bundled_jdk": true,
          "count": 337
        }
      ],
      "mem": {
        "heap_used_in_bytes": 5848185362104,
        "heap_max_in_bytes": 11450382811136
      },
      "threads": 82369
    },
    "fs": {
      "total_in_bytes": 4143449653825536,
      "free_in_bytes": 2397869976317952,
      "available_in_bytes": 2395968995168256
    },
    "plugins": [],
    "network_types": {
      "transport_types": {
        "security4": 337
      },
      "http_types": {
        "security4": 337
      }
    },
    "discovery_types": {
      "multi-node": 337
    },
    "packaging_types": [
      {
        "flavor": "default",
        "type": "deb",
        "count": 337
      }
    ],
    "ingest": {
      "number_of_pipelines": 10,
      "processor_stats": {
        "date": {
          "count": 0,
          "failed": 0,
          "current": 0,
          "time_in_millis": 0
        },
        "dot_expander": {
          "count": 0,
          "failed": 0,
          "current": 0,
          "time_in_millis": 0
        },
        "foreach": {
          "count": 0,
          "failed": 0,
          "current": 0,
          "time_in_millis": 0
        },
        "geoip": {
          "count": 0,
          "failed": 0,
          "current": 0,
          "time_in_millis": 0
        },
        "json": {
          "count": 0,
          "failed": 0,
          "current": 0,
          "time_in_millis": 0
        },
        "pipeline": {
          "count": 0,
          "failed": 0,
          "current": 0,
          "time_in_millis": 0
        },
        "remove": {
          "count": 0,
          "failed": 0,
          "current": 0,
          "time_in_millis": 0
        },
        "rename": {
          "count": 0,
          "failed": 0,
          "current": 0,
          "time_in_millis": 0
        },
        "script": {
          "count": 0,
          "failed": 0,
          "current": 0,
          "time_in_millis": 0
        },
        "set": {
          "count": 0,
          "failed": 0,
          "current": 0,
          "time_in_millis": 0
        },
        "set_security_user": {
          "count": 0,
          "failed": 0,
          "current": 0,
          "time_in_millis": 0
        },
        "uri_parts": {
          "count": 0,
          "failed": 0,
          "current": 0,
          "time_in_millis": 0
        },
        "user_agent": {
          "count": 0,
          "failed": 0,
          "current": 0,
          "time_in_millis": 0
        }
      }
    },
    "indexing_pressure": {
      "memory": {
        "current": {
          "combined_coordinating_and_primary_in_bytes": 0,
          "coordinating_in_bytes": 0,
          "primary_in_bytes": 0,
          "replica_in_bytes": 0,
          "all_in_bytes": 0
        },
        "total": {
          "combined_coordinating_and_primary_in_bytes": 0,
          "coordinating_in_bytes": 0,
          "primary_in_bytes": 0,
          "replica_in_bytes": 0,
          "all_in_bytes": 0,
          "coordinating_rejections": 0,
          "primary_rejections": 0,
          "replica_rejections": 0
        },
        "limit_in_bytes": 0
      }
    }
  },
  "snapshots": {
    "current_counts": {
      "snapshots": 0,
      "shard_snapshots": 0,
      "snapshot_deletions": 0,
      "concurrent_operations": 0,
      "cleanups": 0
    },
    "repositories": {}
  }
}

As for hardware specifications: for both HOT and WARM nodes we are using NVMe disks of around 11TB, so there is no doubt the hardware is high end.

It looks like you are using version 8.9.0 and not 8.13 as you stated in the subject of the thread. This is getting quite old, and I do not know if there have been improvements that would affect your use case in newer versions. I will leave that for someone with more inside knowledge to comment on.

If I calculate correctly, it looks like you have an average shard size of around 50GB, which looks good. With 29567 shards it does look like the data nodes on average hold just below 5TB of data. I assume warm nodes hold considerably more, as you have a hot-warm architecture. In total that is about 1.5PB to potentially aggregate over, which is a lot. As we do not know anything about the queries in terms of nature, frequency, and how large a part of the total data set they cover, it is hard for me to give any recommendations. Maybe someone else has some ideas?
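For anyone wanting to reproduce the arithmetic, here is a quick back-of-envelope check against the numbers in the cluster stats output above (the variable names are mine, not from the API):

```python
# Values copied from the _cluster/stats output posted above.
store_bytes = 1_539_496_814_254_091   # indices.store.size_in_bytes (~1.5 PB)
total_shards = 29_567                 # indices.shards.total
data_nodes = 300                      # data_content node count

avg_shard_gib = store_bytes / total_shards / 1024**3
data_per_node_tib = store_bytes / data_nodes / 1024**4

print(f"average shard size: {avg_shard_gib:.1f} GiB")      # roughly 48 GiB, i.e. ~50 GB
print(f"data per data node: {data_per_node_tib:.2f} TiB")  # just under 5 TiB
```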


Hi @KunwarAkanksha

"nodes": {
    "count": {
      "total": 337,
      "coordinating_only": 32,
      "data": 0,
      "data_cold": 0,
      "data_content": 300,
      "data_frozen": 0,
      "data_hot": 150,
      "data_warm": 150,

First, that is an impressive cluster. Kudos to you and your team:

  • 337 nodes
  • 755 indices
  • ~2 TB / index
  • ~50 GB / shard
  • ~1.5 PB storage

Addressing this will take a lot of detail and deep diving, and you and your team are clearly smart, so I am sure you have already looked at the scope and scale of the searches and aggregations, etc., as @Christian_Dahlqvist suggested.

Newer versions definitely help with scale and stability as well.

In addition, perhaps transforms could pre-aggregate data to help with the heavy aggregation queries, but that is just a thought with no facts/data backing it.
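As a concrete illustration of that idea (not from the thread — the index names and fields below are hypothetical placeholders), a continuous transform that pre-aggregates raw events into a much smaller daily summary index might look roughly like this:

```json
PUT _transform/daily_rollup_example
{
  "source": { "index": "raw-events-*" },
  "dest": { "index": "daily-rollup-example" },
  "sync": { "time": { "field": "@timestamp", "delay": "60s" } },
  "pivot": {
    "group_by": {
      "day": {
        "date_histogram": { "field": "@timestamp", "calendar_interval": "1d" }
      },
      "host": { "terms": { "field": "host.name" } }
    },
    "aggregations": {
      "avg_latency": { "avg": { "field": "latency_ms" } },
      "event_count": { "value_count": { "field": "@timestamp" } }
    }
  }
}
```

Dashboards and heavy aggregations could then query the rollup index instead of the raw data.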

One thought: what I am seeing in the field and in practice (and I see many of these clusters) is a trend toward more, smaller clusters rather than fewer, larger clusters, linked together with CCS (cross-cluster search), which is also a Basic/free feature.

There are pros and cons to this, but for most of my users the pros significantly outweigh the cons.

6 x 50-node clusters may be more stable, usable, flexible, and manageable than 1 x 300-node cluster. Besides stability, upgrades happen faster, issues have a smaller blast radius, etc... just a thought...
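For reference, wiring clusters together for CCS is mostly a matter of registering remote clusters and prefixing index names with the cluster alias (the alias and hostname below are placeholders):

```json
PUT _cluster/settings
{
  "persistent": {
    "cluster.remote.dc01_logs.seeds": [ "some-node-in-dc01:9300" ]
  }
}

GET dc01_logs:my-index-*/_search
{
  "query": { "match_all": {} }
}
```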

Here is a blog about a telemetry use case, which may not match yours but is pretty interesting...


I'd be curious how many version updates you have done. The confusion between claimed 8.13 and actual 8.9 is also ... interesting.

How many of the 755 indices are being heavily written to simultaneously?

How many of the 755 indices are your "continuous heavy aggregations" hitting simultaneously?

Are these "continuous heavy aggregations" only hitting the indices on hot nodes?

What are the main differences in spec between your warm and hot nodes, and what is the split in data between the warm and hot tiers? You've thrown quite a lot of data nodes (aka $$$) at the warm tier. In your words, why?

The presumably frequent data migration between tiers does not cause any issues?

btw, you asked:

Well, at some level Elasticsearch tries to do a little bit of that itself; there are some situations where it simply rejects queries. But in the general case, that's a very hard ask for a general, all-situations-covered strategy/solution. I'm curious as to why the question: do you have some hunch that it is "bad/dumb/unwise" queries causing your instability? If so, what is the basis of that hunch? If not, and we can presume they are sensible queries with a solid business case behind them, then you just need to find a way to service them.
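To expand on the "Elasticsearch tries to do a little of that itself" point: a few dynamic settings act as guardrails against runaway aggregations, and long-running searches can be cancelled via the task management API. The values below are illustrative examples, not recommendations for this cluster:

```json
PUT _cluster/settings
{
  "persistent": {
    "search.max_buckets": 20000,
    "indices.breaker.request.limit": "50%",
    "search.default_search_timeout": "30s"
  }
}

POST _tasks/_cancel?actions=*search*
```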


Will try this; I think this will be helpful for the aggregations and the types of queries we have.
Sorry for the confusion, this cluster is on 8.9 only.