Hot Nodes + Shards/Node assignment + Routing

I have a log ingestion use-case and I'm dealing with high JVM heap on 3 out of 24 data nodes in a cluster.

Indices are monthly, with new ones created throughout the month as they hit a size threshold, e.g. logs-YYYY-MM-000001.

Indices are configured w/24 shards, 2 replicas.

Documents are routed based on a Customer ID. (I want to avoid changing this)

As some customers have far more logs than others, we're finding that the same shard numbers are consistently larger across indices.

We're also seeing that Elasticsearch tends to place those same shard numbers on the same nodes across indices.

This is leading to very high doc counts and JVM heap usage on 3 of the 24 data nodes in the cluster.

I'm thinking that if ES spread the same shard numbers across different data nodes rather than co-locating them, load would be more balanced and we wouldn't have hot nodes due to our routing config.
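
In the short term I could manually relocate some of the heavy shard copies with the _cluster/reroute API. A rough sketch (index, shard, and node names here are illustrative, and my understanding is the balancer may shuffle things again afterward unless allocation settings are adjusted):

POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "logs-2022-09-000003",
        "shard": 15,
        "from_node": "data-node-04b.internal",
        "to_node": "data-node-11a.internal"
      }
    }
  ]
}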

We're also considering increasing index.routing_partition_size to >1, but I don't like that this would increase the number of shards queried per search.
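
For reference, my understanding is that would look roughly like this at index-creation time (index and type names below are placeholders), with each routed search then fanning out to 3 shards instead of 1:

PUT logs-2022-11-000001
{
  "settings": {
    "index.number_of_shards": 24,
    "index.number_of_replicas": 2,
    "index.routing_partition_size": 3
  },
  "mappings": {
    "log": {
      "_routing": { "required": true }
    }
  }
}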

Does anyone have critiques on this conclusion, and ideas on how to accomplish this?

Are you shooting your stack in the foot with the routing? How many rollovers per month, i.e. -000001 up to what? If very few, it might be better to roll over more frequently with far fewer shards per index. With fewer shards, routing would have less chance of hot-spotting some of them.
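
For example, something like this, where the write alias, the doc-count threshold, and the shard count are all placeholders (the rollover conditions available also depend on your version):

POST /logs-write/_rollover
{
  "conditions": {
    "max_docs": 100000000
  },
  "settings": {
    "index.number_of_shards": 8
  }
}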

Is it always the same 3 nodes? Do you have dedicated master nodes? How many total shards in the cluster? What is your heap/ram size?

Might be shooting ourselves in the foot with the routing, but changing that is more of a long-term thing.

~16 rollovers per month. As an example, below is September's shard data for one data node, which illustrates the shard + routing issue: the same orgs/shard numbers end up on this node across all of the month's indices. Within a month it is always the same 3 nodes, although the issues (GC, JVM heap pressure) don't arise until the last 2 weeks of the month, I believe because of app query patterns. I'm not sure yet whether it's the same 3 nodes from month to month; I need to dig back through some historical detail to get a better idea.

We have a pool of dedicated master-eligible nodes. Total of ~6750 shards, about 280 per data node. Heap/RAM is 32GB/256GB per node.
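
For anyone wanting to reproduce this, the listing below is the sort of thing you get from roughly the following, filtered down to a single node:

GET _cat/shards/logs-2022-09-*?v&h=index,shard,prirep,state,docs,store,ip,node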

logs-2022-09-000001                   18    r      STARTED  38673842   70.5gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000001                   19    p      STARTED  18652819     34gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000001                   23    r      STARTED  12225618   22.7gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000002                   15    r      STARTED  94739930  154.5gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000002                   18    r      STARTED  24689088     45gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000002                   20    p      STARTED   8346789     14gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000003                   15    r      STARTED 107257117  175.6gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000003                   18    r      STARTED  27338983   50.8gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000003                   20    p      STARTED   8917165   15.4gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000004                   15    r      STARTED 112205462  184.1gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000004                   18    r      STARTED  26746714   49.3gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000004                   20    p      STARTED  10205093   17.4gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000005                   15    r      STARTED  91799082  149.6gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000005                   18    r      STARTED  27603492   49.9gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000005                   20    p      STARTED   9401615   15.8gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000006                   15    r      STARTED 117559879  192.1gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000006                   18    r      STARTED  23863079   44.3gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000006                   20    p      STARTED   8789412   14.9gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000007                   15    r      STARTED 103834796  170.8gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000007                   18    r      STARTED  27331186     50gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000007                   20    p      STARTED  10653341     18gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000008                   15    r      STARTED  92967483  152.1gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000008                   18    r      STARTED  29427191   53.8gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000008                   20    p      STARTED  10336157   17.6gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000009                   15    r      STARTED 103273260  167.4gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000009                   18    r      STARTED  21887027   39.2gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000009                   20    p      STARTED   8716463   14.3gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000010                   15    r      STARTED  98266274  161.5gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000010                   18    r      STARTED  27390698   50.5gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000010                   20    p      STARTED  11312554   19.2gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000011                   15    r      STARTED  92432462  151.5gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000011                   18    r      STARTED  28762697     53gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000011                   20    p      STARTED   9973043   16.8gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000012                   15    r      STARTED  87085700  143.9gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000012                   18    r      STARTED  28207923   51.5gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000012                   20    p      STARTED  11265062   18.7gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000013                   15    r      STARTED  99812438  163.1gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000013                   18    r      STARTED  23719618   42.6gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000013                   20    p      STARTED   9565145   15.9gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000014                   15    r      STARTED  92335750  150.9gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000014                   18    r      STARTED  29989057   54.6gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000014                   20    p      STARTED  10850662   18.3gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000015                   15    r      STARTED  94448091  154.6gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000015                   18    r      STARTED  27495348   50.6gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000015                   20    p      STARTED  11505124     19gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000016                   15    r      STARTED  28778010   47.4gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000016                   18    r      STARTED  12800972   23.5gb 10.1.2.3  data-node-04b.internal
logs-2022-09-000016                   20    p      STARTED   5571068    9.2gb 10.1.2.3  data-node-04b.internal

What version of Elasticsearch? It's odd that it hits in the last part of the month. Is old data moving to a warm/cold tier or being aged off? Do you force merge shards that are no longer being written to?

What is the output from the _cluster/stats?pretty&human API?

  • Our ES version is old, but we're working to upgrade. I know this adds a big variable.
  • No data tiering or movement to warm/cold. Data is deleted from cluster after ~6 months.
  • Non-current indices are force-merged.
  • One issue we're dealing with that may be related: we were creating a fresh snapshot repo (S3-backed) at the beginning of every month, and by EOM we'd have >800 snapshots in that repo, significantly degrading snapshot performance. That has been fixed for October: one long-lived repo, plus logic to delete snapshots and keep the count under 50 (sketch below).
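
The cleanup side is just the standard snapshot APIs, something along these lines (repo, bucket, and snapshot names are placeholders):

# one long-lived S3-backed repo, registered once
PUT _snapshot/logs-backups
{
  "type": "s3",
  "settings": {
    "bucket": "example-log-snapshots"
  }
}

# list existing snapshots, then delete the oldest until fewer than 50 remain
GET _snapshot/logs-backups/_all

DELETE _snapshot/logs-backups/snapshot-2022-10-01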

Here is output from _cluster/stats:

{
  "_nodes" : {
    "total" : 36,
    "successful" : 36,
    "failed" : 0
  },
  "cluster_name" : "cluster",
  "timestamp" : 1664880527851,
  "status" : "green",
  "indices" : {
    "count" : 112,
    "shards" : {
      "total" : 6921,
      "primaries" : 2299,
      "replication" : 2.0104393214441063,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 72,
          "avg" : 61.794642857142854
        },
        "primaries" : {
          "min" : 1,
          "max" : 24,
          "avg" : 20.526785714285715
        },
        "replication" : {
          "min" : 1.0,
          "max" : 32.0,
          "avg" : 2.25
        }
      }
    },
    "docs" : {
      "count" : 20007624527,
      "deleted" : 46734007
    },
    "store" : {
      "size" : "101.1tb",
      "size_in_bytes" : 111224919772367,
      "throttle_time" : "0s",
      "throttle_time_in_millis" : 0
    },
    "fielddata" : {
      "memory_size" : "12.2gb",
      "memory_size_in_bytes" : 13183287448,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "15.3gb",
      "memory_size_in_bytes" : 16514077608,
      "total_count" : 58118164221,
      "hit_count" : 27773552851,
      "miss_count" : 30344611370,
      "cache_size" : 552739,
      "cache_count" : 48375813,
      "evictions" : 47823074
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 84559,
      "memory" : "164.6gb",
      "memory_in_bytes" : 176749193869,
      "terms_memory" : "127.3gb",
      "terms_memory_in_bytes" : 136765097675,
      "stored_fields_memory" : "23.8gb",
      "stored_fields_memory_in_bytes" : 25562813472,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "5.2mb",
      "norms_memory_in_bytes" : 5453376,
      "points_memory" : "1.7gb",
      "points_memory_in_bytes" : 1839953534,
      "doc_values_memory" : "11.7gb",
      "doc_values_memory_in_bytes" : 12575875812,
      "index_writer_memory" : "790.6mb",
      "index_writer_memory_in_bytes" : 829097505,
      "version_map_memory" : "3.1mb",
      "version_map_memory_in_bytes" : 3323883,
      "fixed_bit_set" : "0b",
      "fixed_bit_set_memory_in_bytes" : 0,
      "max_unsafe_auto_id_timestamp" : -1,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 36,
      "data" : 33,
      "coordinating_only" : 0,
      "master" : 3,
      "ingest" : 36
    },
    "versions" : [
      "5.3.3"
    ],
    "os" : {
      "available_processors" : 462,
      "allocated_processors" : 462,
      "names" : [
        {
          "name" : "Linux",
          "count" : 36
        }
      ],
      "mem" : {
        "total" : "3.4tb",
        "total_in_bytes" : 3828682395648,
        "free" : "104.1gb",
        "free_in_bytes" : 111856631808,
        "used" : "3.3tb",
        "used_in_bytes" : 3716825763840,
        "free_percent" : 3,
        "used_percent" : 97
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 75
      },
      "open_file_descriptors" : {
        "min" : 1095,
        "max" : 1880,
        "avg" : 1672
      }
    },
    "jvm" : {
      "max_uptime" : "290.4d",
      "max_uptime_in_millis" : 25092630336,
      "versions" : [
        {
          "version" : "1.8.0_312",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.312-b07",
          "vm_vendor" : "Amazon.com Inc.",
          "count" : 1
        },
        {
          "version" : "1.8.0_222",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.222-b10",
          "vm_vendor" : "Amazon.com Inc.",
          "count" : 35
        }
      ],
      "mem" : {
        "heap_used" : "583.6gb",
        "heap_used_in_bytes" : 626680627560,
        "heap_max" : "998.1gb",
        "heap_max_in_bytes" : 1071752282112
      },
      "threads" : 6413
    },
    "fs" : {
      "total" : "339.8tb",
      "total_in_bytes" : 373654192852992,
      "free" : "238.1tb",
      "free_in_bytes" : 261850661273600,
      "available" : "238.1tb",
      "available_in_bytes" : 261850661273600
    },
    "plugins" : [
      {
        "name" : "search-guard",
        "version" : "5.3.3-1",
        "description" : "Provide access control related features for Elasticsearch 5",
        "classname" : "com.floragunn.searchguard.SearchGuardPlugin"
      },
      {
        "name" : "discovery-ec2",
        "version" : "5.3.3",
        "description" : "The EC2 discovery plugin allows to use AWS API for the unicast discovery mechanism.",
        "classname" : "org.elasticsearch.plugin.discovery.ec2.Ec2DiscoveryPlugin"
      },
      {
        "name" : "repository-s3",
        "version" : "5.3.3",
        "description" : "The S3 repository plugin adds S3 repositories",
        "classname" : "org.elasticsearch.plugin.repository.s3.S3RepositoryPlugin"
      }
    ],
    "network_types" : {
      "transport_types" : {
        "com.floragunn.searchguard.ssl.http.netty.SearchGuardSSLNettyTransport" : 36
      },
      "http_types" : {
        "com.floragunn.searchguard.http.SearchGuardHttpServerTransport" : 36
      }
    }
  }
}

5.x, yeah, wow. I would start by upgrading, as there are literally years of improvements around allocation since then.
