Heap dump without timeouts

Hi!

We're running nodes with a substantial amount of heap (150+ GB). I know how to dump the heap with jmap. However, dumping a heap this size on our machines will probably take several minutes (if I extrapolate linearly from our test cluster). While dumping the heap, the node will be unresponsive; at least that's what I see. As if a long stop-the-world GC is happening, which makes sense.
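For reference, this is roughly the jmap invocation I have in mind (the dump path and PID below are just placeholders; the live option forces a full GC first and can be dropped):

jmap -dump:live,format=b,file=/tmp/es-heap.hprof <elasticsearch-pid>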

Is there a way to dump the heap without the node being unresponsive in the process? Or is there a way to temporarily take it 'out of rotation' without (significantly) altering its memory footprint?

Thanks in advance!

Frens Jan

Welcome to our community! :smiley:

Can you elaborate on a) why you're running such huge heaps and b) why you're looking to take a heap dump?

Hi Mark, thanks for the warm welcome!

To be honest, I don't know. I wasn't involved in building this cluster. Actually, that was why I wanted to take a heap dump ... to find out why having such a huge heap is necessary. We have some 25 TB of primary data in 2500 primary shards in over 1200 indices on 3 machines. We're still on 7.12.

We have a hefty amount of fielddata, 15-20 GB per node. This explains some of the heap usage, but by no means all of it. I'm investigating whether that's all necessary (a bunch of text fields have fielddata enabled, probably unnecessarily). We quite often have circuit breaker issues. We're using the real-memory parent circuit breaker (the default).

So that's why I wanted to take a heap dump, to cut a little bit through the mist.

Can you share the output from the _cluster/stats?pretty&human API?

Sure:

{
  "_nodes": {
    "total": 3,
    "successful": 3,
    "failed": 0
  },
  "cluster_name": "...",
  "cluster_uuid": "...",
  "timestamp": 1673477524950,
  "status": "green",
  "indices": {
    "count": 1219,
    "shards": {
      "total": 6294,
      "primaries": 2366,
      "replication": 1.6601859678782755,
      "index": {
        "shards": {
          "min": 2,
          "max": 192,
          "avg": 5.163248564397047
        },
        "primaries": {
          "min": 1,
          "max": 64,
          "avg": 1.940935192780968
        },
        "replication": {
          "min": 1.0,
          "max": 2.0,
          "avg": 1.3478260869565217
        }
      }
    },
    "docs": {
      "count": 48714138187,
      "deleted": 2145319475
    },
    "store": {
      "size": "59tb",
      "size_in_bytes": 64879101222007,
      "reserved": "0b",
      "reserved_in_bytes": 0
    },
    "fielddata": {
      "memory_size": "44gb",
      "memory_size_in_bytes": 47332530488,
      "evictions": 49994
    },
    "query_cache": {
      "memory_size": "16.2gb",
      "memory_size_in_bytes": 17416747965,
      "total_count": 72122481678,
      "hit_count": 1792902510,
      "miss_count": 70329579168,
      "cache_size": 2734921,
      "cache_count": 16601976,
      "evictions": 13867055
    },
    "completion": {
      "size": "0b",
      "size_in_bytes": 0
    },
    "segments": {
      "count": 134746,
      "memory": "3gb",
      "memory_in_bytes": 3240487512,
      "terms_memory": "2gb",
      "terms_memory_in_bytes": 2152327168,
      "stored_fields_memory": "114.9mb",
      "stored_fields_memory_in_bytes": 120544152,
      "term_vectors_memory": "0b",
      "term_vectors_memory_in_bytes": 0,
      "norms_memory": "54.3mb",
      "norms_memory_in_bytes": 56943936,
      "points_memory": "0b",
      "points_memory_in_bytes": 0,
      "doc_values_memory": "868.4mb",
      "doc_values_memory_in_bytes": 910672256,
      "index_writer_memory": "10.5gb",
      "index_writer_memory_in_bytes": 11305844644,
      "version_map_memory": "313.7mb",
      "version_map_memory_in_bytes": 329037232,
      "fixed_bit_set": "12.3gb",
      "fixed_bit_set_memory_in_bytes": 13223038064,
      "max_unsafe_auto_id_timestamp": 1673129031081,
      "file_sizes": {}
    },
    "mappings": {
      "field_types": [
        {
          "name": "alias",
          "count": 13572,
          "index_count": 224
        },
        {
          "name": "binary",
          "count": 197,
          "index_count": 197
        },
        {
          "name": "boolean",
          "count": 10218,
          "index_count": 495
        },
        {
          "name": "date",
          "count": 12978,
          "index_count": 1198
        },
        {
          "name": "dense_vector",
          "count": 13,
          "index_count": 13
        },
        {
          "name": "double",
          "count": 4018,
          "index_count": 468
        },
        {
          "name": "flattened",
          "count": 1118,
          "index_count": 43
        },
        {
          "name": "float",
          "count": 2040,
          "index_count": 77
        },
        {
          "name": "geo_point",
          "count": 1678,
          "index_count": 465
        },
        {
          "name": "integer",
          "count": 704,
          "index_count": 704
        },
        {
          "name": "ip",
          "count": 7482,
          "index_count": 64
        },
        {
          "name": "keyword",
          "count": 245828,
          "index_count": 1213
        },
        {
          "name": "long",
          "count": 65340,
          "index_count": 1212
        },
        {
          "name": "nested",
          "count": 996,
          "index_count": 467
        },
        {
          "name": "object",
          "count": 56194,
          "index_count": 498
        },
        {
          "name": "scaled_float",
          "count": 43,
          "index_count": 43
        },
        {
          "name": "short",
          "count": 6464,
          "index_count": 64
        },
        {
          "name": "text",
          "count": 19555,
          "index_count": 1199
        }
      ]
    },
    "analysis": {
      "char_filter_types": [],
      "tokenizer_types": [],
      "filter_types": [],
      "analyzer_types": [
        {
          "name": "custom",
          "count": 1200,
          "index_count": 424
        }
      ],
      "built_in_char_filters": [],
      "built_in_tokenizers": [
        {
          "name": "icu_tokenizer",
          "count": 808,
          "index_count": 422
        },
        {
          "name": "nori_tokenizer",
          "count": 4,
          "index_count": 2
        },
        {
          "name": "whitespace",
          "count": 388,
          "index_count": 388
        }
      ],
      "built_in_filters": [
        {
          "name": "icu_folding",
          "count": 1200,
          "index_count": 424
        },
        {
          "name": "lowercase",
          "count": 1164,
          "index_count": 388
        }
      ],
      "built_in_analyzers": []
    },
    "versions": [
      {
        "version": "7.7.0",
        "index_count": 262,
        "primary_shard_count": 334,
        "total_primary_size": "2.6tb",
        "total_primary_bytes": 2906280792771
      },
      {
        "version": "7.12.0",
        "index_count": 957,
        "primary_shard_count": 2032,
        "total_primary_size": "19tb",
        "total_primary_bytes": 20991010639243
      }
    ]
  },
  "nodes": {
    "count": {
      "total": 3,
      "coordinating_only": 0,
      "data": 3,
      "data_cold": 3,
      "data_content": 3,
      "data_frozen": 3,
      "data_hot": 3,
      "data_warm": 3,
      "ingest": 3,
      "master": 3,
      "ml": 3,
      "remote_cluster_client": 3,
      "transform": 3,
      "voting_only": 0
    },
    "versions": [
      "7.12.0"
    ],
    "os": {
      "available_processors": 192,
      "allocated_processors": 192,
      "names": [
        {
          "name": "Linux",
          "count": 3
        }
      ],
      "pretty_names": [
        {
          "pretty_name": "Debian GNU/Linux 10 (buster)",
          "count": 3
        }
      ],
      "architectures": [
        {
          "arch": "amd64",
          "count": 3
        }
      ],
      "mem": {
        "total": "754.9gb",
        "total_in_bytes": 810611838976,
        "free": "7.7gb",
        "free_in_bytes": 8346095616,
        "used": "747.1gb",
        "used_in_bytes": 802265743360,
        "free_percent": 1,
        "used_percent": 99
      }
    },
    "process": {
      "cpu": {
        "percent": 14
      },
      "open_file_descriptors": {
        "min": 22853,
        "max": 24680,
        "avg": 23663
      }
    },
    "jvm": {
      "max_uptime": "30.3d",
      "max_uptime_in_millis": 2621213593,
      "versions": [
        {
          "version": "11.0.9.1",
          "vm_name": "OpenJDK 64-Bit Server VM",
          "vm_version": "11.0.9.1+1-post-Debian-1deb10u2",
          "vm_vendor": "Debian",
          "bundled_jdk": true,
          "using_bundled_jdk": false,
          "count": 3
        }
      ],
      "mem": {
        "heap_used": "351.2gb",
        "heap_used_in_bytes": 377191067584,
        "heap_max": "555gb",
        "heap_max_in_bytes": 595926712320
      },
      "threads": 968
    },
    "fs": {
      "total": "82.8tb",
      "total_in_bytes": 91071937265664,
      "free": "23.7tb",
      "free_in_bytes": 26166155014144,
      "available": "23.7tb",
      "available_in_bytes": 26165853024256
    },
    "plugins": [
      {
        "name": "analysis-icu",
        "version": "7.12.0",
        "elasticsearch_version": "7.12.0",
        "java_version": "1.8",
        "description": "The ICU Analysis plugin integrates the Lucene ICU module into Elasticsearch, adding ICU-related analysis components.",
        "classname": "org.elasticsearch.plugin.analysis.icu.AnalysisICUPlugin",
        "extended_plugins": [],
        "has_native_controller": false,
        "licensed": false,
        "type": "isolated"
      },
      {
        "name": "analysis-nori",
        "version": "7.12.0",
        "elasticsearch_version": "7.12.0",
        "java_version": "1.8",
        "description": "The Korean (nori) Analysis plugin integrates Lucene nori analysis module into elasticsearch.",
        "classname": "org.elasticsearch.plugin.analysis.nori.AnalysisNoriPlugin",
        "extended_plugins": [],
        "has_native_controller": false,
        "licensed": false,
        "type": "isolated"
      }
    ],
    "network_types": {
      "transport_types": {
        "security4": 3
      },
      "http_types": {
        "security4": 3
      }
    },
    "discovery_types": {
      "zen": 3
    },
    "packaging_types": [
      {
        "flavor": "default",
        "type": "tar",
        "count": 3
      }
    ],
    "ingest": {
      "number_of_pipelines": 2,
      "processor_stats": {
        "gsub": {
          "count": 0,
          "failed": 0,
          "current": 0,
          "time": "0s",
          "time_in_millis": 0
        },
        "script": {
          "count": 0,
          "failed": 0,
          "current": 0,
          "time": "0s",
          "time_in_millis": 0
        }
      }
    }
  }
}

To your point about heap dumping, you cannot take one without essentially pausing the JVM to take a snapshot of what it's doing, which means the node will become unresponsive.

That aside, your cluster stats explain quite a lot! A few things I can tell you from them:

  1. Compared to how we generally suggest a cluster is built, your nodes are singularly massive
  2. You've got about 20TB stored on each node
  3. You're running around 2100 shards per node
  4. Heap size is indeed substantial
  5. Each shard is about 10GB on average, which is super small

A typical cluster is made up of many nodes, each with <32GB of heap, anywhere from 1-6TB of disk space, and 500-800 shards, with each shard being 30-50GB in size.

So a few questions for you that'll help continue the discussion:

  1. What sort of data is this: time-based info such as logs/metrics, or something else?
  2. Are these nodes on physical hosts (aka bare metal)?

I understand that I won't be able to take a heap dump without pausing the JVM. I was hoping for an option to take the node out of the cluster, take the heap dump, and then let it join again.
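For example, I was imagining something along the lines of temporarily excluding the node via the allocation settings (the node name below is just a placeholder), although I realise that draining the shards off it would change its memory footprint, which rather defeats the purpose:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "node-to-dump"
  }
}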

We have had some issues where a node became unresponsive, and that typically has quite a bit of fall-out (ingest jobs failing, the obvious query timeouts, etc.).

I am fully aware of the unconventional setup we have. The nodes are indeed on our own bare metal. I'm also familiar with the 32 GB / compressed oops threshold and with the challenges and overhead of garbage collection on such a large heap.

The cluster has, however, grown into the state that it's in: operations dealt with OOMs and the parent circuit breaker tripping by adding more heap. Before barging in and advocating changing the cluster's setup, I was hoping to better understand what is going on within the JVM.

The average shard size is indeed way too low. However, that's mostly due to some weird and small indices that are created 1:1 for some 'thing' within our applications. They are mostly for monitoring. Combining these into a single index / using ILM for time- or size-based indices definitely makes sense.
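Something along the lines of a simple rollover policy for those small monitoring-style indices is what I have in mind (the policy name and thresholds below are placeholders, not something we've settled on):

PUT _ilm/policy/small-monitoring-indices
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "30d"
          }
        }
      }
    }
  }
}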

There are about 130 main production indices. They account for ~60 TB of the total data and ~4700 of the shards (pri+rep). The biggest indices of 1+TB have 100-200 shards with shard sizes of 5 to 20GB. Maybe not ideal, but probably not the biggest fish to fry.
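For what it's worth, I've been eyeballing the per-shard sizes with something like:

GET _cat/shards?v&h=index,shard,prirep,store&s=store:desc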

The data is something else :slight_smile: Datasets that we build based on source data, with incremental ingest and occasional rebuilds from source.

Thanks for the feedback so far! I'll sit with infra to discuss how we can manage smaller (and more) nodes given the hardware that we have.

Also, I'll investigate how we are using fielddata for text fields ... I have a suspicion that this may have caused issues in combination with large aggregations for term discovery use cases.
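As a first step I'll probably check which fields are actually loading fielddata, with something like:

GET _cat/fielddata?v&h=node,field,size&s=size:desc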

I see that you are using dense vectors in your indices. I have not used these myself, so I'm not sure what impact they would have on heap usage.

I believe there have been a good number of improvements in more recent versions, so upgrading might be a good idea.

Thanks for the pointer! I'll look into that.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.