Nodes regularly running out of heap

Hi,

Recently we have been experiencing unexpected Elasticsearch node restarts with "Terminating due to java.lang.OutOfMemoryError: Java heap space". We do not know exactly why this happens or what is causing it.

We have increased the node heap setting a few times, from 32GB to 52GB and now to 64GB per node, and the restarts are still happening.

Here are the details of our 10 nodes, each having:

  • 256GB RAM (64GB Heap)
  • 64 vCPU
  • Elasticsearch 8.17.5

Note: We have two node groups, g1 and g2, with 5 nodes each.

We have two main indices (development index and production index) on the cluster. Each index has the following properties:

  • ~100M documents (1.6b total)
  • Multiple nested mappings
  • Large text fields
  • HNSW index over 768-dimensional dense vectors
  • Size of the index is 15TB
  • 280 shards

Note: The production index has one replica; the development index has none. In terms of data and mappings, the two indices are equivalent.
Note: The development index is on the g1 node group and the production index is on the g2 node group.

Recently, we made some changes to the indices, after which these issues slowly started to appear. I am not entirely sure whether this is causal or coincidental, but I wanted to share the details for context:

  • We have added an expensive top_hits aggregation over a maximum of 40 aggregation buckets, each returning a size of 10 (a rough sketch of the aggregation follows after the notes below).
  • We have added a search_as_you_type (index_options=offsets) mapping to some of our text fields. These text fields can be 1-5 paragraphs long (a mapping sketch also follows below).
  • We have added the following sub-field mappings to the company_name field to build autocomplete suggesters:
{
  "keyword": {"type": "keyword", "ignore_above": 256},
  "keyword_lowercased": {
    "type": "keyword",
    "normalizer": "lowercased_keyword",
    "ignore_above": 256
  },
  "first_token": {
    "type": "text",
    "analyzer": "standard_one_token_limit",
    "search_analyzer": "standard_lowercase"
  },
  "edge_ngram": {
    "type": "text",
    "analyzer": "standard_edge_ngrams",
    "search_analyzer": "standard_lowercase"
  },
  "first_token_edge_ngram": {
    "type": "text",
    "analyzer": "standard_first_token_edge_ngrams",
    "search_analyzer": "standard_lowercase"
  },
  "suggest": {"type": "completion"}
}

Note: The first_token, edge_ngram, first_token_edge_ngram, and suggest fields are NOT yet used in our application; they are only indexed for later development efforts, so they are not used at query time currently.
Note: Every night, we update our index with new incoming data.
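
For reference, the added aggregation is shaped roughly like the sketch below (the index name, field, and aggregation names are placeholders, not our exact query):

POST /production_index/_search
{
  "size": 0,
  "aggs": {
    "by_group": {
      "terms": { "field": "company_name.keyword", "size": 40 },
      "aggs": {
        "top_docs": {
          "top_hits": { "size": 10 }
        }
      }
    }
  }
}

Each of the up to 40 terms buckets returns its top 10 hits, including the full _source of those documents.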
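
The search_as_you_type change, per affected field, looks roughly like this (the field name is just an example):

"description": {
  "type": "search_as_you_type",
  "index_options": "offsets"
}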

I would like to emphasize that:

  • We operated the indices without a problem for a year or so on a 32GB heap.
  • Recently, we saw out-of-heap issues and increased the heap to 52GB. We were still seeing issues, so we increased it to 64GB.
  • Yesterday we still saw issues with 64GB of heap.

Can anyone help me figure out what could be causing the problems we are seeing?

Thanks in advance.

Hi,

Given your setup and recent changes, the OutOfMemoryError: Java heap space is most likely caused by query-time heap pressure, especially from:

  1. top_hits aggregation: Even with 40 buckets and size 10, if your documents have large or nested fields, the heap usage can grow fast. Consider using _source filtering to reduce memory load:
"top_hits": {
  "_source": ["company_name", "id"], 
  "size": 10
}
  2. Nested mappings: These increase the Lucene-level document count and the memory cost during aggregations.
  3. Search-as-you-type + custom analyzers: Even if not used at query time, indexing large text fields with multiple sub-fields (like edge ngrams) increases segment size and memory usage during merges or fielddata load.

Recommendations:

  • Enable heap dumps (-XX:+HeapDumpOnOutOfMemoryError) and analyze them with a tool like Eclipse MAT (see the jvm.options sketch after this list).
  • Profile queries by setting "profile": true on the search request to identify the most expensive parts; it reports timings rather than memory, but the slowest components are usually a good lead (example after this list).
  • Monitor the circuit breakers via the node stats API (example after this list) and consider capping the parent breaker:
indices.breaker.total.limit: 70%
  • Reduce the heap back below the ~32GB compressed-oops threshold (e.g. ~31GB) and rely more on the OS file cache; this also gives better GC behavior (see the heap sizing snippet after this list).
  • Isolate unused or heavy mappings into a separate index or disable indexing temporarily.
  • Watch the GC logs (a tool like GCEasy can help analyze them) and consider tuning young/old generation sizing if needed.
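
For the heap dump, a custom JVM options file is enough; note that the default Elasticsearch jvm.options may already enable dumps on OOM, so check it first. The file name and dump path below are only examples:

# config/jvm.options.d/heapdump.options
# Write a heap dump when an OutOfMemoryError occurs, to a directory with enough free space
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/lib/elasticsearch/heapdumps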
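
Profiling is a single flag on the search body; run it with your real query and aggregations (the index name and match_all query here are placeholders):

POST /production_index/_search
{
  "profile": true,
  "query": { "match_all": {} }
}

The profile section of the response breaks down the time spent per query and aggregation component, which is usually a good hint at where the heap is going.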
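
Per-node breaker usage can be read from the node stats API, and the parent breaker limit can be set dynamically so that requests trip the breaker instead of exhausting the heap (70% is just a starting point):

GET _nodes/stats/breaker

PUT _cluster/settings
{
  "persistent": {
    "indices.breaker.total.limit": "70%"
  }
}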
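
If you move back to a smaller heap, keep it just below the compressed-oops cutoff and set the minimum and maximum to the same value (the file name is an example):

# config/jvm.options.d/heap.options
-Xms31g
-Xmx31g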

Hope this helps you isolate the issue.

Best regards,