Hi,
Recently we have been experiencing unexpected Elasticsearch node restarts with "Terminating due to java.lang.OutOfMemoryError: Java heap space". We do not know exactly why this happens or what is causing it.
We have increased the node heap size a few times, from 32GB to 52GB and now to 64GB per node, and it is still happening.
Here are the details of our 10 nodes, each having:
- 256GB RAM (64GB Heap)
- 64 vCPU
- Elasticsearch 8.17.5
Note: We have 2 node groups g1 and g2, each having 5 nodes.
We have two main indices (development index and production index) on the cluster. Each index has the following properties:
- ~100M documents (1.6b total)
- Multiple nested mappings
- Large text fields
- HNSW index for 768-dimensional dense vectors (see the mapping sketch after the notes below)
- Index size of 15TB
- 280 shards
Note: The production index has one replica; the development index does not have a replica. In terms of data and mappings, both indices are equivalent.
Note: The development index is on the g1 node group and the production index is on the g2 node group.
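For context, here is a simplified sketch of what the relevant parts of each index mapping look like (the field names and the cosine similarity below are placeholders, not our exact mapping):

PUT my-index
{
  "mappings": {
    "properties": {
      "description": { "type": "text" },
      "positions": {
        "type": "nested",
        "properties": {
          "title": { "type": "text" }
        }
      },
      "embedding": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}

With index: true on the dense_vector field, Elasticsearch builds the HNSW structures referred to above.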
Recently, we have been making some changes to the indices, after which these issues slowly started to appear. I am not entirely sure whether it is causal or coincidental, but I wanted to share these details for context:
- We have added an expensive top_hits aggregation for a maximum of 40 aggregation buckets, each returning a size of 10 (a sketch of the aggregation shape is included after the notes below).
- We have added a search_as_you_type (index_options=offsets) mapping to some of our text fields. These text fields can be one to five paragraphs long (a sketch of this mapping follows the company_name mapping below).
- We have added the following field mappings to the field company_name to build autocomplete suggesters:
{
  "keyword": {"type": "keyword", "ignore_above": 256},
  "keyword_lowercased": {
    "type": "keyword",
    "normalizer": "lowercased_keyword",
    "ignore_above": 256
  },
  "first_token": {
    "type": "text",
    "analyzer": "standard_one_token_limit",
    "search_analyzer": "standard_lowercase"
  },
  "edge_ngram": {
    "type": "text",
    "analyzer": "standard_edge_ngrams",
    "search_analyzer": "standard_lowercase"
  },
  "first_token_edge_ngram": {
    "type": "text",
    "analyzer": "standard_first_token_edge_ngrams",
    "search_analyzer": "standard_lowercase"
  },
  "suggest": {"type": "completion"}
}
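For the search_as_you_type change mentioned above, the added mapping looks roughly like this (the field name is a placeholder for several of our text fields):

PUT my-index/_mapping
{
  "properties": {
    "long_text_field": {
      "type": "search_as_you_type",
      "index_options": "offsets"
    }
  }
}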
Note: The first_token, edge_ngram, first_token_edge_ngram, and suggest sub-fields of the company_name mapping above are NOT yet used in our application. They are only indexed for later development efforts, so they are not used at query time currently.
Note: Every night, we update our index with new incoming data.
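To give an idea of the shape of the new top_hits aggregation mentioned above (field and aggregation names are illustrative and the real query is more involved; the relevant part is up to 40 buckets, each returning 10 hits):

POST my-index/_search
{
  "size": 0,
  "aggs": {
    "by_company": {
      "terms": {
        "field": "company_name.keyword",
        "size": 40
      },
      "aggs": {
        "top_docs": {
          "top_hits": { "size": 10 }
        }
      }
    }
  }
}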
I would like to emphasize that:
- We have operated the index without a problem for a year or so on 32GB heap.
- Recently, we started seeing heap out-of-memory errors and increased the heap to 52GB. We were still seeing them, so we increased it to 64GB.
- Yesterday we still saw issues with 64GB of heap.
Can anyone help me figure out what could be causing the problems we are seeing?
Thanks in advance.