ES runs out of heap when issuing _cluster/stats (caused by CompletionStats)

We observed this behavior in several of our production Elasticsearch clusters and were also able to reproduce it locally.
If an index contains a lot of data for the completion suggester, issuing "_cluster/stats?pretty" causes a sudden out-of-heap error.
The call creates a steep increase in required heap (e.g. from 300 MB to more than 3 GB within fractions of a second).
We traced the issue down with the YourKit profiler.
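
For reference, reproducing it only takes the stats call itself; the _cat columns below are just what we use to watch the heap:

    # triggers the sudden heap increase
    GET /_cluster/stats

    # watch heap usage before and after the call
    GET /_cat/nodes?v&h=name,heap.current,heap.percent,heap.max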

The increase is caused by the method:
'public CompletionStats get(String... fieldNamePatterns)'
in file
'elasticsearch/server/src/main/java/org/elasticsearch/index/engine/CompletionStatsCache.java'

There is a comment plus code that already seems to describe the root cause:
'// TODO: currently we load up the suggester for reporting its size'
'final long fstSize = ((CompletionTerms) terms).suggester().ramBytesUsed();'

Even if we size the ES heap big enough to handle that request, the memory never seems to be released again (manually triggered GCs don't help either).
See this screenshot for an example of where all the heap resides.

As we don't need that information from CompletionStats, is there a way to disable it? We can't guarantee that nobody is using a management tool that issues the stats API call.

Of course, finding another way to get that stats info would be best.

What version are you using?

AIUI the memory used and retained by this feature is just the memory needed for searches involving "type": "completion" fields. Do you have a lot of those fields? If you're not using these fields for suggest queries then it would be best to trim down your mappings.
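
For anyone following along, such a field is declared in the mappings roughly like this (index and field names are made up for illustration):

    PUT /my-index
    {
      "mappings": {
        "properties": {
          "title_suggest": {
            "type": "completion"
          }
        }
      }
    }

Trimming down then means removing the "type": "completion" entries that no search actually uses.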

Thanks for responding.

Completion is a key feature in that use case; we can't just remove it.
There is also no problem during searches: everything works fast and smoothly without using much heap, as expected. The problem only occurs when someone issues the '_cluster/stats' call.

Edit: We have observed this starting with 8.5.x (it was probably present before, but this is the earliest version in which we saw it).

This indicates that you have a lot of completion fields which are unused by your searches.

The screenshot you shared shows that a few IndexShard instances retain quite a lot of heap, but could you expand that to confirm that it really is completion stats that causes you a problem?

Is it just GET _cluster/stats which causes the problem, or do you see the same issues with GET _nodes/stats and GET _stats?

Sorry for coming back a bit late, I had to set up a system first.
Yes, all of these URLs lead to the same behavior.

See this screenshot for why I believe it's caused by the completion suggester: all the data in the IndexShard comes from CompletionFieldsProducer.

Currently I'm reducing the amount of data that is put into the completion fields (source) to see if I can solve the problem that way.

The IndexShard instance you picture retains ~12.2MiB, of which ~3.3MiB (27%) is related to completion fields.

AIUI size = 15 indicates you have 15 completion fields in this shard's mappings. Do you need all of those? The fact that searches work but stats don't indicates that a substantial fraction of your completion fields are not being used by searches.
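
One quick way to check (the index name is a placeholder) is to pull the mappings and count the "type": "completion" entries:

    GET /my-index/_mapping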

Yes, this is a kind of separation of the data (by language). As we use contexts anyway, would moving the per-language separation into the context and using just one completion field make the situation better from your perspective?
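
To illustrate what I mean: one completion field with a category context for the language, and queries filtered on that context (all names here are just examples):

    PUT /my-index
    {
      "mappings": {
        "properties": {
          "suggest": {
            "type": "completion",
            "contexts": [
              { "name": "language", "type": "category" }
            ]
          }
        }
      }
    }

    POST /my-index/_search
    {
      "suggest": {
        "by-language": {
          "prefix": "ber",
          "completion": {
            "field": "suggest",
            "contexts": {
              "language": [ "en" ]
            }
          }
        }
      }
    }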

And thanks again for helping out!

If you're really using all those fields in your searches then that's ok; you can leave them alone. I'm just trying to understand why the stats calls are loading completion data that isn't loaded by searches.

I did another run with only one field and less data, to make sure ES keeps running.
I took one heap dump right before the stats call and one right after it.
Heap usage increases by nearly 400 MB, and this data also stays on the heap. It's not temporary, but it doesn't increase further when issuing the stats call a second time.
Btw: the IndexService size in the heap dump roughly correlates with this part of the stats output:

"completion" : {
"size_in_bytes" : 330962406
},
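
To see how that size breaks down per field, the stats API can also report per-field completion stats (the index name is a placeholder; note that this call triggers the same on-demand loading discussed above):

    GET /my-index/_stats/completion?completion_fields=*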

Before: [heap-dump screenshot taken before the stats call]

(I was only allowed to include one image per post, so here is the 2nd:)

After: [heap-dump screenshot taken after the stats call]

Edit: if helpful, I can provide both heap dumps.

That's the behaviour I would expect. This data is loaded on-demand, either when needed for a search involving the completion field or when computing their stats. Since it's not being loaded by searches, only by stats calls, it seems that you aren't really using all these fields in your searches.

FWIW I expect the data is dropped again when the underlying segments change (i.e. on a refresh).

All completions come from that field and we are using it (it was just not used yet when the first heap dump was taken, right after importing the data).
Searching doesn't use any noticeable amount of memory. There is also an API call (in our API) which only delivers completion results, i.e. it uses only the completion feature. The only thing I could imagine is that a single search doesn't fill those structures completely, but querying all possible combinations would.

Anyway, we are now removing this and will use another approach to get completions, because even if this is the intended behavior, the heap pressure is too high and too unpredictable (e.g. we don't know what data customers will import).
