ES runs out of heap when issuing _cluster/stats (caused by CompletionStats)

We observed this behavior in several of our production Elasticsearch clusters and were also able to reproduce it locally.
If an index contains a lot of data for the completion suggester, issuing "_cluster/stats?pretty" causes a sudden out-of-heap error.
The call creates a steep increase in required heap (e.g. from 300 MB to more than 3 GB within fractions of a second).
We traced the issue down with the YourKit profiler.
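
For reference, reproducing it only takes the stats call itself; the _cat columns below are just what we use to watch the heap:

    # triggers the sudden heap increase
    GET /_cluster/stats

    # watch heap usage before and after the call
    GET /_cat/nodes?v&h=name,heap.current,heap.percent,heap.max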

The increase is caused by the method:
'public CompletionStats get(String... fieldNamePatterns)'
in file
'elasticsearch/server/src/main/java/org/elasticsearch/index/engine/CompletionStatsCache.java'

There is a comment plus code that already seems to describe the root cause:
'// TODO: currently we load up the suggester for reporting its size'
'final long fstSize = ((CompletionTerms) terms).suggester().ramBytesUsed();'

Even if we size the ES heap big enough to handle that request, the memory never seems to be released again (manually triggered GCs don't help either).
See this screenshot for an example of where all the heap resides.

As we don't need that information from CompletionStats, is there a way to disable it? We can't guarantee that nobody is using a management tool that issues the stats API call.

Of course, finding another way to get that stats info would be best.

What version are you using?

AIUI the memory used and retained by this feature is just the memory needed for searches involving "type": "completion" fields. Do you have a lot of those fields? If you're not using these fields for suggest queries then it would be best to trim down your mappings.
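
For anyone following along, such a field is declared in the mappings roughly like this (index and field names are made up for illustration):

    PUT /my-index
    {
      "mappings": {
        "properties": {
          "title_suggest": {
            "type": "completion"
          }
        }
      }
    }

Trimming down then means removing the "type": "completion" entries that no search actually uses.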

Thanks for responding.

Completion is a key feature in that use case; we can't just remove it.
There is also no problem during searches: everything works fast and smoothly without using much heap, as expected. The problem only occurs when someone issues the '_cluster/stats' call.

Edit: We have observed this starting with 8.5.x (it was probably present before, but this is the earliest version in which we saw it).

This indicates that you have a lot of completion fields which are unused by your searches.

The screenshot you shared shows that a few IndexShard instances retain quite a lot of heap, but could you expand that to confirm that it really is completion stats that causes you a problem?

Is it just GET _cluster/stats which causes the problem, or do you see the same issues with GET _nodes/stats and GET _stats?

Sorry for coming back a bit late, I had to set up a system first.
Yes, all of these URLs lead to the same behavior.

See this screenshot for why I believe it's caused by the completion suggester: all the data in the IndexShard comes from CompletionFieldsProducer.

Currently I'm reducing the amount of data that is put into the completion fields (source) to see if I can solve the problem that way.

The IndexShard instance you picture retains ~12.2MiB, of which ~3.3MiB (27%) is related to completion fields.

AIUI size = 15 indicates you have 15 completion fields in this shard's mappings. Do you need all of those? The fact that searches work but stats don't indicates that a substantial fraction of your completion fields are not being used by searches.
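
One quick way to check (the index name is a placeholder) is to pull the mappings and count the "type": "completion" entries:

    GET /my-index/_mapping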

Yes, this is a kind of separation of the data (by language). As we use contexts anyway, would moving the per-language separation into the context and using just one completion field make the situation better from your perspective?
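
To illustrate what I mean: one completion field with a category context for the language, and queries filtered on that context (all names here are just examples):

    PUT /my-index
    {
      "mappings": {
        "properties": {
          "suggest": {
            "type": "completion",
            "contexts": [
              { "name": "language", "type": "category" }
            ]
          }
        }
      }
    }

    POST /my-index/_search
    {
      "suggest": {
        "by-language": {
          "prefix": "ber",
          "completion": {
            "field": "suggest",
            "contexts": {
              "language": [ "en" ]
            }
          }
        }
      }
    }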

And thanks again for helping out!

If you're really using all those fields in your searches then that's ok; you can leave them alone. I'm just trying to understand why the stats calls are loading completion data that isn't loaded by searches.

I did another run with only one field and less data, to make sure ES keeps running.
I took one heap dump right before the stats call and one right after it.
Heap usage increases by nearly 400 MB, and this data also stays on the heap. It's not temporary, but it doesn't increase further when issuing the stats call a second time.
Btw: the IndexService size in the heap dump roughly correlates with this part of the stats output:

"completion" : {
"size_in_bytes" : 330962406
},
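
To see how that size breaks down per field, the stats API can also report per-field completion stats (the index name is a placeholder; note that this call triggers the same on-demand loading discussed above):

    GET /my-index/_stats/completion?completion_fields=*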

Before: [heap-dump screenshot taken before the stats call]

(I was only allowed to include one image per post, so here is the 2nd:)

After: [heap-dump screenshot taken after the stats call]

Edit: if helpful, I can provide both heap dumps.

That's the behaviour I would expect. This data is loaded on-demand, either when needed for a search involving the completion field or when computing their stats. Since it's not being loaded by searches, only by stats calls, it seems that you aren't really using all these fields in your searches.

FWIW I expect the data is dropped again when the underlying segments change (i.e. on a refresh).

All completions come from that field and we are using it (it was just not used yet when the first heap dump was taken, right after importing the data).
Searching doesn't use any noticeable amount of memory. There is also an API call (in our API) which only delivers completion results, i.e. it uses only the completion feature. The only thing I could imagine is that a single search doesn't fill those structures completely, but querying all possible combinations would.

Anyway, we are now removing this and will use another approach to get completions, because even if this is the intended behavior, the heap pressure is too high and too unpredictable (e.g. we don't know what data customers will import).
