Elasticsearch queue issue after upgrading from 8.6.2 to 8.12.1/8.12.2

The query with 0 fails, so there is no profiling output.
For me the problem is solved; I have a workaround.

With another index I have a similar problem: 6 s vs 44 ms.
I am sure it can be reproduced with any large index.

How can I share a big file here?

For big text files we normally recommend something like a GitHub gist.

That's true, but it's not obviously related - it's doing the same amount of work either way. Instead it sounds like there's been an unexpected regression in the performance of a certain kind of query. It'd be very useful if you could pin it down so we can open a bug report. I expect that enabling the query slow log should help, but eliminating clients one-by-one is a reliable approach too.
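
For reference, the search slow log is enabled per index with dynamic settings; this is only a minimal sketch, with a placeholder index name and example thresholds rather than values from this thread:

    PUT /my-index/_settings
    {
      "index.search.slowlog.threshold.query.warn": "5s",
      "index.search.slowlog.threshold.query.info": "1s",
      "index.search.slowlog.threshold.fetch.warn": "1s"
    }

Entries then show up in the node's slow log together with the query source, which makes it easier to spot the offending request shape.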

The tricky part here is that we have not seen this happen in any of our dev or stage clusters. We sit behind a GraphQL use case with hundreds of clients, each of which can send a few unique queries (with changing parameters), and we can't just turn the clients off. We haven't been able to force the issue in stage, and we obviously don't want to put prod back into a scenario where we impact clients.

We have slow query logging on the calling API side, and in our non-comprehensive analysis (there are thousands of logged queries during these queue buildups) we didn't notice anything abnormal or any spike in volume above normal. Another oddity is that it would only start happening after 4-5 hours of running fine, regardless of when we turned traffic back on.

We are back on 8.6.2 in the region/cluster where we saw the impact; other clusters in that region are still on 8.12.2. We are really hoping we can help find the root cause here and get a fix out, because restoring those clusters (with many, many datasets that require intricate execution to keep fully matched with our other region/datastores upon restore) would be a huge undertaking.

One other note: the impacted region's cluster has only one large (multi-TB) index in it, and the shards are above the usual <50 GB recommendation, at ~85 GB right now. I thought I would bring that up as a potential clue, as I assume those shards will have more/larger segments than most of our indices in other clusters.
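
As a side note for readers wanting to check the same thing on their own clusters, shard and segment sizes can be inspected with the cat APIs (the index name below is a placeholder):

    GET _cat/shards/my-index?v&h=index,shard,prirep,store
    GET _cat/segments/my-index?v&h=index,shard,segment,size,docs.count

The second call lists every segment per shard, which gives a rough idea of how much data a per-segment pass would have to touch.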

This is my hot threads result, captured while the query is active.
My index has 10 shards with about 50 segments per shard and an average size of 45 GB.

I cannot do too many tests on the production cluster.

https://gist.github.com/AmosIturan/8d5fa73fdeb0f821616f5cc6a33e0a8a

Thank you for the Hot threads output.

I see you are running the query with the profiler set to true. I guess this is part of your investigation?
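
For context, the profiler here is just the profile flag on the search request; a minimal sketch (index name and query are placeholders) looks like:

    GET /my-index/_search
    {
      "profile": true,
      "query": {
        "match_all": {}
      }
    }

The response then includes a per-shard breakdown of where time was spent in the query and aggregation phases.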

The output confirms one of my suspicions, as we see lots of threads hitting this method:

    app/org.elasticsearch.server@8.12.2/org.elasticsearch.index.fielddata.FieldData$9.nextValue(FieldData.java:313)
    app/org.elasticsearch.server@8.12.2/org.elasticsearch.search.aggregations.bucket.terms.MapStringTermsAggregator$StandardTermsResults.collectZeroDocEntriesIfNeeded(MapStringTermsAggregator.java:389)
    app/org.elasticsearch.server@8.12.2/org.elasticsearch.search.aggregations.bucket.terms.MapStringTermsAggregator$ResultStrategy.buildAggregations(MapStringTermsAggregator.java:247)
    app/org.elasticsearch.server@8.12.2/org.elasticsearch.search.aggregations.bucket.terms.MapStringTermsAggregator.buildAggregations(MapStringTermsAggregator.java:111)
    app/org.elasticsearch.server@8.12.2/org.elasticsearch.search.profile.aggregation.ProfilingAggregator.buildAggregations(ProfilingAggregator.java:75)

This method collects the zero-doc buckets, which requires iterating over the data in all segments. This means that if the query is being collected in parallel, we will be doing more work than before, which might explain the behaviour you are describing.

I have not been able to reproduce it locally because, as you noted, volume is a factor. Still, we believe this is not the right behaviour, and we just merged a PR that disables parallel collection when min_doc_count in the terms aggregation is 0, so we do the same amount of work as before.
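
For readers who want to check whether they are affected, this is roughly the shape of request involved: a terms aggregation with min_doc_count set to 0, which asks for buckets even for terms that match no documents. The index and field names below are placeholders:

    GET /my-index/_search
    {
      "size": 0,
      "aggs": {
        "all_values": {
          "terms": {
            "field": "my_field.keyword",
            "min_doc_count": 0
          }
        }
      }
    }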


Yes, it was running from Kibana Dev Tools, and I left profile: true from a previous test.

In my case, because of the index size, it actually made the service so slow that it was not really possible to work, and CPU on all nodes rose to 100%.

It will probably affect others, and it will be hard for them to know where this is coming from.

FWIW, I don't see any references to min_doc_count in our code base or logs, so unfortunately I don't think that fix covers our regression. We don't do many aggregations in our use case.

Are you able to get hot threads and share it with us? It should provide lots of information about your case.

At any time, or are you only interested in when the queues are there?

When the queues build up, so we can see where we are spending time.

That would require production impact to clients, which is not something we can do on purpose. We have also reverted the cluster where we saw this issue to 8.6.2 at this time.

I understand. It is going to be difficult to understand your case; you might want to try running hot threads even on the current system when it is busy. Maybe it will show some clues.
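
For reference, hot threads can be captured at any time with the nodes hot threads API; the parameters below are just example values:

    GET /_nodes/hot_threads?threads=5&interval=500ms

Taking a few samples while the queues are building up, and a few while the cluster is healthy, makes it easier to spot which code paths are new.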

Hey all, for future readers: we have merged a fix for this problem, see Disable parallel collection for terms aggregation with min_doc_count equals to 0 by iverase · Pull Request #106156 · elastic/elasticsearch · GitHub.

Just to clarify, that fix does not (AFAIK) address the issue I reported. It addresses the issue that Amos66 reported but does not cover the issue we have been facing, again AFAIK.

Thanks for clarifying. I did not realize that the thread included two separate issues with similar symptoms. We'd love to get more details about the request that is triggering the high CPU usage, like we did for the min_doc_count scenario.

We would too; unfortunately, we don't have a repeatable way to make it occur without impacting production traffic. We have several clusters that were updated to 8.12.x. We only see severe issues on one with medium-sized infra/high traffic, but we see some smaller impact on large-sized infra/high traffic, so we're wondering if we are "brute forcing" our way out of the impact there.

The impacted cluster was rebuilt on 8.6.2 and has had no issues since; we are trying to go to 8.9.2 on that cluster next week.

The only lead I have right now is that I noticed some queries that should be extremely performant/are super simple running quite slowly in the updated cluster vs. the non-updated cluster that has the exact same data. Will dig into that more.

Thank you
