Elasticsearch queue issue after upgrading from 8.6.2 to 8.12.1/8.12.2

We have been running Elasticsearch 8.6.2 for several months now with very few issues other than some query (un)optimizations. We run several clusters duplicated across two production regions, plus two non-prod environments that also have several clusters each.

Upon switching to 8.12.1 (and subsequently 8.12.2, because we saw changelog notes regarding connection issue fixes) in one of our production regions, we saw a large impact on incoming traffic. This did not happen in our dev environment, either of our stage regions, or even our one low-volume, single-region canary prod deployment. It seems volume is a key ingredient in exposing this issue.

It appears that in 8.10 the concept of a 'search_worker' thread pool was introduced, which, if I am understanding correctly, deviates from the longstanding behavior of one query per shard at a time and now allows multiple segments of a shard to be queried concurrently. I bring this up because the impact to traffic (timeouts, rejected transactions) aligns with a huge spike in search and search_worker thread counts, as well as queues for each of those pools, despite nothing in our logs indicating a spike in incoming traffic. The queues only seem to show up several hours after we deploy and are very difficult to force in lower-volume environments. They also seem to resolve themselves, but follow a rollercoaster pattern once they start. We have ample CPU headroom (peak usage was only 20% on the data nodes at the time), heap has headroom, lots of memory on the nodes, etc.

Has anyone come across this behavior? Any suggestions on what to do or troubleshoot?

One detail I am hesitant to share, since in the past people tend to check out at this point: we do have a custom plugin. This plugin filters the result set based on security permissions it retrieves from an API call that is highly performant and lightweight, and we have been using it for years at volumes of hundreds of thousands of queries at a time. I bring this up because we have had connection timeouts/issues there before that led to queues, but those would not resolve themselves and we would see error logs; this time we do not see any error logs and the queues resolve themselves. We stripped our changes down to 8.12.2 only - no other changes at all - to isolate the issue.

Any help is much appreciated

Same here, upgraded with ECK from 8.10.2 to 8.12.2.
All nodes with high CPU (100%) and a huge spike in search and search_worker.
No errors in the logs, only timeouts due to the high CPU.

Interesting that you came from 8.10.2; I believe the search_worker pool had already been introduced by then. Are you seeing queues in your search and search_worker thread pools on 8.12.2, or just high CPU? What did your CPU peak at before the upgrade?
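If it helps, the queues can be watched directly with the cat thread pool API - a rough sketch, with the column list just a suggestion:

GET _cat/thread_pool/search,search_worker?v&h=node_name,name,active,queue,rejected,completed

Polling that during an episode should show whether the queueing is cluster-wide or concentrated on a few nodes.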

Big jump. From ~0 on 8.10.2 to an average of 1k search_worker and 160 search.

It is a big cluster.
On a smaller one that I upgraded 2 weeks ago, I can see the history. It was also rising, but went down about an hour after the upgrade.

Got it. That sounds like our situation, except we spiked to a 1k search queue and a few-hundred search_worker queue. Ours happened on our 4th biggest (of 7) cluster; it receives a lot of query traffic and has exactly half the VM resources per node (which affects the fixed thread pool defaults) compared to our biggest 3 clusters, where we haven't seen the queues (nor on our smallest clusters). Very strange, and it puts us in a bad position of trying to rebuild from snapshots. We put one region back to 8.6.2 with a manual rebuild and have had no issues so far.
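For anyone comparing node sizes: the resolved size and queue_size of those fixed pools on each node can be pulled from the nodes info API - a quick sketch, where filter_path is only there to trim the response:

GET _nodes/thread_pool?filter_path=nodes.*.name,nodes.*.thread_pool.search,nodes.*.thread_pool.search_worker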

In our case it seems that a Grafana query directly against Elasticsearch caused the problem. After disabling the datasource, Elasticsearch CPU went down and there are no more queues.
We will investigate more tomorrow.

How did you find that? There are no direct Grafana calls in our use case, though we do use the Prometheus exporter.

We found it by starting to shut down all active clients one by one.
When I try to run the generated query in Kibana, it times out and CPU rises to 100%.
I believe you have a similar problem with a query that for some reason takes too long to run on 8.12.2.
For me it was running well on 8.10.2.

We will check with new Grafana, but this will take some time.

Were you at 100% utilization until you turned that off? Our queues resolved themselves within a few minutes (after impacting thousands of transactions) and would show back up again periodically. It is nearly impossible for us to pin the trigger down because hundreds of different types of queries are being run, and we can't just turn them off.

What was the query exactly?

That's true, but it's not obviously related - it's doing the same amount of work either way. Instead it sounds like there's been an unexpected regression in the performance of a certain kind of query. It'd be very useful if you could pin it down so we can open a bug report. I expect that enabling the query slow log should help, but eliminating clients one-by-one is a reliable approach too.
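For the slow log, something along these lines should do it - a minimal sketch, with a placeholder index pattern and thresholds you would want to tune for your own latency profile:

PUT /my-logs-*/_settings
{
    "index.search.slowlog.threshold.query.warn": "5s",
    "index.search.slowlog.threshold.query.info": "1s",
    "index.search.slowlog.threshold.fetch.warn": "1s"
}

Queries that cross those thresholds get written to the slow log along with their source, which should make the offending client much easier to spot.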

Assuming it is just one kind of query which has slowed down, you should be seeing timeouts or slowness in your clients too. That'd be a good way to find the trigger.

This is the query. I removed the private data.
It was running well on 8.10.2 but is not working on 8.12.2. I use ECK with 20 nodes and 10 CPUs per pod.
When I run it in Kibana, I get a timeout.

{
    "size": 0,
    "query": {
        "bool": {
            "filter": [{
                    "range": {
                        "@timestamp": {
                            "gte": 1710050816992,
                            "lte": 1710051116992,
                            "format": "epoch_millis"
                        }
                    }
                }, {
                    "query_string": {
                        "analyze_wildcard": true,
                        "query": "***"
                    }
                }
            ]
        }
    },
    "aggs": {
        "2": {
            "terms": {
                "field": "***.keyword",
                "size": 500,
                "order": {
                    "_key": "asc"
                },
                "min_doc_count": 0
            },
            "aggs": {
                "3": {
                    "date_histogram": {
                        "field": "@timestamp",
                        "min_doc_count": "0",
                        "extended_bounds": {
                            "min": 1710050816992,
                            "max": 1710051116992
                        },
                        "format": "epoch_millis",
                        "fixed_interval": "1m"
                    },
                    "aggs": {}
                }
            }
        }
    }
}

I think without seeing more detail here it's not going to be possible to pin this down.

It is not really relevant to this problem. Anyway, I modified the query to be simpler and it is still failing.

POST applogs-*/_search 
{
    "size": 0,
    "query": {
        "bool": {
            "filter": [{
                    "range": {
                        "@timestamp": {
                            "gte": 1710050816992,
                            "lte": 1710051116992,
                            "format": "epoch_millis"
                        }
                    }
                }, {
                    "query_string": {
                        "analyze_wildcard": true,
                        "query": "log_info.keywords:send_cmd"
                    }
                }
            ]
        }
    },
    "aggs": {
        "2": {
            "terms": {
                "field":"log_info.source_machine_name.keyword",
                "size": 500,
                "order": {
                    "_key": "asc"
                },
                "min_doc_count": 0
            },
            "aggs": {
                "3": {
                    "date_histogram": {
                        "field": "@timestamp",
                        "min_doc_count": "0",
                        "extended_bounds": {
                            "min": 1710050816992,
                            "max": 1710051116992
                        },
                        "format": "epoch_millis",
                        "fixed_interval": "1m"
                    },
                    "aggs": {}
                }
            }
        }
    }
}

Thanks. That's pretty simple; I'm surprised there's been a regression here. Can you simplify it any further, or is this a minimal example? E.g. what if you remove the inner agg? Or the extended_bounds parameter? Or replace the query_string filter with an equivalent term one?
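For that last suggestion, I'd expect something roughly like this to be equivalent, assuming log_info.keywords is a keyword field (if it's analyzed text, a match query would be the closer substitute) - i.e. replacing

{
    "query_string": {
        "analyze_wildcard": true,
        "query": "log_info.keywords:send_cmd"
    }
}

with

{
    "term": {
        "log_info.keywords": "send_cmd"
    }
}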

Once you've got this as small as possible, could you share a profile from both 8.10 (where it's fast) and 8.12 (where it's slow)?
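Getting a profile is just a matter of adding "profile": true at the top level of the request body - for example, on the simplified query you posted (same index and fields, untested on my side):

POST applogs-*/_search
{
    "profile": true,
    "size": 0,
    "query": {
        "bool": {
            "filter": [{
                    "range": {
                        "@timestamp": {
                            "gte": 1710050816992,
                            "lte": 1710051116992,
                            "format": "epoch_millis"
                        }
                    }
                }, {
                    "query_string": {
                        "analyze_wildcard": true,
                        "query": "log_info.keywords:send_cmd"
                    }
                }
            ]
        }
    },
    "aggs": {
        "2": {
            "terms": {
                "field": "log_info.source_machine_name.keyword",
                "size": 500,
                "order": {
                    "_key": "asc"
                },
                "min_doc_count": 0
            }
        }
    }
}

The "profile" section of the response breaks the time down per query and per aggregation on each shard, which should show where the extra work is going.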

I do not have an 8.10.2 cluster anymore.
All clusters have been upgraded to 8.12. I have an old 7.6 one that I can test.

But I believe that if you change the "query" and the "terms.field" you will be able to reproduce this on any index.

I did a similar test in Discover, and it works fine, but the generated query is different.

I will try to minimize it when I have some spare time.

OK, found the query problem.

   "aggs": {
        "2": {
            "terms": {
                "field": "log_info.source_machine_name.keyword",
                "size": 500,
                "order": {
                    "_key": "asc"
                },
         ==>>  "min_doc_count": 0
            },

When I change "min_doc_count": "0" to "min_doc_count": "1", it starts to work fine.

Hm interesting. I don't have a dataset to hand on which I can reproduce this. Could you share a profile with min_doc_count: 0 and again with 1?

The query with 0 fails, so there is no profiling output.
For me the problem is solved. I have a workaround.

With another index I have a similar problem: 6 sec vs 44 ms.
I am sure it can be reproduced with any large index.

When I look at the result, I see that with min_doc_count: 0 I get all the buckets with a 0 count, meaning I get many more buckets, and I believe that makes sense.
Why it opens so many search threads on 8.12.2 compared to 8.10.2 is unknown to me, and I cannot test it as I no longer have an 8.10.2 cluster.

I wonder if it would be possible to share the output of the nodes hot threads API while running the query with min_doc_count: 0. It should tell us where we are spending those CPU cycles.
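Something like this, captured a couple of times while the min_doc_count: 0 query is in flight, would do it - the parameters are just suggestions to widen the sample beyond the defaults:

GET /_nodes/hot_threads?type=cpu&threads=9999&ignore_idle_threads=false

That dumps the hottest stack traces per node, so we can see what the search and search_worker threads are actually busy doing.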