Elasticsearch queue issue after upgrading from 8.6.2 to 8.12.1/8.12.2

We have been running Elasticsearch 8.6.2 for several months now with very few issues other than some query (un)optimizations. We run several clusters duplicated across two production regions, plus two non-prod environments that also have several clusters each.

Upon switching to 8.12.1 (and subsequently 8.12.2, because we saw changelog notes regarding connection issue fixes) in one of our production regions, we saw a large impact on incoming traffic. This did not happen in our dev environment, either of our stage regions, or even our one low-volume, single-region canary prod deployment. It seems volume is a key ingredient in exposing this issue.

It appears that in 8.10 the concept of a 'search_worker' thread pool was introduced, which, if I am understanding correctly, deviates from the longstanding behavior of one query per shard at a time and now allows multiple segments of a shard to be queried concurrently. I bring this up because the impact to traffic (timeouts, rejected transactions) aligns with a huge spike in search and search_worker thread counts, as well as queues for each of those pools, despite nothing in our logs indicating a spike in incoming traffic. The queues only seem to show up several hours after we deploy and are very difficult to force in lower-volume environments. They also seem to resolve themselves, but follow a rollercoaster pattern once they start. We have ample CPU headroom (peak usage was only 20% on the data nodes at the time), heap has headroom, lots of memory on the nodes, etc.

Has anyone come across this behavior? Any suggestions on what to do or troubleshoot?

One detail I am hesitant to share, since in the past people tend to check out at this point: we do have a custom plugin. This plugin filters the result set based on security permissions it retrieves from an API call that is highly performant and lightweight, and we have been using it for years at volumes of hundreds of thousands of queries at a time. I bring this up because we have had connection timeouts/issues there before that led to queues, but those would not resolve themselves and we would see error logs; this time we do not see any error logs and the queues resolve themselves. We stripped our changes down to 8.12.2 only - no other changes at all - to isolate the issue.

Any help is much appreciated

Same here, upgraded with ECK from 8.10.2 to 8.12.2.
All nodes with high CPU (100%) and a huge spike in search and search_worker.
No errors in the logs, only timeouts due to the high CPU.

Interesting that you came from 8.10.2; I believe the search_worker pool had already been introduced by then. Are you seeing queues in your search and search_worker thread pools on 8.12.2, or just high CPU? What did your CPU peak at before the upgrade?
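If it helps, the queues can be watched directly with the cat thread pool API - a rough sketch, with the column list just a suggestion:

GET _cat/thread_pool/search,search_worker?v&h=node_name,name,active,queue,rejected,completed

Polling that during an episode should show whether the queueing is cluster-wide or concentrated on a few nodes.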

Big jump. From ~0 on 8.10.2 to an average of 1k search_worker and 160 search.

It is a big cluster.
On a smaller one that I upgraded 2 weeks ago, I can see the history. It was also rising, but went down about an hour after the upgrade.

Got it. That sounds like our situation, except we spiked to a 1k search queue and a few-hundred search_worker queue. Ours happened on our 4th biggest (of 7) cluster; it receives a lot of query traffic and has exactly half the VM resources per node (which affects the fixed thread pool defaults) compared to our biggest 3 clusters, where we haven't seen the queues (nor on our smallest clusters). Very strange, and it puts us in a bad position of trying to rebuild from snapshots. We put one region back to 8.6.2 with a manual rebuild and have had no issues so far.
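For anyone comparing node sizes: the resolved size and queue_size of those fixed pools on each node can be pulled from the nodes info API - a quick sketch, where filter_path is only there to trim the response:

GET _nodes/thread_pool?filter_path=nodes.*.name,nodes.*.thread_pool.search,nodes.*.thread_pool.search_worker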

In our case it seems that a Grafana query directly against Elasticsearch caused the problem. After disabling the datasource, Elasticsearch CPU went down and there are no more queues.
We will investigate more tomorrow.

How did you find that? There are no direct Grafana calls in our use case, though we do use the Prometheus exporter.

We found it by starting to shut down all active clients one by one.
When I try to run the generated query in Kibana, it times out and CPU rises to 100%.
I believe you have a similar problem with a query that for some reason takes too long to run on 8.12.2.
For me it was running well on 8.10.2.

We will check with new Grafana, but this will take some time.

Were you at 100% utilization until you turned that off? Our queues resolved themselves within a few minutes (after impacting thousands of transactions) and would show back up again periodically. It is nearly impossible for us to pin the trigger down because hundreds of different types of queries are being run, and we can't just turn them off.

What was the query exactly?

That's true, but it's not obviously related - it's doing the same amount of work either way. Instead it sounds like there's been an unexpected regression in the performance of a certain kind of query. It'd be very useful if you could pin it down so we can open a bug report. I expect that enabling the query slow log should help, but eliminating clients one-by-one is a reliable approach too.
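For the slow log, something along these lines should do it - a minimal sketch, with a placeholder index pattern and thresholds you would want to tune for your own latency profile:

PUT /my-logs-*/_settings
{
    "index.search.slowlog.threshold.query.warn": "5s",
    "index.search.slowlog.threshold.query.info": "1s",
    "index.search.slowlog.threshold.fetch.warn": "1s"
}

Queries that cross those thresholds get written to the slow log along with their source, which should make the offending client much easier to spot.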

Assuming it is just one kind of query which has slowed down, you should be seeing timeouts or slowness in your clients too. That'd be a good way to find the trigger.

This is the query. I removed the private data.
It was running well on 8.10.2 but is not working on 8.12.2. I use ECK with 20 nodes and 10 CPUs per pod.
When I run it in Kibana, I get a timeout.

{
    "size": 0,
    "query": {
        "bool": {
            "filter": [{
                    "range": {
                        "@timestamp": {
                            "gte": 1710050816992,
                            "lte": 1710051116992,
                            "format": "epoch_millis"
                        }
                    }
                }, {
                    "query_string": {
                        "analyze_wildcard": true,
                        "query": "***"
                    }
                }
            ]
        }
    },
    "aggs": {
        "2": {
            "terms": {
                "field": "***.keyword",
                "size": 500,
                "order": {
                    "_key": "asc"
                },
                "min_doc_count": 0
            },
            "aggs": {
                "3": {
                    "date_histogram": {
                        "field": "@timestamp",
                        "min_doc_count": "0",
                        "extended_bounds": {
                            "min": 1710050816992,
                            "max": 1710051116992
                        },
                        "format": "epoch_millis",
                        "fixed_interval": "1m"
                    },
                    "aggs": {}
                }
            }
        }
    }
}

I think without seeing more detail here it's not going to be possible to pin this down.

It is not really relevant to this problem. Anyway, I modified the query to be simpler and it is still failing.

POST applogs-*/_search 
{
    "size": 0,
    "query": {
        "bool": {
            "filter": [{
                    "range": {
                        "@timestamp": {
                            "gte": 1710050816992,
                            "lte": 1710051116992,
                            "format": "epoch_millis"
                        }
                    }
                }, {
                    "query_string": {
                        "analyze_wildcard": true,
                        "query": "log_info.keywords:send_cmd"
                    }
                }
            ]
        }
    },
    "aggs": {
        "2": {
            "terms": {
                "field":"log_info.source_machine_name.keyword",
                "size": 500,
                "order": {
                    "_key": "asc"
                },
                "min_doc_count": 0
            },
            "aggs": {
                "3": {
                    "date_histogram": {
                        "field": "@timestamp",
                        "min_doc_count": "0",
                        "extended_bounds": {
                            "min": 1710050816992,
                            "max": 1710051116992
                        },
                        "format": "epoch_millis",
                        "fixed_interval": "1m"
                    },
                    "aggs": {}
                }
            }
        }
    }
}

Thanks. That's pretty simple; I'm surprised there's been a regression here. Can you simplify it any further, or is this a minimal example? E.g. what if you remove the inner agg? Or the extended_bounds parameter? Or replace the query_string filter with an equivalent term one?
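For that last suggestion, I'd expect something roughly like this to be equivalent, assuming log_info.keywords is a keyword field (if it's analyzed text, a match query would be the closer substitute) - i.e. replacing

{
    "query_string": {
        "analyze_wildcard": true,
        "query": "log_info.keywords:send_cmd"
    }
}

with

{
    "term": {
        "log_info.keywords": "send_cmd"
    }
}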

Once you've got this as small as possible, could you share a profile from both 8.10 (where it's fast) and 8.12 (where it's slow)?
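Getting a profile is just a matter of adding "profile": true at the top level of the request body - for example, on the simplified query you posted (same index and fields, untested on my side):

POST applogs-*/_search
{
    "profile": true,
    "size": 0,
    "query": {
        "bool": {
            "filter": [{
                    "range": {
                        "@timestamp": {
                            "gte": 1710050816992,
                            "lte": 1710051116992,
                            "format": "epoch_millis"
                        }
                    }
                }, {
                    "query_string": {
                        "analyze_wildcard": true,
                        "query": "log_info.keywords:send_cmd"
                    }
                }
            ]
        }
    },
    "aggs": {
        "2": {
            "terms": {
                "field": "log_info.source_machine_name.keyword",
                "size": 500,
                "order": {
                    "_key": "asc"
                },
                "min_doc_count": 0
            }
        }
    }
}

The "profile" section of the response breaks the time down per query and per aggregation on each shard, which should show where the extra work is going.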

I do not have an 8.10.2 cluster anymore.
All clusters have been upgraded to 8.12. I have an old 7.6 one that I can test.

But I believe that if you change the "query" and the "terms.field" you will be able to reproduce this on any index.

I did a similar test in Discover, and it works fine, but the generated query is different.

I will try to minimize it when I have some spare time.

OK, found the query problem.

   "aggs": {
        "2": {
            "terms": {
                "field": "log_info.source_machine_name.keyword",
                "size": 500,
                "order": {
                    "_key": "asc"
                },
         ==>>  "min_doc_count": 0
            },

When I change "min_doc_count": "0" to "min_doc_count": "1", it starts to work fine.

Hm interesting. I don't have a dataset to hand on which I can reproduce this. Could you share a profile with min_doc_count: 0 and again with 1?

The query with 0 fails, so there is no profiling output.
For me the problem is solved. I have a workaround.

With another index I have a similar problem: 6 sec vs 44 ms.
I am sure it can be reproduced with any large index.

When I look at the result, I see that with min_doc_count: 0 I get all the buckets with a 0 count, meaning I get many more buckets, and I believe that makes sense.
Why it opens so many search threads on 8.12.2 compared to 8.10.2 is unknown to me, and I cannot test it as I no longer have an 8.10.2 cluster.

I wonder if it would be possible to share the output of the nodes hot threads API while running the query with min_doc_count: 0. It should tell us where we are spending those CPU cycles.
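Something like this, captured a couple of times while the min_doc_count: 0 query is in flight, would do it - the parameters are just suggestions to widen the sample beyond the defaults:

GET /_nodes/hot_threads?type=cpu&threads=9999&ignore_idle_threads=false

That dumps the hottest stack traces per node, so we can see what the search and search_worker threads are actually busy doing.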