Elasticsearch: Slow query with min_doc_count=0 on field aggregation

For more background, see the previous topic on this issue: Elasticsearch queue issue after upgrading from 8.6.2 to 8.12.1/8.12.2 - #2 by Amos66

Specifically the replies by Amos66.

As described in the linked post, the query issued to Elasticsearch by Grafana contains min_doc_count = 0 on the terms aggregation over log levels.

It appears that since a recent version of Elasticsearch, this query has become excruciatingly slow and times out most of the time.

A potential fix has been implemented: Disable parallel collection for terms aggregation with min_doc_count equals to 0 by iverase · Pull Request #106156 · elastic/elasticsearch · GitHub

However, even after upgrading to Elasticsearch 8.13.4, which should include the fix, the query is still as slow as before.

I'm unsure where to start debugging this. If we change min_doc_count to 1, the query succeeds within a second instead of taking 30 seconds.

For reference, here is the query:

{
    "size": 0,
    "query": {
        "bool": {
            "filter": [{
                    "range": {
                        "@timestamp": {
                            "gte": 1710050816992,
                            "lte": 1710051116992,
                            "format": "epoch_millis"
                        }
                    }
                }, {
                    "query_string": {
                        "analyze_wildcard": true,
                        "query": "***"
                    }
                }
            ]
        }
    },
    "aggs": {
        "2": {
            "terms": {
                "field": "***.keyword",
                "size": 500,
                "order": {
                    "_key": "asc"
                },
                "min_doc_count": 0
            },
            "aggs": {
                "3": {
                    "date_histogram": {
                        "field": "@timestamp",
                        "min_doc_count": "0",
                        "extended_bounds": {
                            "min": 1710050816992,
                            "max": 1710051116992
                        },
                        "format": "epoch_millis",
                        "fixed_interval": "1m"
                    },
                    "aggs": {}
                }
            }
        }
    }
}
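To compare the two variants like for like, the same body can be sent twice with only min_doc_count changed. Below is a minimal Python sketch of that; the helper name with_min_doc_count is made up for illustration, and "level.keyword" is a placeholder for the redacted field name above.

```python
import copy

def with_min_doc_count(body, agg_name, value):
    """Return a copy of a search body with min_doc_count changed on one terms aggregation."""
    variant = copy.deepcopy(body)  # leave the original body untouched
    variant["aggs"][agg_name]["terms"]["min_doc_count"] = value
    return variant

# Trimmed-down version of the query above; the field name is a placeholder.
query = {
    "size": 0,
    "aggs": {
        "2": {
            "terms": {
                "field": "level.keyword",
                "size": 500,
                "order": {"_key": "asc"},
                "min_doc_count": 0,
            }
        }
    },
}

slow_variant = query                              # min_doc_count = 0
fast_variant = with_min_doc_count(query, "2", 1)  # min_doc_count = 1
```

Sending both bodies to the same _search endpoint and comparing the "took" value in each response should show whether min_doc_count alone accounts for the difference.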

Could you share the hot threads while executing the query?

Of course! See: Hot threads for ES query performed by Grafana · GitHub

What I noticed is that logging-09 and logging-05 are the nodes with 100% CPU usage (before 8.13.4, all nodes would end up at 100% CPU).

Logging-09 and logging-05 are cold tier nodes.

The hot threads you shared refer to a terms aggregation, but the query you shared above refers to a date_histogram aggregation. Are you experiencing a slowdown in both cases?

I see the terms aggregation now

Just to set expectations:
Is this a case where the query used to be much faster than it is now, or has it always been slow?

There have been no changes in this area, so I am not expecting a sudden slowdown.

It used to be much faster; the slowdown happened after the upgrade from 7.17.7 to 8.12.2.

We since upgraded to 8.13.4 due to the changes mentioned in the PR.

If you do not expect a sudden slowdown, it could also be our cluster configuration (especially on the cold tier nodes, which seem to suffer most during this query). I'm not experienced enough in Elasticsearch to know whether the slowdown (200ms to 30s) is to be expected when min_doc_count = 1 changes to min_doc_count = 0.

For reference, it's a cluster of around 30TB with 20,000,000,000 documents, most of them stored on the cold tier nodes.

The difference in performance between min_doc_count = 1 and min_doc_count = 0 can be huge.

In the first case, we only need to collect values from the documents matching the query in order to build the aggregation.

In the second case, we visit all documents in order to collect all terms in the index, so that we can also report the terms that didn't match the query. I think it is possible to improve this situation in some cases.
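As a rough illustration of the two cases, here is a toy model in Python. It is not how Elasticsearch is implemented; the document list and function names are made up for the example.

```python
# Toy "index": each document has a log level and a flag saying whether it
# matches the query (for example, falls inside the time range).
docs = [
    {"level": "INFO", "matches": True},
    {"level": "INFO", "matches": True},
    {"level": "WARN", "matches": True},
    {"level": "ERROR", "matches": False},
    {"level": "DEBUG", "matches": False},
]

def terms_min_doc_count_1(docs):
    """Only documents matching the query contribute buckets."""
    counts = {}
    for doc in docs:
        if doc["matches"]:
            counts[doc["level"]] = counts.get(doc["level"], 0) + 1
    return dict(sorted(counts.items()))

def terms_min_doc_count_0(docs):
    """Every document is visited so that non-matching terms get a zero bucket."""
    counts = terms_min_doc_count_1(docs)
    for doc in docs:  # this loop touches the whole index, not just the matches
        counts.setdefault(doc["level"], 0)
    return dict(sorted(counts.items()))

print(terms_min_doc_count_1(docs))
print(terms_min_doc_count_0(docs))
```

In the toy model the second function does work proportional to the total number of documents, regardless of how few match the query, which is the cost described above.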

So for what you are reporting, my first suspicion is that your cold tier has become slower at collecting those documents since the upgrade, but I haven't found any change that could cause such a slowdown.

I opened Speed up collecting zero document string terms by iverase · Pull Request #110922 · elastic/elasticsearch · GitHub. I believe this should speed up your use case, as it will use ordinals to collect zero-count terms instead of visiting each document.
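To give an idea of why ordinals help, here is a simplified Python sketch. It is only an illustration of the idea and not the code in the pull request; the function name and data are made up. It assumes the distinct values of the field are available as a sorted dictionary in which each term's position is its ordinal.

```python
def buckets_from_ordinals(term_dictionary, matched_counts):
    """Build all buckets, including zero-count ones, from the term dictionary.

    term_dictionary: sorted list of distinct terms; the list index is the ordinal.
    matched_counts:  ordinal -> number of matching documents (matches only).
    """
    return {
        term: matched_counts.get(ordinal, 0)
        for ordinal, term in enumerate(term_dictionary)
    }

# Four distinct log levels, no matter how many documents the index holds.
term_dictionary = ["DEBUG", "ERROR", "INFO", "WARN"]
# Counts gathered from the documents that matched the query.
matched_counts = {2: 2, 3: 1}

print(buckets_from_ordinals(term_dictionary, matched_counts))
```

The work here is proportional to the number of distinct terms rather than the number of documents, which matters when a field has only a handful of values but the index holds billions of documents.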

Ah, really nice! Looking forward to seeing the changes in performance.

Thanks a lot! If the topic is still open by the time we upgrade - I'll report back to let you know the difference in query speed.