Elasticsearch: Slow query with min_doc_count=0 on field aggregation

fherenius · July 15, 2024, 2:22pm

For more background, see the previous topic on this issue: Elasticsearch queue issue after upgrading from 8.6.2 to 8.12.1/8.12.2 - #2 by Amos66

Specifically the replies by Amos66.

As described in the linked post, the query issued to Elasticsearch by Grafana contains min_doc_count = 0 on the terms aggregation over log levels.

It appears that since a recent version of Elasticsearch, this query has become excruciatingly slow and times out most of the time.

A potential fix has been implemented: Disable parallel collection for terms aggregation with min_doc_count equals to 0 by iverase · Pull Request #106156 · elastic/elasticsearch · GitHub

However, even after upgrading to Elasticsearch 8.13.4, which should include the fix, the query is still as slow as before.

I'm unsure where to start debugging this. If we change min_doc_count to 1, the query succeeds within a second instead of taking 30 seconds.

For reference, here is the query:

{
    "size": 0,
    "query": {
        "bool": {
            "filter": [{
                    "range": {
                        "@timestamp": {
                            "gte": 1710050816992,
                            "lte": 1710051116992,
                            "format": "epoch_millis"
                        }
                    }
                }, {
                    "query_string": {
                        "analyze_wildcard": true,
                        "query": "***"
                    }
                }
            ]
        }
    },
    "aggs": {
        "2": {
            "terms": {
                "field": "***.keyword",
                "size": 500,
                "order": {
                    "_key": "asc"
                },
                "min_doc_count": 0
            },
            "aggs": {
                "3": {
                    "date_histogram": {
                        "field": "@timestamp",
                        "min_doc_count": "0",
                        "extended_bounds": {
                            "min": 1710050816992,
                            "max": 1710051116992
                        },
                        "format": "epoch_millis",
                        "fixed_interval": "1m"
                    },
                    "aggs": {}
                }
            }
        }
    }
}

Ignacio_Vera · July 16, 2024, 7:11am

Could you share the hot threads while executing the query?
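For readers unsure how to capture these: hot threads come from the nodes hot threads API (`GET /_nodes/hot_threads`). Below is a minimal Python sketch that builds the request URL and fetches the plain-text report. The helper names and the base URL are hypothetical, and it assumes a cluster reachable without authentication; a secured cluster would need credentials added to the request.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def hot_threads_url(base_url, node=None, threads=3, interval="500ms", kind="cpu"):
    """Build the URL for the nodes hot threads API.

    base_url is an assumption (e.g. "http://localhost:9200"); node optionally
    restricts the report to one node name or id.
    """
    path = f"/_nodes/{node}/hot_threads" if node else "/_nodes/hot_threads"
    query = urlencode({"threads": threads, "interval": interval, "type": kind})
    return f"{base_url.rstrip('/')}{path}?{query}"


def fetch_hot_threads(base_url, **kwargs):
    """Fetch the hot threads report as text (no authentication handled here)."""
    with urlopen(hot_threads_url(base_url, **kwargs), timeout=30) as resp:
        return resp.read().decode("utf-8")
```

Calling `fetch_hot_threads("http://localhost:9200")` while the slow query is running should capture the threads that are busy with it; running it a few times in a row makes it easier to see which stack traces keep recurring.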

fherenius · July 16, 2024, 9:03am

Of course! See: Hot threads for ES query performed by Grafana · GitHub

What I noticed is that logging-09 and logging-05 are the nodes at 100% CPU usage (before 8.13.4, all nodes would end up at 100% CPU).

Both logging-09 and logging-05 are cold tier nodes.

Ignacio_Vera · July 16, 2024, 10:14am

~~The hot threads you shared refer to a terms aggregation, but the query you shared above refers to a date_histogram aggregation. Are you experiencing a slowdown in both cases?~~

I see the terms aggregation now.

Ignacio_Vera · July 16, 2024, 10:49am

Just to set expectations: did the query use to be much faster than it is now, or has it always been slow?

There have been no changes in this area, so I am not expecting a sudden slowdown.

fherenius · July 16, 2024, 11:22am

It used to be much faster; the slowdown happened after the upgrade from 7.17.7 to 8.12.2.

We have since upgraded to 8.13.4 because of the changes mentioned in the PR.

If you do not expect a sudden slowdown, the cause could also be our cluster configuration (especially on the cold tier nodes, which seem to suffer most during this query). I'm not experienced enough with Elasticsearch to know whether a slowdown from 200 ms to 30 s is to be expected when min_doc_count = 1 changes to min_doc_count = 0.

For reference, it's a cluster of around 30 TB with roughly 20 billion documents, most of them stored on the cold tier nodes.

Ignacio_Vera · July 16, 2024, 11:42am

The difference in performance between min_doc_count = 1 and min_doc_count = 0 can be huge.

In the first case, we only need to collect values from the documents matching the query in order to build the aggregation.

In the second case, we visit all documents in order to collect all terms in the index, so that we can also report the terms that didn't match the query. I think it is possible to improve this situation in some cases.

So, from what you are reporting, my first suspicion is that your cold tier has become slower at collecting those documents since the upgrade, but I can't find any suspects that would cause that slowdown.
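To make the difference between the two settings concrete, here is a toy Python model of a terms aggregation. It is only a sketch of the behaviour described above, not Elasticsearch code: the documents, the `terms_agg` helper and the `visited` counter are all made up for illustration.

```python
from collections import Counter

# Hypothetical mini "index": a timestamp and a log level per document.
DOCS = [
    {"ts": 1, "level": "info"},
    {"ts": 2, "level": "error"},
    {"ts": 3, "level": "info"},
    {"ts": 50, "level": "debug"},  # outside the queried time range
    {"ts": 60, "level": "warn"},   # outside the queried time range
]


def terms_agg(docs, matches, min_doc_count):
    """Return (buckets, visited) for a toy terms aggregation on 'level'.

    'visited' counts how many document values the aggregation had to read.
    """
    # A real engine finds matching documents through the index; a list
    # comprehension stands in for that here.
    matching = [doc for doc in docs if matches(doc)]

    counts = Counter()
    visited = 0
    for doc in matching:
        visited += 1
        counts[doc["level"]] += 1

    if min_doc_count == 0:
        # Naive zero-bucket pass: read every document to discover terms
        # that exist in the index but did not match the query.
        for doc in docs:
            visited += 1
            counts.setdefault(doc["level"], 0)

    buckets = {term: n for term, n in sorted(counts.items()) if n >= min_doc_count}
    return buckets, visited


def in_range(doc):
    return doc["ts"] <= 10


buckets_1, visited_1 = terms_agg(DOCS, in_range, min_doc_count=1)
buckets_0, visited_0 = terms_agg(DOCS, in_range, min_doc_count=0)
```

With `min_doc_count=1` only the three matching documents are read. With `min_doc_count=0` the extra pass reads all five, and on a cluster with billions of documents that extra pass is where the time goes.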

Ignacio_Vera · July 16, 2024, 1:14pm

I opened Speed up collecting zero document string terms by iverase · Pull Request #110922 · elastic/elasticsearch · GitHub. I believe this should speed up your use case, as it will use ordinals to collect zero-count terms instead of visiting each document.
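As a rough illustration of the idea (not the code from the PR), the sketch below models a segment as a sorted term dictionary plus one ordinal per document. All names and data are hypothetical. Zero-count terms are found by walking the term dictionary once, rather than by reading every document.

```python
# Hypothetical segment layout: ordinal -> term, and doc id -> ordinal.
TERM_DICT = ["debug", "error", "info", "warn"]
DOC_ORDS = [2, 1, 2, 0, 3]
MATCHING_DOC_IDS = [0, 1, 2]  # documents that matched the query


def terms_agg_with_ordinals(term_dict, doc_ords, matching_doc_ids):
    """Return (buckets, steps) for a toy min_doc_count=0 terms aggregation.

    'steps' counts matching documents read plus term dictionary entries walked.
    """
    counts = [0] * len(term_dict)
    steps = 0

    # Count only the documents that matched the query.
    for doc_id in matching_doc_ids:
        steps += 1
        counts[doc_ords[doc_id]] += 1

    # Zero-count terms come from the term dictionary, one step per
    # distinct term, independent of the number of documents.
    buckets = {}
    for ordinal, term in enumerate(term_dict):
        steps += 1
        buckets[term] = counts[ordinal]

    return buckets, steps


buckets, steps = terms_agg_with_ordinals(TERM_DICT, DOC_ORDS, MATCHING_DOC_IDS)
```

In this model the extra work for zero buckets scales with the number of distinct terms (a handful of log levels) rather than with the number of documents in the index.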

fherenius · July 16, 2024, 1:43pm

Ah, really nice! Looking forward to seeing the changes in performance.

Thanks a lot! If the topic is still open by the time we upgrade, I'll report back to let you know the difference in query speed.
