Nested aggregation slows query from ~400ms to 10s

Hello everyone,

I created a simple query which looks for specific 5 car ID's and aggregates upon them. It's extremely fast, takes about 300-500ms. But if I add a nested aggregation which looks only for a specific colour in those 5 ID's, it suddenly takes 10 seconds or longer. This doesn't make sense to me, anyone got an explanation? Sample query down below:

{
    "stored_fields": [
        ...
    ],
    "query": {
        "constant_score": {
            "filter": {
                "bool": {
                    "must": [
                        {
                            "term": {
                                "typeId": 8
                            }
                        },
                        {
                            "terms": {
                                "carId": [
                                    2796963847,
                                    2550869,
                                    152794103,
                                    7678419,
                                    338
                                ]
                            }
                        }
                    ]
                }
            }
        }
    },
    "aggs": {
        "cars": {
            "terms": {
                "field": "carId"
            },
            "size": 25,
            "aggs": {
                "colors": {
                    "terms": {
                        "field": "colourId",
                        "include": [
                            534278
                        ]
                    }
                }
            }
        }
    },
    "size": 0
}

300-500ms for a simple unscored filter is an order of magnitude slower than I would expect. When you do the query without the aggs, how many hits does it report? If it's hundreds of millions across many shards, well, that could take some time to aggregate over.

I haven't used the Profile API much, but maybe it can shed some light on it.

Mostly varies, can be lower at times. Without the aggs, it has 78 hits, but the database is huge.

Used Search Profiler in Kibana already, it looks something like this:

PS: The response is pre-cached already, so it's down to 550ms.
carId=keywordId

I can't fathom how aggregating over 78 hits would take 10 seconds.

I'd be curious to see what the underlying OS's are doing while this query is laboring for 10 seconds. Specifically, is there I/O wait with low CPU?

Just guessing here as I don't know what version you are running, what the mappings look like, how many data nodes there are, how many shards are in the index, how big the largest shards are, how big your JVM heap is, what kinds of storage you are using, if you have changed the default configuration in any way, etc.

Try the terms agg with an ‘execution_hint’ of ‘map’

Yeah, me neither, extremely odd. Any other query so far, even nested aggregations, has been supremely fast, even when aggregating on all the data. Yet, on only 78 hits all hell breaks lose lol.

No I/O wait with low CPU, Running v6.3.2.
All fields are just keyword type, except typeId which is long.
The cluster is huge, not sure how many data nodes or shards there are, but I'm pretty confident they aren't the problem :grin:

Thanks for the hint!

Just tried combinations on both term aggregations, unfortunately speed stays the same (or even slows down in some cases a little bit).

Crazy docs with millions of colour terms?

If you run the 10s agg query several times in a row, does it always take 10s?

Also, how confident are you that none of your nodes are swapping?

One colour per doc, but milions of docs :smiley:

Nope, it reduces the speed down to a few hundred ms.
Pretty confident there isn't much problems with the architecture.

Noticed that the slowdown is caused by 'typeId: 8' field because it's an integer.
ES somehow does two operations: compares a range of numbers >8 and <8 instead of just =8.
Guess I should convert the field to a keyword type.. or are there any other ideas without the need for reindexing?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.