Hi,
I am trying to run two levels of aggregation on my documents. My documents are structured roughly like this:
{
  "country": "france",
  "city": "lyon",
  "address": "1 Rue de la Republique"
}
I want to group my search results first by country, then by city. I want the top 5 countries (ordered by score); for each country, the top 4 cities (again ordered by score); and for each city, the single best result. I have planned the request as:
terms aggregation by country (size: 5) > terms aggregation by city (size: 4) > top_hits (size: 1)
I have structured my query like so:
{
  "query": "... (query based on user input) ...",
  "aggs": {
    "by_country": {
      "terms": {
        "field": "country",
        "size": 5,
        "order": {
          "max_score": "desc"
        }
      },
      "aggs": {
        "max_score": { "max": { "script": "_score" } },
        "by_city": {
          "terms": {
            "field": "city",
            "size": 4,
            "order": {
              "max_score": "desc"
            }
          },
          "aggs": {
            "max_score": { "max": { "script": "_score" } },
            "best_result": {
              "top_hits": {
                "_source": { "includes": [ "... some fields ..." ] },
                "highlight": { "... highlight ..." },
                "size": 1,
                "track_scores": true
              }
            }
          }
        }
      }
    }
  }
}
This query does what I want, but performance takes a huge hit when I nest the 3 levels (terms > terms > top_hits).
I have tried several variations (terms > top_hits, terms > terms), and I've tried putting the top_hits alongside either of the terms aggregations. All of these perform fine, but as soon as I nest 3 levels, the query is 3 to 4 times slower.
What I don't understand is this: because of my sizes, at most 20 documents in total (5 * 4) should end up in the top_hits aggregation, so why does it take so long?
Removing the top_hits altogether brings performance back to normal, which suggests that the 2-level terms aggregation is not the problem by itself. It's only when I ask for the top_hits of the 2nd-level aggregation that it falls over.
I've tried:
- Removing the highlight, which improved things (the query was only 2 to 3 times slower).
- Removing the _source from top_hits, which did nothing.
- Setting store: true on the "country" and "city" fields, which did nothing.
- Using collect_mode: breadth_first, which did nothing. That makes sense, because I don't have many buckets but could have lots of documents.
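For reference, here is a trimmed sketch of where that last option goes. It assumes collect_mode is set on the outer by_country aggregation (it could equally be set on by_city); the by_city and top_hits levels are omitted for brevity, and collect_mode is the only addition compared to the query above:

```json
{
  "aggs": {
    "by_country": {
      "terms": {
        "field": "country",
        "size": 5,
        "collect_mode": "breadth_first",
        "order": {
          "max_score": "desc"
        }
      },
      "aggs": {
        "max_score": { "max": { "script": "_score" } }
      }
    }
  }
}
```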
Is there possibly a bug, or something I'm not understanding, in the top_hits algorithm that makes it retrieve more documents than it's supposed to? Or is something else slowing it down?
If someone more familiar with the code can point me in the right direction, I'm happy to take a look too.
I'm on Elasticsearch 5.5.
Thanks.