"top_hits" performance inside 2 levels of "terms" aggregations

emmerich · July 7, 2017, 12:40pm

Hi,

I am currently trying to do 2 levels of aggregations on my documents. You can imagine that my documents are structured as such:

{
    "country": "france",
    "city": "lyon",
    "address": "1 Rue de la Republique"
}

I want to group my search results first by country, then by city. I want the top 5 countries (ordered by score); within each country I want the top 4 cities (again ordered by score); and for each city I want only the best-scoring result. The request I have planned is:

terms aggregation by country (size: 5) > terms aggregation by city (size: 4) > top_hits (size: 1)
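
To make the intended result concrete, here is a minimal Python sketch of the same grouping done in memory on a list of scored hits. It only illustrates the semantics described above; it is not how Elasticsearch executes the aggregation (which works per shard and can be approximate). The function name `group_best_hits` and the `score` key are assumptions made for this illustration.

```python
from collections import defaultdict


def group_best_hits(hits, n_countries=5, n_cities=4):
    """Group scored hits by country, then city (hypothetical helper).

    hits: iterable of dicts with "country", "city" and "score" keys.
    Returns a list of (country, [best hit per city, ...]) pairs.
    """
    # Keep only the best-scoring hit for each (country, city) pair,
    # which mirrors top_hits with size 1.
    best = {}
    for hit in hits:
        key = (hit["country"], hit["city"])
        if key not in best or hit["score"] > best[key]["score"]:
            best[key] = hit

    # Collect the per-city winners under their country.
    by_country = defaultdict(list)
    for (country, _city), hit in best.items():
        by_country[country].append(hit)

    # Within each country, keep the n_cities best cities by score.
    result = []
    for country, city_hits in by_country.items():
        city_hits.sort(key=lambda h: h["score"], reverse=True)
        result.append((country, city_hits[:n_cities]))

    # A country's score is its best city's score; keep the top n_countries.
    result.sort(key=lambda item: item[1][0]["score"], reverse=True)
    return result[:n_countries]
```

With the default sizes this returns at most 5 * 4 = 20 hits in total, which is the upper bound the plan relies on.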

I have structured my query like so:

{
    "query": "... (query based on user input) ...",
    "aggs": {
        "by_country": {
            "terms": {
                "field": "country",
                "size": 5,
                "order": {
                    "max_score": "desc"
                }
            },
            "aggs": {
                "max_score": { "max": { "script": "_score" } },
                "by_city": {
                    "terms": {
                        "field": "city",
                        "size": 4,
                        "order": {
                            "max_score": "desc"
                        }
                    },
                    "aggs": {
                        "max_score": { "max": { "script": "_score" } },
                        "best_result": {
                            "top_hits": {
                                "_source": { "includes": [ "... some fields ..." ] },
                                "highlight": { "... highlight ..." },
                                "size": 1,
                                "track_scores": true
                            }
                        }
                    }
                }
            }
        }
    }
}

This query does what I want, but the performance takes a huge hit when I have the 3 levels of terms > terms > top_hits.

I have tried several variations (terms > top_hits, terms > terms). I've also tried putting the top_hits alongside either of the terms aggregations. All of these work fine, but as soon as I nest 3 levels, the query is 3 to 4 times slower.

What I don't understand is why it takes so long: given my sizes, at most 20 documents in total (5 * 4) should reach the top_hits aggregation.

Removing the top_hits altogether restores good performance, which suggests that the 2-level terms aggregation is not the problem in itself; the slowdown appears only when I ask for the top_hits of the 2nd-level aggregation.

I've tried:

Removing the highlight, which improved things (the query was then only 2 to 3 times slower).
Removing the _source from top_hits, which did nothing.
Using store: true fields for "country" and "city", which did nothing.
Using collect_mode: breadth_first, which did nothing. That makes sense, because I don't have many buckets but could have lots of documents.
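
For anyone wanting to reproduce these variations, the following is a rough Python sketch that builds the aggregation body with the pieces toggled on or off. The helper name `build_aggs` and its parameters are hypothetical. `collect_mode` is the terms aggregation option, whose documented values are `depth_first` and `breadth_first`. The highlight definition is passed in by the caller because the original post elides it.

```python
import json


def build_aggs(collect_mode=None, with_top_hits=True, highlight=None):
    """Build the country > city > top_hits aggregation body (hypothetical helper)."""

    def max_score():
        # Fresh dict per call so the two levels do not share one object.
        return {"max": {"script": "_score"}}

    def terms(field, size):
        body = {"field": field, "size": size, "order": {"max_score": "desc"}}
        if collect_mode is not None:
            body["collect_mode"] = collect_mode
        return body

    city_aggs = {"max_score": max_score()}
    if with_top_hits:
        top_hits = {"size": 1, "track_scores": True}
        if highlight is not None:
            top_hits["highlight"] = highlight
        city_aggs["best_result"] = {"top_hits": top_hits}

    return {
        "by_country": {
            "terms": terms("country", 5),
            "aggs": {
                "max_score": max_score(),
                "by_city": {"terms": terms("city", 4), "aggs": city_aggs},
            },
        }
    }


if __name__ == "__main__":
    # Variant: explicit breadth_first collection, no highlight.
    print(json.dumps({"aggs": build_aggs(collect_mode="breadth_first")}, indent=2))
```

Generating each variant from one function keeps the compared requests identical apart from the option under test.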

Is there possibly a bug, or something I'm not understanding, in the top_hits algorithm that makes it retrieve more documents than it's supposed to? Or is something else slowing it down?

If someone more familiar with the code wants to point me in the right direction I'm happy to take a look too.

I'm on Elasticsearch 5.5.

Thanks.

jimczi · July 7, 2017, 12:58pm

Are you saying that removing highlighting made the query 2 to 3 times faster? If so, what is the average response time for your query? I am asking because if highlighting 5*4 documents takes most of the time, then the aggregation is not the problem here. What is the size of your documents? Are you trying to return huge documents from ES?
The breadth_first mode is picked automatically in 5.x for aggregations with nested terms aggregations like yours. This is the most efficient way to execute such aggregation trees, so I don't see how you could optimize this further.

emmerich · July 7, 2017, 1:19pm

Yes, removing highlighting did improve things, but not to the speed of removing one of the three levels (terms1, terms2 or top_hits).

To give an idea of speed (200ms is what we had before I started nesting terms):
terms1 > top_hits: ~200ms
terms2 > top_hits: ~200ms
terms1 > terms2: ~300ms
terms1 > terms2 > top_hits: ~1300ms
terms1 > terms2 > top_hits (no highlight): ~700ms
terms1 > terms2 > top_hits (no _source): ~1300ms
terms1 > terms2 > top_hits (no _source, no highlight): ~700ms
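
As a quick sanity check on these figures, the short Python sketch below works out the slowdown relative to the terms1 > terms2 timing and splits the extra time into a highlight part and a remaining top_hits part. The numbers are the approximate timings listed above, and attributing the differences to single components is an assumption, since the costs may not be strictly additive.

```python
# Approximate timings in milliseconds, copied from the list above.
timings_ms = {
    "terms1 > terms2": 300,
    "terms1 > terms2 > top_hits": 1300,
    "terms1 > terms2 > top_hits (no highlight)": 700,
}

baseline = timings_ms["terms1 > terms2"]
full = timings_ms["terms1 > terms2 > top_hits"]
no_highlight = timings_ms["terms1 > terms2 > top_hits (no highlight)"]

# Slowdown factors relative to the two-level terms aggregation.
slowdown_full = full / baseline
slowdown_no_highlight = no_highlight / baseline

# Rough attribution of the extra time (assumes the costs add up linearly).
highlight_cost_ms = full - no_highlight
top_hits_cost_ms = no_highlight - baseline

print(f"with highlight:    {slowdown_full:.1f}x slower than terms1 > terms2")
print(f"without highlight: {slowdown_no_highlight:.1f}x slower than terms1 > terms2")
print(f"highlight accounts for ~{highlight_cost_ms}ms, top_hits alone for ~{top_hits_cost_ms}ms")
```

On these numbers, highlighting accounts for roughly 600ms of the extra 1000ms, and adding top_hits without highlighting still costs about 400ms on top of the two terms levels.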

The documents in Elasticsearch are not huge, and I have a _source filter applied in the top_hits aggregation to reduce the fields I'm getting back. Unfiltered, a document is about 4KB on disk in JSON format.

I have also tried using "field collapsing", detailed here: https://www.elastic.co/guide/en/elasticsearch/reference/5.x/search-request-collapse.html, to replace one of the terms aggs. But because the result of field collapsing is ignored by aggregations, it wasn't taken into account for the 2nd level of grouping. Nested field collapsing isn't supported, so I couldn't move all of the grouping into collapsing either.

emmerich · July 7, 2017, 1:25pm

I should note that highlighting is something that we've had in many previous versions without performance issues. The only change between the old version and this one is the added level of terms aggregation. So although removing highlighting might give benefits, I feel like it's not the root of the problem.

emmerich · July 7, 2017, 1:30pm

The example I gave in the first post was simplified to identify the problem. Here is my actual aggs request:

I have the following fields:

document_name.keyword => type: keyword, store: true (used for 1st terms aggregation)
path => type: keyword, store: true (used for 2nd terms aggregation)
textContents => type: text (used for highlight)

"aggs": {
      "by_document": {
        "terms": {
          "field": "document_name.keyword",
          "size": 5,
          "shard_size": 50,
          "order": {
            "max_score": "desc"
          }
        },
        "aggs": {
          "max_score": {
            "max": {
              "script": "_score"
            }
          },
          "by_menu": {
            "terms": {
              "field": "path",
              "size": 4,
              "order": {
                "max_score": "desc"
              }
            },
            "aggs": {
              "max_score": {
                "max": {
                  "script": "_score"
                }
              },
              "hits": {
                "top_hits": {
                  "_source": true,
                  "highlight": {
                    "order": "score",
                    "fields": {
                      "textContents": {
                        "fragment_size": 160,
                        "number_of_fragments": 1,
                        "no_match_size": 160
                      }
                    }
                  },
                  "size": 1,
                  "track_scores": true,
                  "explain": false
                }
              }
            }
          }
        }
      }
}

emmerich · July 7, 2017, 1:57pm

Here's an example output (content information omitted with ...):

https://gist.github.com/emmerich/9ccda84976dc25f5eb782743692bafce

example_output.json

{
  "took": 1896,
  "timed_out": false,
  "_shards": {
    "total": 12,
    "successful": 12,
    "failed": 0
  },
  "hits": {
    "total": 7190,

(Output truncated here; the full file is in the gist linked above.)

system · August 4, 2017, 1:57pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
