Slow query with aggregations

I have implemented a set of aggregations for search filters and am running into speed problems with slightly longer queries: queries of up to 4 words are fine, but anything longer becomes extremely slow.

I have used the profiling API, and it turns out that the aggregations are slowing everything down, taking roughly 24 s of the 25 s required for the search. Here is a link to the gist with the full profiling output.

As can be seen from the gist, I am using sampler aggregations; otherwise I get constant timeout errors, as searches take well over 60 seconds.
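For context, the shape of my requests is roughly the following (a simplified sketch, not my exact query; the shard_size value and the "sampled"/"top_actors" aggregation names are illustrative, field names are from my mapping):

```python
# Sketch of a search body using a sampler aggregation: on each shard only the
# top `shard_size` scoring documents feed the sub-aggregations, which keeps
# the aggregation cost bounded even when the query matches many docs.
search_body = {
    "query": {"match": {"description": "some user query"}},
    "aggs": {
        "sampled": {
            "sampler": {"shard_size": 300},
            "aggs": {
                "top_actors": {
                    "terms": {"field": "actors_keyword"}
                }
            },
        }
    },
}
```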

The mapping for my index is the following:

{
  "movies" : {
    "mappings" : {
      "properties" : {
        "description" : {
          "type" : "text"
        },
        "all_actors" : {
          "type" : "text"
        },
        "episode_title" : {
          "type" : "text"
        },
        "actors_keyword" : {
          "type" : "keyword",
          "ignore_above" : 1000
        },
        "series_title" : {
          "type" : "keyword",
          "ignore_above" : 1000
        },
        "language" : {
          "type" : "keyword",
          "ignore_above" : 1000
        },
        "number_of_actors" : {
          "type" : "short"
        },
        "translated_title" : {
          "type" : "text"
        },
        "subject_areas" : {
          "type" : "text"
        },
        "subject_areas_keyword" : {
          "type" : "keyword",
          "ignore_above" : 1000
        },
        "url" : {
          "type" : "text"
        },
        "year" : {
          "type" : "short",
          "null_value" : 0
        }
      }
    }
  }
}

Other data that might be helpful:

  • number of shards: 1
  • number of replicas: 0
  • number of documents in the index: approx. 80 million documents
  • size of the index: approx. 30 GB
  • fields such as 'actors_keyword' and 'subject_areas' are high-cardinality, and I am currently not using the 'eager_global_ordinals' setting

If I turn the aggregations off entirely, the search queries work well right up until longer queries with over 20 words or so.

The question is: how can I improve the current situation? I am a novice, so I might be missing even some apparently evident details.

If you look at the profile output, searches for common terms like "of" and "and" take most of the time (as you would expect) because they match very many docs.
You could add these to a stop-word list to remove them from the index when adding docs, or make the query stricter, requiring more of the search terms to match rather than accepting any doc containing the word "and" as a match.
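For example (a sketch only; the query text is made up and your field names may differ), making a match query stricter with minimum_should_match looks like this:

```python
# Stricter match query: instead of returning any doc containing any one term,
# require roughly two thirds of the query terms to be present.
strict_query = {
    "query": {
        "match": {
            "description": {
                "query": "history of the french language",
                "minimum_should_match": "65%",
            }
        }
    }
}
```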

Thanks.

Since adding stop words would require rebuilding the whole index, I've instead tried running the same query after removing the stop words from the search string in Python. I've also added a minimum_should_match of 65%.
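The client-side stop-word removal is just a simple pre-processing step, roughly like this (the stop list here is a tiny illustrative subset, not the one I actually use):

```python
# Minimal sketch of client-side stop-word stripping, applied before the
# query string is sent to Elasticsearch.
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "on", "to"}

def strip_stop_words(query: str) -> str:
    """Drop common words so they never reach the search request."""
    kept = [w for w in query.split() if w.lower() not in STOP_WORDS]
    return " ".join(kept)

print(strip_stop_words("history of the French language"))  # → history French language
```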

The queries with many words have indeed become much quicker; they are almost instantaneous. In contrast, the shorter queries are now rather slow, taking around 4 s.

Could you please take a look at another gist with the updated profiling for a shorter query to help me understand what could be improved yet?

P.S. Do you mind if I write up an answer to my own question on Stack Overflow crediting your post? I guess I'm not the only one with such questions.

The queries with many words have indeed become much quicker; they are almost instantaneous. In contrast, the shorter queries are now rather slow, taking around 4 s.

This is likely just because long queries with a 65% min_should_match are quite selective (matching few docs), whereas a search for one or two words is going to match many more docs.
More matching docs = more disk seeks to look up doc values used in aggregations = more time.
Do you have total match counts to compare? Are search response times linear with number of matching docs?

P.S. Do you mind if I write up an answer to my own question on Stack Overflow crediting your post? I guess I'm not the only one with such questions.

Sure. Thanks for sharing.

I don't know how to measure the total match count. I've tried removing the limits on the shard size, but I run out of memory and still can't get the match count.

From what I see, 'French language' and 'mammal ferrets' run at roughly the same speed (whereas I would expect the 'mammal ferrets' query to be quite a bit quicker).

Maybe this is where the aggregations are indeed too heavy due to the high cardinality, and I should set eager_global_ordinals to improve performance?

It should be returned in the JSON of the results under hits/total/value. This count may be capped at 10k unless:
a) you set track_total_hits to true, or
b) your request has aggregations, in which case all matches must be visited and the count is exact.
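As a sketch (the query here is just a placeholder), requesting the exact total looks like this:

```python
# Request body asking Elasticsearch to count every matching document
# instead of capping the reported total at 10,000.
counted_body = {
    "track_total_hits": True,
    "query": {"match": {"description": "ferrets"}},
}
# The exact count then appears in the response under hits.total.value.
```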

Maybe this is where the aggregations are indeed too heavy due to the high cardinality, and I should set eager_global_ordinals to improve performance?

You may also want to experiment with the map execution hint.
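For example (a sketch; the "languages" aggregation name is illustrative), the hint goes on the terms aggregation itself:

```python
# With execution_hint "map", the terms aggregation collects raw field values
# in a per-request hash map rather than relying on global ordinals, which can
# be cheaper when a query matches relatively few documents.
mapped_terms_agg = {
    "aggs": {
        "languages": {
            "terms": {
                "field": "language",
                "execution_hint": "map",
            }
        }
    }
}
```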

The total hit count for the last query is around 516,000 documents. Strangely enough, the execution time is now around 0.5 s, with all the aggregations and without any additional modification on my part.

I was also able to confirm that the problem is indeed with the high-cardinality fields. If I raise the shard_size parameter from 300 to 3,000 for those fields, the search becomes extremely slow, while a shard_size of 100,000 on the low-cardinality fields doesn't seem to affect the speed.
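For concreteness, the kind of asymmetry I'm describing looks like this (aggregation names and sizes are illustrative; field names are from my mapping):

```python
# shard_size on the high-cardinality field must stay small, while the
# low-cardinality field tolerates a huge shard_size with no visible cost.
tuned_aggs = {
    "actors": {  # high-cardinality: raising shard_size past ~300 is very slow
        "terms": {"field": "actors_keyword", "size": 10, "shard_size": 300}
    },
    "languages": {  # low-cardinality: a large shard_size is effectively free
        "terms": {"field": "language", "size": 10, "shard_size": 100000}
    },
}
```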

Thank you very much for your help.

UPD: Maybe (I'm not sure) the quicker speed for the last query is due to the fact that I doubled the JVM heap size.

Elasticsearch and file-system caches are likely behind this.