Slow terms aggregations

Hi,

I have a problem with my ES deployment. My queries are using a few terms aggregations at once, and I have extremely low performance of queries (~8s for one query).

Cluster:

  • Two machines (each with two CPU cores and 13GB of memory)
  • ES has 6 GB of memory available
  • SSD storage
  • ~9M documents

My aggregations are basically facets for products catalog, so I have high cardinality fields (brands, categories, etc).
I tried several configuration improvements (reducing number of shards, increasing memory, tuning cache) but I didn't manage to fix my issue.

When I perform a query with facets, both machines are using their CPUs at 100%. Memory usage and I/O load are low too.

My questions:

  • How to improve queries performance? I don't want to reduce number of used facets
  • Which approach is better for facetting - store unique IDs in documents or whole names (for example for categories)

Regards,
Artur

Just to check, are you using facets or aggs?

I'm using aggs.

Example aggregations from my query:

      "_filter_brand": {
     "aggs": {
        "brand": {
           "terms": {
              "field": "brand"
           }
        }
     },
     "filter": {
        "match_all": {}
     }
  },
  "_filter_categories": {
     "aggs": {
        "categories": {
           "terms": {
              "field": "category_ids",
              "size": 100
           }
        }
     },
     "filter": {
        "match_all": {}
     }
  }

Is there any way, to debug my queries? Explain returns only information why some results appeared as results.
I need information about how long it takes to for example fetch results from another node/shard etc.

You can use the slow log to see the times for each phase:

https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-slowlog.html#search-slow-log

It won't tell you why, but you can see how many shards were searched and how long it took.

My query has large result set, so always all shards are involved. I tried to reduce number of shards/replicas and now I have 4 shards without any replicas (2 shards per node). My query time reduced to 7s but it's still unacceptable time for a search query for me.
I tried to split up my query into two queries (one for aggregations with count API and another with standard query) but always first query is very slow.

Any ideas, what should I do to increase the performance?

It's hard to know what the problem could be without seeing your full query (if it's large please paste into a gist rather than pasting it here and provide a link to it). Also, I would try removing each aggregation one at a time and see if the performance improve to determine which aggregation(s) is causing the slow performance

@colings86 here is my query: https://gist.github.com/artursmet/4c4369cdcf40923fc3b2#file-slow_query-json

I tried to remove my aggregations one by one and I saw that my author_ids and category_ids aggregations are the slowest.
They are arrays of integers (ids from another database).
I tried to hash them at index time and there is a little improvement, but still it's not so fast.

As I saw, the queries are using mostly CPU (both nodes have 100% cpu used, when my query is running)

Why are you wrapping every aggregation in a filter aggregation which is a match_all? this is not going to make any difference to the result and only adds overhead to execution.

How many unique author_ids and category_ids are in the index?

This query was generated by elasicsearch-dsl-py library, but sure, it can be reduced.

As I mentioned before, I have a lot of unique author_ids and category_ids there.
Category ids (4200 unique), author_ids (over 2M).
How is the best practice for handling terms aggregations when field has many unique values?

4200 unique category ids is actually a low cardinality, it should execute very quickly. Since the query takes a long time to execute, could you try to use the nodes hot threads API to capture where elasticsearch is spending time while the query is running? Also it would be interesting to know if your fields are mapped as numerics or strings.

Even if I would be surprised if the filters explained the slowness, they might contribute a bit to it so it would be interesting to test without them.

I changed the query and removed filters part, but still there is no visible improvement.
Current query: https://gist.github.com/artursmet/273fa075d97711ea3edd
Categories and authors are mapped as integers:

"author_ids" : {
  "type" : "integer"
}

Brand field is mapped as not_analyzed string.

Hot threads looks interesting, I've never heard of it before. Here is the output from hot threads: https://gist.github.com/artursmet/c97170918f917dbe92f1 but I think that it won't tell me more than I know now. Aggregations are taking most of time.

Things like categories or authors, should be indexed as multi value field (string or numeric) or as nested field with some more data?

The hot threads look sane, I'm a bit puzzled when you are having such slow response times on 9M docs only. If you have some time for experimenting, you might want to try to enable doc values on those fields and/or to map them as strings instead of numerics (which have different execution paths for terms aggs).

@jpountz Thanks for reply.
I tried setting those fields for doc_values, one thing that I didn't check is mapping as strings instead of numeric. But I will check this path for sure.
I found a little work around - I split my query into two queries (facets for search_type=count endpoint and normal hits into standard _search. I also added timeout for my facet query, because my aggregation counters doesn't have to be super accurate.

FYI you should really call them aggs, not facets, otherwise you will confuse people :slight_smile: