Hints to improve performance for numerous aggregations with high cardinalities

Searching with numerous aggregations, each with high cardinalities, can be very slow (10-15 s).
Kibana dashboards built on those same aggregations are slow too (over 10 s), but they contain multiple visualizations.
Could you help us by providing some hints to improve the situation?

I know that this is an old, much-debated problem, and all the posts I could find are dated. We are using Elasticsearch 6.2 and I think an updated answer would be useful.

A field cache seems to be involved, since the first query takes far longer (10-20 s) than the following ones (which then vary unpredictably between roughly 7 s and 3 s).
We tried different methods to warm up the index, without much success.

We are editors of an annotator that performs named entity recognition.
These named entities are turned into fields.
These fields provide numerous aggregations/facets, each with high cardinalities.

Here is the JSON query asking for facets:

{"from":0,"size":50,"query":{"query_string":{"query":"( tout:(internet) )","default_field":"tout","fields":[],"type":"best_fields","default_operator":"and","max_determinized_states":10000,"enable_position_increments":true,"fuzziness":"AUTO","fuzzy_prefix_length":0,"fuzzy_max_expansions":50,"phrase_slop":0,"escape":false,"auto_generate_synonyms_phrase_query":true,"fuzzy_transpositions":true,"boost":1.0}},"explain":true,"stored_fields":["Site_url","Site_nom","ETAT","Article_titre","Article_date_edition","Article_description","Article_url","DATE_COLLECT","ID_QES","BASE_ID","LOGIN_USER_COLLECT","Article_langue"],"sort":[{"DATE_COLLECT":{"order":"desc"}}],"aggregations":{"Site_nom.verbatim":{"terms":{"field":"Site_nom.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"Article_langue.verbatim":{"terms":{"field":"Article_langue.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"Article_categories.verbatim":{"terms":{"field":"Article_categories.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"DATE_COLLECT":{"date_histogram":{"field":"DATE_COLLECT","interval":"1d","offset":0,"order":{"_key":"asc"},"keyed":false,"min_doc_count":0}},"QES_Person.verbatim":{"terms":{"field":"QES_Person.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_Persontitre.verbatim":{"terms":{"field":"QES_Persontitre.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_Company.verbatim":{"terms":{"field":"QES_Company.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_Organization.verbatim":{"terms":{"field":"QES_Organization.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_Organonoff.verbatim":{"terms":{"field":"QES_Organonoff.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_Location.verbatim":{"terms":{"field":"QES_Location.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_Unlocalized.verbatim":{"terms":{"field":"QES_Unlocalized.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_Event.verbatim":{"terms":{"field":"QES_Event.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_ConceptCategorized.verbatim":{"terms":{"field":"QES_ConceptCategorized.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_Coldconcept.verbatim":{"terms":{"field":"QES_Coldconcept.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_Country.verbatim":{"terms":{"field":"QES_Country.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_c
ount_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_Region.verbatim":{"terms":{"field":"QES_Region.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_City.verbatim":{"terms":{"field":"QES_City.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_Product.verbatim":{"terms":{"field":"QES_Product.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}}}}

As you can see, there are many of them.

Here is our configuration:

  • 56 million documents
  • 3 indexes (2016, 2017, 2018), each with 10 shards,
  • spread across 5 nodes.

Each node is:

  • a physical Dell R430 machine,
  • Linux Debian Stretch,
  • Elasticsearch 6.2,
  • 12 cores / 24 threads,
  • 64 GB RAM,
  • 4 x 2 TB SATA disks in RAID 5,
  • a JVM with a 32 GB heap (-Xmx)

I know that moving to SSDs and adding more nodes would improve things, but I think there must be other avenues to explore before that.

Could you help us with this?

@jimczi or @jpountz do you think you could help?

This isn't Lucene's FieldCache (a structure that stored an uninverted index before being replaced by doc values in Lucene 4.0), but there are indeed some caches involved:

  • there is a cache for the ordinal maps that map segment ordinals to global ordinals (keyword and ip fields only). It can be beneficial to compute these ordinal maps at refresh time rather than at query time, thanks to the eager_global_ordinals option (see the mapping sketch after this list): eager_global_ordinals | Elasticsearch Guide [6.5] | Elastic
  • as with most databases, the filesystem cache plays an important role. If data was cold on the first request, then that request is expected to run slower than follow-up requests, since frequently accessed data is already in the filesystem cache for those follow-up requests
  • and finally Elasticsearch maintains a request cache, but it is only enabled by default when size = 0, so I don't think this explains what you are seeing.
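
For illustration, here is a minimal mapping sketch showing where the option goes at index-creation time. The index name and the doc mapping type are placeholders, and I'm assuming your *.verbatim fields are keyword sub-fields of text fields, as their names suggest:

PUT placeholder_index
{
  "mappings": {
    "doc": {
      "properties": {
        "QES_Person": {
          "type": "text",
          "fields": {
            "verbatim": {
              "type": "keyword",
              "eager_global_ordinals": true
            }
          }
        }
      }
    }
  }
}

The same eager_global_ordinals flag would be repeated on every keyword field that you aggregate on.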

If eager_global_ordinals doesn't make the situation better, could you share the number of hits and the output of the nodes hot threads API while the request is running? Nodes hot_threads | Elasticsearch Guide [6.5] | Elastic
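
For example, something along these lines while one of the slow searches is executing (the threads and interval parameters are optional; the values shown here are just the defaults):

GET /_nodes/hot_threads?threads=3&interval=500ms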


Happy new year Adrien,

Thank you for your explanations. This update will be very useful to many.

We did the following:

  • added eager_global_ordinals in our mapping to every keyword field liable to be used in a terms aggregation,
  • destroyed the index,
  • did a full reindexing.

On the first queries things did not change much, but after some warming the performance is much better, with a decent response time of about 3 seconds for most queries.

Definitely, eager_global_ordinals improves things!

For a big number of hits (1,403,062 hits, which is close to the whole database) it can still be very slow, with a response time close to 10 s. But with that many hits I can understand it.

Following your suggestion, here is the output of a "hot threads" command captured while one of those queries was running, in case you spot something else.

You will find it by following this link

I leave it to your judgment whether to analyse it, since I consider that you have already answered my question.

Hello Ivan, happy new year to you too!

Sorry that I didn't mention this explicitly, but reindexing was not required to enable eager global ordinals: they can be enabled on an existing index.
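
For the record, a sketch of what that update could look like on one of your existing indices; the index name and the doc mapping type here are placeholders, and I'm again assuming .verbatim is a keyword sub-field:

PUT /placeholder_index/_mapping/doc
{
  "properties": {
    "QES_Person": {
      "type": "text",
      "fields": {
        "verbatim": {
          "type": "keyword",
          "eager_global_ordinals": true
        }
      }
    }
  }
}

The global ordinal maps are then built eagerly at refresh time, without any reindexing.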

I just looked at the hot threads that you shared and everything looks fine. This is what I'd expect when running terms aggregations on large numbers of documents.

We try hard to come up with good defaults and most of the time there is very little to tune. You can look at Tune for search speed | Elasticsearch Guide [6.5] | Elastic for more information.

For the record, it's expected that not all requests take similar times. In particular response times tend to vary a lot depending on the number of documents that the query matches: the more matches, the longer the aggregation.

Thank you Adrien for this comprehensive response, enlightening all aspects of our issue and summarizing all the good practices for version 6.5.

Your response was so fast that our exchanges crossed: I had modified my post a little, tempering our results for big numbers of hits.

I guess this will help many.

Please close the case.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.