Hints to improve performance for numerous aggregations with high cardinalities

ivanqwam · November 7, 2018, 3:08pm

Searching with numerous agregations, each with high cardinalities, can be very slow (10-15s).
Kibana dashboards involving those same aggregations are slow too (over 10s), but they contain multiple visualizations.
Could you help us by providing some hints to improve the situation ?

I know that this is a very old debated problem. All the posts I could find are old. We are using Elasticsearch 6.2 and I think that an update would be useful.

Field cache seems to be involved since the first query is far longer (10-20s) that the following ones (with then seem to be randomly long 7s to short 3s).
We tried different methods to warm up the index with not much success.

We are editors for an annotator performing names entities recognition.
These named entities are transformed in fields.
These fields provides numerous aggregations/facets, each with high cardinalities.

Here is the JSON query asking for facets:

{"from":0,"size":50,"query":{"query_string":{"query":"( tout:(internet) )","default_field":"tout","fields":[],"type":"best_fields","default_operator":"and","max_determinized_states":10000,"enable_position_increments":true,"fuzziness":"AUTO","fuzzy_prefix_length":0,"fuzzy_max_expansions":50,"phrase_slop":0,"escape":false,"auto_generate_synonyms_phrase_query":true,"fuzzy_transpositions":true,"boost":1.0}},"explain":true,"stored_fields":["Site_url","Site_nom","ETAT","Article_titre","Article_date_edition","Article_description","Article_url","DATE_COLLECT","ID_QES","BASE_ID","LOGIN_USER_COLLECT","Article_langue"],"sort":[{"DATE_COLLECT":{"order":"desc"}}],"aggregations":{"Site_nom.verbatim":{"terms":{"field":"Site_nom.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"Article_langue.verbatim":{"terms":{"field":"Article_langue.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"Article_categories.verbatim":{"terms":{"field":"Article_categories.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"DATE_COLLECT":{"date_histogram":{"field":"DATE_COLLECT","interval":"1d","offset":0,"order":{"_key":"asc"},"keyed":false,"min_doc_count":0}},"QES_Person.verbatim":{"terms":{"field":"QES_Person.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_Persontitre.verbatim":{"terms":{"field":"QES_Persontitre.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_Company.verbatim":{"terms":{"field":"QES_Company.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_Organization.verbatim":{"terms":{"field":"QES_Organization.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_Organonoff.verbatim":{"terms":{"field":"QES_Organonoff.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_Location.verbatim":{"terms":{"field":"QES_Location.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_Unlocalized.verbatim":{"terms":{"field":"QES_Unlocalized.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_Event.verbatim":{"terms":{"field":"QES_Event.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_ConceptCategorized.verbatim":{"terms":{"field":"QES_ConceptCategorized.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_Coldconcept.verbatim":{"terms":{"field":"QES_Coldconcept.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_Country.verbatim":{"terms":{"field":"QES_Country.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_Region.verbatim":{"terms":{"field":"QES_Region.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_City.verbatim":{"terms":{"field":"QES_City.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}},"QES_Product.verbatim":{"terms":{"field":"QES_Product.verbatim","size":20,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}}}}

As you can see they are numerous.

Here is our configuration:

56 Millions documents
3 indexes 2016, 2017, 2018, each with 10 shards,
on 5 nodes.

Each nodes is :

Physical machines DELL R430
Linux Debian Stretch
Elasticsearch 6.2
12 CPU/24 threads,
64 GB RAM
4 x 2TB SATA disk in RAID5
JVM with -Xmx=32 GB

I know that going to SSD disk and adding more nodes would improve things but I think that there must be some other tracks to follow before that.

Could you help us on this ?

dadoonet · November 28, 2018, 12:01pm

@jimczi or @jpountz do you think you could help?

jpountz · December 21, 2018, 10:01am

This isn't Lucene's FieldCache (a structure that stored an uninverted index before getting replaced by doc values in Lucene 4.0) but there are some caches involved indeed:

there is a cache for ordinal maps that map segment ordinals to global ordinals (only keyword and ip fields). It can be beneficial to compute these ordinals maps at refresh time rathen than query time thanks to the eager_global_ordinals option: eager_global_ordinals | Elasticsearch Guide [6.5] | Elastic
as with most data bases, the filesystem cache plays an important role. If data was cold on the first request, then it's expected to run slower than follow-up requests, since frequently accessed data is already in the filesystem cache for these follow-up requests
and finally Elasticsearch maintains a request cache, but it's only enabled by default when size = 0 so I don't think this explains what you are seeing.

If eager_global_ordinals don't make the situation better, could you share the number of hits and the output of the nodes hot threads API while the request is running? Nodes hot_threads | Elasticsearch Guide [6.5] | Elastic

ivanqwam · January 2, 2019, 9:42am

Happy new year Adrian,

Thank you for your explanations. This update will be very useful to many.

We did the following:

added eager_global_ordinals in our mapping to every "keyword" columns liable to be used in Term aggregation.
Destroyed the index
Full reindexing

On the first queries, things did not change much, but after some warming, the performances are much better with a decent response time of 3 seconds for most queries.

Definitely, eager_global_ordinals improves things !

For big number of hits (1 403 062 hits which is close to the whole database) it can still be very slow with a response time close to 10s. But with this number of hits I can understand that.

Following your suggestion, here is a "hot threads" command for one of them while running, if you see something else ....

You will find them following this link

I let it to your judgment to analyse it since I consider that you already answered my question.

jpountz · January 2, 2019, 10:01am

Hello Ivan, happy new year to you too!

Sorry that I didn't mention this explicitly, but reindexing was not required to enable eager global ordinals, they can be enabled on an existing index.

I just looked at the hot threads that you shared and everything looks fine. This is what I'd expect when running terms aggregations on large numbers of documents.

We try hard to come up with good defaults and most of the time there is very little to tune. You can look at Tune for search speed | Elasticsearch Guide [6.5] | Elastic for more information.

For the record, it's expected that not all requests take similar times. In particular response times tend to vary a lot depending on the number of documents that the query matches: the more matches, the longer the aggregation.

ivanqwam · January 2, 2019, 10:11am

Thank you Adrien for this comprehensive response enlighting all aspect of our issue and summarizing all the good pratices on 6.5 version.

Your response was so fast that I modified my post a liitle, by tempering our results on big number of hits, and our exchanges crossed.

I guess this will help many.

Please close the case.

system · January 30, 2019, 10:11am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Slow aggregations query Elasticsearch	1	465	August 28, 2017
Slow terms aggregations after use of eager_global_ordinals Elasticsearch	6	824	November 9, 2020
High cardinality field term agg results in slow query regardless of hit count Elasticsearch	5	2097	July 5, 2017
Slow terms aggregations Elasticsearch	14	6006	July 5, 2017
Slow Terms Aggregation Elasticsearch	4	1432	May 10, 2019

Hints to improve performance for numerous aggregations with high cardinalities

Related topics