Unique Term Values via Aggregation - Performance Considerations

mikeydoyle · December 20, 2016, 6:10pm

I'm looking to get a distinct list of values for a few terms ("keyword" fields) in my index for display in a UI. I would consider these to be low cardinality (< 200 distinct values, for example). To do this, I have been experimenting with the aggregations API.

For example:

{ "aggs" : { "roles" : { "terms" : { "field" : "role", "size" : 200 } } } }

I've noticed that if the index is not being written to, the results are cached after the aggregation has run a couple of times, and then return almost instantaneously. The initial query, however, can take 30+ seconds. To take advantage of the caching behavior, I'm currently querying yesterday's index, which is relatively static. I would like to get "today's" data if possible, but that index is written to at a constant rate of 4 million documents per minute. From what I understand, this means the cache will virtually never be used, as it's invalidated on every refresh.

I'm looking for advice on a few points.

For the static index, how can I improve the initial query time? Any way to "pre-warm" the cache?
How can I ensure that the results are "always" cached?
How can I improve performance so that I can run this aggregation on an index that is constantly being written to ("today's" index)?
Any other considerations?

More about the cluster:

Use case is log aggregation
Elasticsearch 5.0
23 data nodes (8 core, 32GB mem each)
7 indexes (rolling, one for each day), roughly 5 billion documents each
23 shards per index (1 per node) with 1 replica for each shard
ingestion rate is roughly 4 million logs per minute

Christian_Dahlqvist · December 20, 2016, 6:19pm

One way to potentially speed this up quite a bit, as long as you are willing to accept a reasonably small lag, would be to run the aggregation every minute against just the records inserted during the last minute (assuming you have a timestamp field for this) and store the resulting records in a separate, small time-based index. You can then query this separate index, which will be much smaller, with low latency.
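
A minimal sketch of this idea in Python, building only the request and the summary documents (no client calls). The timestamp field name `@timestamp`, the aggregation name `values`, and the shape of the summary documents are assumptions for illustration, not something stated in the thread:

```python
import json
from datetime import datetime, timedelta, timezone


def build_last_minute_request(field, now, size=200, timestamp_field="@timestamp"):
    """Build a terms aggregation restricted to the last full minute before `now`.

    `timestamp_field` is an assumed field name; use whatever your logs carry.
    """
    end = now.replace(second=0, microsecond=0)
    start = end - timedelta(minutes=1)
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {
            "range": {
                timestamp_field: {
                    "gte": start.isoformat(),
                    "lt": end.isoformat(),
                }
            }
        },
        "aggs": {"values": {"terms": {"field": field, "size": size}}},
    }


def buckets_to_summary_docs(field, window_start, response):
    """Turn the aggregation buckets into one small document per distinct value.

    The document layout here is hypothetical; adapt it to your summary index.
    """
    buckets = response["aggregations"]["values"]["buckets"]
    return [
        {
            "field": field,
            "value": bucket["key"],
            "count": bucket["doc_count"],
            "window_start": window_start,
        }
        for bucket in buckets
    ]


if __name__ == "__main__":
    now = datetime(2016, 12, 20, 18, 19, 42, tzinfo=timezone.utc)
    print(json.dumps(build_last_minute_request("role", now), indent=2))
```

The request body would be sent to the `_search` endpoint of the current day's index once a minute, and the resulting documents indexed into the small summary index. The UI would then run its terms aggregation on `value` against that summary index instead of the multi-billion-document one.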

jpountz · December 20, 2016, 6:27pm

This is correct: the cache is invalidated on every refresh. But that also means that you can trade some realtimeness for query latency by increasing the refresh interval (1 second by default). For instance, if you increase the refresh interval to 30s, then results will be cached for up to 30 seconds, the drawback being that users cannot search or aggregate data that is less than 30 seconds old.
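
As a rough illustration, the refresh interval is a dynamic index setting, so the change amounts to a small settings body. The sketch below only builds that body in Python; the index name in the comment is hypothetical:

```python
import json


def refresh_interval_settings(interval="30s"):
    """Body for PUT /<index>/_settings that changes how often the index refreshes."""
    return {"index": {"refresh_interval": interval}}


if __name__ == "__main__":
    # Send with any HTTP client, for example:
    #   PUT /logs-2016.12.20/_settings   (index name is hypothetical)
    print(json.dumps(refresh_interval_settings("30s")))
```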

No, there is no way to pre-warm the cache at the moment. This should only happen once in the lifetime of the index though, so I presume it is not really an issue, or am I missing something?

I suspect a significant portion of the aggregation time is spent building global ordinals, so you could move this cost from query time to refresh time by warming up global ordinals for the role field at refresh time. See "Tune for search speed" in the Elasticsearch Guide [5.1].
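
A minimal sketch of what that mapping change could look like, again only building the request body in Python. It assumes `role` is mapped as a `keyword` field, as described in the question; the index and mapping type names in the comment are hypothetical:

```python
import json


def eager_global_ordinals_mapping(field):
    """Body for PUT /<index>/_mapping/<type> (5.x) enabling eager global ordinals.

    With this set, global ordinals for `field` are rebuilt at refresh time
    rather than by the first aggregation that needs them.
    """
    return {
        "properties": {
            field: {
                "type": "keyword",
                "eager_global_ordinals": True,
            }
        }
    }


if __name__ == "__main__":
    # For example: PUT /logs-2016.12.20/_mapping/log  (both names are hypothetical)
    print(json.dumps(eager_global_ordinals_mapping("role"), indent=2))
```

For daily rolling indexes, the same field definition would presumably belong in the index template so that each new index picks it up. Note that this shifts the cost to every refresh, so it pairs naturally with a longer refresh interval.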

mikeydoyle · December 20, 2016, 6:43pm

Thanks for the ideas. I'll check out eagerly loading the global ordinals, and see how that goes. I'm already refreshing on 30s intervals.

