Unique Term Values via Aggregation - Performance Considerations


(Michael Doyle) #1

I'm looking to get a distinct list of values for a few terms ("keyword" fields) in my index for display in a UI. I would consider these to be low cardinality (< 200 distinct values, for example). To do this, I have been experimenting with the aggregations API.

e.g.)

{ "aggs" : { "roles" : { "terms" : { "field" : "role", "size" : 200 } } } }

I've noticed that if the index is not being written to, the results are cached after running this aggregation a couple of times, and then return almost instantly. The initial query, however, can take 30+ seconds. To take advantage of the caching behavior, I'm currently querying "yesterday's" index, which is relatively static. I would like to get "today's" data if possible, but that index is written to at a constant rate of 4 million documents per minute. From what I understand, this means the cache will virtually never be used, as it's invalidated on every refresh.

I'm looking for advice on a few points.

  1. For the static index, how can I improve the initial query time? Any way to "pre-warm" the cache?
  2. How can I ensure that the results are "always" cached?
  3. How can I improve performance so that I can run this aggregation on an index that is constantly being written to ("today's" index)?
  4. Any other considerations?

More about the cluster:

  • Use case is log aggregation
  • Elasticsearch 5.0
  • 23 data nodes (8 core, 32GB mem each)
  • 7 indexes (rolling, one for each day), roughly 5 billion documents each
  • 23 shards per index (1 per node) with 1 replica for each shard
  • ingestion rate is roughly 4 million logs per minute

(Christian Dahlqvist) #2

One way to potentially speed this up quite a bit, at least as long as you are willing to accept a reasonably small lag, would be to run the aggregation every minute against just the records inserted during the last minute (assuming you have a timestamp field for this) and store the resulting buckets in a separate, small time-based index. You can then query this separate index, which will be much smaller, with low latency.
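A rough sketch of that idea in Python, building only the request bodies (nothing is sent to a cluster here). The timestamp field name `@timestamp`, the summary document layout (`field`, `value`, `count`, `minute`), and the sample response are assumptions made for illustration, not anything from the original setup.

```python
import json


def last_minute_query(field, ts_field="@timestamp", size=200):
    """Terms aggregation restricted to the last full minute of data."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {"range": {ts_field: {"gte": "now-1m/m", "lt": "now/m"}}},
        "aggs": {"values": {"terms": {"field": field, "size": size}}},
    }


def summary_docs(response, field, minute):
    """Turn the aggregation buckets into small documents for the summary index."""
    buckets = response["aggregations"]["values"]["buckets"]
    return [
        {"field": field, "value": b["key"], "count": b["doc_count"], "minute": minute}
        for b in buckets
    ]


def distinct_values_query(field, size=200):
    """Query for the summary index: distinct values recorded for one source field.

    Assumes "field" and "value" are mapped as keyword in the summary index.
    """
    return {
        "size": 0,
        "query": {"term": {"field": field}},
        "aggs": {"values": {"terms": {"field": "value", "size": size}}},
    }


# Made-up response, shaped like the "aggregations" part of a search response.
sample_response = {"aggregations": {"values": {"buckets": [
    {"key": "admin", "doc_count": 1200},
    {"key": "guest", "doc_count": 87},
]}}}

docs = summary_docs(sample_response, "role", "2017-01-01T00:00:00Z")
print(json.dumps(last_minute_query("role"), indent=2))
print(json.dumps(docs, indent=2))
print(json.dumps(distinct_values_query("role"), indent=2))
```

The UI would then run `distinct_values_query` against the small summary index instead of the 5-billion-document daily index, at the cost of results lagging by up to about a minute.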


(Adrien Grand) #3

This is correct. But that also means that you can trade some realtimeness for query latency by increasing the refresh interval (by default 1 second). For instance if you increase the refresh interval to 30s, then results will be cached up to 30 seconds, the drawback being that users cannot search or aggregate data that are less than 30 seconds old.
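For reference, the refresh interval is a dynamic index setting, so it can be changed on a live index. A minimal sketch of the request body, built in Python without contacting a cluster; the index name in the comment is a placeholder.

```python
import json


def refresh_interval_settings(interval):
    """Body for the update-settings API that changes the refresh interval."""
    return {"index": {"refresh_interval": interval}}


# Would be sent as: PUT /<index-name>/_settings  (index name is hypothetical)
body = refresh_interval_settings("30s")
print(json.dumps(body))
```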

No, there is no way to warm it at the moment. This should only happen once in the lifetime of the index though, so I presume it is not really an issue, or am I missing something?

I suspect a significant portion of the aggregation time is spent building global ordinals, so you could move this cost from query time to refresh time by warming up global ordinals for the role field at refresh time. https://www.elastic.co/guide/en/elasticsearch/reference/5.1/tune-for-search-speed.html#_warm_up_global_ordinals
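A sketch of what that mapping change could look like, following the `eager_global_ordinals` mapping parameter described on the linked page. Only the request body is built here; the index and type names in the comment are placeholders, and the field name `role` is taken from the aggregation in the first post.

```python
import json


def eager_ordinals_mapping(field):
    """Mapping body that builds global ordinals for a keyword field at refresh time."""
    return {
        "properties": {
            field: {
                "type": "keyword",
                "eager_global_ordinals": True,
            }
        }
    }


# Would be sent as: PUT /<index-name>/_mapping/<type-name>  (both names hypothetical)
mapping = eager_ordinals_mapping("role")
print(json.dumps(mapping, indent=2))
```

With this in place the cost of building global ordinals is paid on each refresh rather than by the first aggregation after it, so it trades some refresh and indexing overhead for lower query latency.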


(Michael Doyle) #4

Thanks for the ideas. I'll check out eagerly loading the global ordinals, and see how that goes. I'm already refreshing on 30s intervals.

