I'm looking to get a list of distinct values for a few terms ("keyword" fields) in my index, for display in a UI. I would consider these fields low cardinality (fewer than 200 distinct values each, for example). To do this, I have been experimenting with the aggregations API.
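For reference, this is roughly the shape of the request I've been running (index and field names simplified), a plain terms aggregation with "size": 0 so no hits are returned:

```
GET logs-2016.12.01/_search
{
  "size": 0,
  "aggs": {
    "distinct_values": {
      "terms": {
        "field": "my_keyword_field",
        "size": 200
      }
    }
  }
}
```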
I've noticed that if the index is not being written to, after a couple of runs of this aggregation the results are cached and return instantaneously. The initial query, however, can take 30+ seconds. To take advantage of the caching behavior, I'm currently querying "yesterday's" index, which is relatively static. I would like to get "today's" data if possible, but that index is written to at a constant rate of roughly 4 million documents per minute. From what I understand, this means the cache will virtually never be used, as it's invalidated on every segment refresh.
I'm looking for advice on a few points:
For the static index, how can I improve the initial query time? Any way to "pre-warm" the cache?
How can I ensure that the results are "always" cached?
How can I improve performance so that I can run this aggregation on an index that is constantly being written to ("today's" index)?
Any other considerations?
More about the cluster:
Use case is log aggregation
Elasticsearch 5.0
23 data nodes (8 core, 32GB mem each)
7 indexes (rolling, one for each day), roughly 5 billion documents each
23 shards per index (1 per node) with 1 replica for each shard
Ingestion rate is roughly 4 million logs per minute
One way to potentially speed this up quite a bit, as long as you are willing to accept a reasonably small lag, would be to run the aggregation every minute against just the records inserted during the last minute (assuming you have a timestamp field for this) and store the resulting values in a separate, small time-based index. You can then query this separate index, which will be much smaller, with low latency.
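A sketch of what that per-minute query could look like, assuming a @timestamp field and the same example field names as above (a small script or scheduled job would then index the returned bucket keys into the separate index):

```
GET logs-2016.12.01/_search
{
  "size": 0,
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-1m/m",
        "lt": "now/m"
      }
    }
  },
  "aggs": {
    "distinct_values": {
      "terms": { "field": "my_keyword_field", "size": 200 }
    }
  }
}
```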
This is correct. But it also means that you can trade some freshness for query latency by increasing the refresh interval (1 second by default). For instance, if you increase the refresh interval to 30s, results will stay cached for up to 30 seconds, the drawback being that users cannot search or aggregate data that is less than 30 seconds old.
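The refresh interval can be changed dynamically on the live index, for example (index name is just a placeholder):

```
PUT logs-2016.12.01/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}
```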
No, there is no way to warm it at the moment. This should only happen once in the lifetime of the index though, so I presume it is not really an issue, or am I missing something?
I suspect a significant portion of the aggregation time is spent building global ordinals, so you could move this cost from query time to refresh time by eagerly loading global ordinals for the role field. See "Tune for search speed" in the Elasticsearch Guide [5.1].
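A sketch of what that mapping change could look like for a keyword field (index, type and field names are just examples; eager_global_ordinals can be updated on an existing field):

```
PUT logs-2016.12.01/_mapping/log
{
  "properties": {
    "role": {
      "type": "keyword",
      "eager_global_ordinals": true
    }
  }
}
```

With this set, global ordinals are rebuilt as part of each refresh instead of lazily on the first aggregation that needs them after a refresh.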