I'm looking to get a distinct list of values for a few terms ("keyword" fields) in my index for display in a UI. I would consider these to be low cardinality (< 200 distinct values, for example). To do this, I have been experimenting with the aggregations API.
For example:
{
  "aggs" : {
    "roles" : {
      "terms" : { "field" : "role", "size" : 200 }
    }
  }
}
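To make it concrete what the UI consumes, here is a minimal Python sketch that builds the same request body and pulls the distinct values out of a terms aggregation response. The helper names (`build_terms_agg`, `distinct_values`) and the sample response are made up for illustration; only the `aggregations.<name>.buckets[].key` layout is the standard terms aggregation response shape.

```python
import json


def build_terms_agg(field, size=200, name="roles"):
    """Build the request body for a terms aggregation on one keyword field."""
    return {"aggs": {name: {"terms": {"field": field, "size": size}}}}


def distinct_values(response, name="roles"):
    """Return the bucket keys (the distinct values) from a terms aggregation response."""
    return [bucket["key"] for bucket in response["aggregations"][name]["buckets"]]


body = build_terms_agg("role", size=200, name="roles")

# Hypothetical response, trimmed to the part the UI needs.
sample_response = {
    "aggregations": {
        "roles": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "admin", "doc_count": 3},
                {"key": "guest", "doc_count": 1},
            ],
        }
    }
}

print(json.dumps(body))
print(distinct_values(sample_response, "roles"))
```

The UI only needs the bucket keys; the `doc_count` values are ignored here.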
I've noticed that if the index is not being written to, the results are cached after running this aggregation a couple of times, and subsequent requests return almost instantly. The initial query, however, can take 30+ seconds. To take advantage of the caching behavior, I'm currently querying yesterday's index, which is relatively static. I would like to get today's data if possible, but that index is written to at a constant rate of 4 million documents per minute. From what I understand, this means the cache will virtually never be used, as it is invalidated on every segment refresh.
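For reference, picking yesterday's index amounts to something like the sketch below. The `logs-` prefix and the `YYYY.MM.DD` suffix are assumptions for illustration only, not my actual index names; the point is just that the target is the most recent index that is no longer being written to.

```python
from datetime import date, timedelta


def daily_index_name(day, prefix="logs-"):
    """Name of the rolling index for a given day (hypothetical naming scheme)."""
    return prefix + day.strftime("%Y.%m.%d")


def yesterdays_index(today, prefix="logs-"):
    """Name of the most recent index that should no longer receive writes."""
    return daily_index_name(today - timedelta(days=1), prefix)


print(yesterdays_index(date.today()))
```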
I'm looking for advice on a few points.
- For the static index, how can I improve the initial query time? Any way to "pre-warm" the cache?
- How can I ensure that the results are "always" cached?
- How can I improve performance so that I can run this aggregation on an index that is constantly being written to ("today's" index)?
- Any other considerations?
More about the cluster:
- Use case is log aggregation
- Elasticsearch 5.0
- 23 data nodes (8 cores, 32GB memory each)
- 7 indexes (rolling, one for each day), roughly 5 billion documents each
- 23 shards per index (1 per node) with 1 replica for each shard
- Ingestion rate is roughly 4 million logs per minute
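As a rough sanity check on the figures above, the arithmetic can be sketched as follows. It assumes a perfectly steady ingestion rate and evenly sized, evenly distributed shards, which is only an approximation of the real cluster.

```python
LOGS_PER_MINUTE = 4_000_000       # stated ingestion rate
DOCS_PER_INDEX = 5_000_000_000    # stated rough size of one daily index
PRIMARY_SHARDS = 23               # primaries per index
REPLICAS = 1                      # replica copies per primary
DATA_NODES = 23

# Upper bound on one day's documents if the rate never dipped.
docs_per_day = LOGS_PER_MINUTE * 60 * 24

# Average indexing rate per second across the whole cluster.
docs_per_second = LOGS_PER_MINUTE / 60

# Documents each primary shard holds in a full daily index.
docs_per_primary_shard = DOCS_PER_INDEX / PRIMARY_SHARDS

# Shard copies (primary + replica) each node hosts per index.
shard_copies_per_node = PRIMARY_SHARDS * (1 + REPLICAS) / DATA_NODES

print(f"{docs_per_day:,} docs/day at a steady rate")
print(f"{docs_per_second:,.0f} docs/second")
print(f"{docs_per_primary_shard:,.0f} docs per primary shard")
print(f"{shard_copies_per_node:.0f} shard copies per node per index")
```

A steady 4 million per minute works out to about 5.76 billion per day, in line with the roughly 5 billion documents per index, and each primary shard ends up holding on the order of 217 million documents.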