I would like to know more about aggregations and cached requests. Is there a way to pre-calculate aggregate data for pie charts and other graphs? I would like to improve request speed for the early data.
Also, is there a way to keep the last X million points (or a given number of bytes) cached in memory? And when we request data, how do we indicate that we don't need some of the fields? Does Kibana optimize this?
Is there a way to aggregate old data to reduce the resolution?
For the most part, the disk caching behavior in modern OSes just does the whole "cache x bytes" thing. Aggregations use a column store format to do their job, so they already don't load all the data. If you are using doc values, the column store is on disk; if you are using field data (old, bad), the column store has to be materialized in memory on first request. So for field data the disk cache doesn't come into it, really.
Not a built-in one, but it's something I've been thinking about for a few months. As an rrdtool fan, Elasticsearch contributor, and Elastic employee, I really would like to borrow some of rrdtool's concepts one day. But that is a long way off for me, I think.
I don't know offhand how the query cache interacts with aggregations, sorry.
This would be awesome! Also, Kibana lacks some basic features, like customizing axis names.
I am just trying to push one node of Elasticsearch to index more than 60K documents per second (I once got 80K). Each document is about as big as an HTTP transaction with its URL, response code, etc. But I would also like to improve the way Kibana requests the data.
When you choose to show 1 week of data with one-minute intervals, what kind of request does Kibana send to Elasticsearch? I assume Elasticsearch is the one that calculates the aggregation to produce the one-minute intervals. Is there a way to avoid this slow calculation and pre-calculate it? Like, "Hey ES, I'm going to request the last week of data really often at these intervals, so how about calculating it in advance?"
"...the disk caching behavior in modern OSes just does the whole 'cache x bytes' thing."
Yes, the raw data from the index would be cached by the system, but I want to cache the result of the calculations made with this data.
Yeah - Elasticsearch does the number crunching. I don't think there is much of a way to do this, but I suspect @polyfractal (who I hope this ping reaches) knows much, much more.
The rrd-like feature I keep thinking about is the history in fixed-size bins: the way it loses precision as you go back in time by combining bins in ways that are still amenable to the aggregations it supports. It's a beautiful design because it limits the size of your data while supporting massive retention.
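To make the bin-combining idea concrete, here is a minimal sketch in Python. It is not rrdtool's or Elasticsearch's implementation; the `Bin` class and `consolidate` function are hypothetical names made up for illustration. The point is that if each bin keeps count, sum, min and max, adjacent bins can be merged into a coarser bin and those aggregations (plus the mean) can still be answered afterwards.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bin:
    start: int      # bin start, in seconds since epoch
    width: int      # bin width, in seconds
    count: int      # number of points that fell into the bin
    total: float    # sum of the values
    minimum: float
    maximum: float

    @property
    def mean(self) -> float:
        # The mean is derived, so it survives merging without being stored.
        return self.total / self.count if self.count else float("nan")


def consolidate(bins: List[Bin], factor: int) -> List[Bin]:
    """Merge each run of `factor` adjacent bins into one coarser bin."""
    merged = []
    for i in range(0, len(bins), factor):
        group = bins[i:i + factor]
        merged.append(Bin(
            start=group[0].start,
            width=sum(b.width for b in group),
            count=sum(b.count for b in group),
            total=sum(b.total for b in group),
            minimum=min(b.minimum for b in group),
            maximum=max(b.maximum for b in group),
        ))
    return merged
```

Ten one-minute bins passed through `consolidate(bins, 5)` become two five-minute bins. Aggregations that cannot be rebuilt from these four numbers (exact percentiles, for example) would be lost, which is the precision trade-off described above.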
Isn't this feature something that would help out a lot?
The only current problem is that Kibana always sends different requests.
E.g.
query0: 00:00:00 to 00:05:00
refresh after 10s
query1: 00:00:10 to 00:05:10 -> no cache
refresh after 10s
query2: 00:00:20 to 00:05:20 -> no cache
I think that a magnet/snap function would improve things a lot. Something like this:
query0: 00:00:00 to 00:05:00 -> OK
query1: 00:00:10 to 00:05:10 -> gets transformed to 00:00:00 to 00:05:00 AND 00:05:00 to 00:10:00 (to cover new data as well, but this would be the downside) -> already cached!
With this magnet/snap, the cache could be used more often.
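A rough sketch of that snapping in Python, assuming timestamps in seconds and a 5-minute grid. The names `snap_range` and `split_into_windows` are hypothetical, and nothing like this exists in Kibana as far as this thread knows.

```python
def snap_range(start: int, end: int, step: int) -> tuple:
    """Expand [start, end) outward to the nearest multiples of `step` seconds."""
    snapped_start = (start // step) * step
    snapped_end = -(-end // step) * step  # ceiling division
    return snapped_start, snapped_end


def split_into_windows(start: int, end: int, step: int) -> list:
    """Return the grid-aligned windows that cover the requested range."""
    s, e = snap_range(start, end, step)
    return [(t, t + step) for t in range(s, e, step)]


STEP = 300  # 5 minutes

print(split_into_windows(0, 300, STEP))    # query0
print(split_into_windows(10, 310, STEP))   # query1
print(split_into_windows(20, 320, STEP))   # query2
```

Here query1 and query2 map to exactly the same two windows, so the second one could be served from a cache keyed on the window boundaries. The cost is the one mentioned above: each request covers somewhat more data than was asked for, and the newest window still changes while data is being indexed into it.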
Correct. Kibana is firing off various date_histogram aggregations, based on the options you've toggled in the dashboard (fields, interval, range, etc). Elasticsearch is then building the buckets at the specified interval and returning the result.
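For reference, a request of that general shape can be built like this in Python. This is only a sketch, not the exact body Kibana sends: the `@timestamp` field name, the `over_time` aggregation name and the helper function are assumptions, and newer Elasticsearch versions use `fixed_interval` or `calendar_interval` in place of `interval`.

```python
import json


def date_histogram_request(field: str, interval: str, gte: str, lte: str) -> dict:
    """Build a search body that buckets documents by time (hypothetical helper)."""
    return {
        "size": 0,  # return only the aggregation, no individual hits
        "query": {"range": {field: {"gte": gte, "lte": lte}}},
        "aggs": {
            "over_time": {
                "date_histogram": {"field": field, "interval": interval}
            }
        },
    }


# One week of data in one-minute buckets, as in the question above.
body = date_histogram_request("@timestamp", "1m", "now-7d", "now")
print(json.dumps(body, indent=2))
```

Note that the range is expressed relative to `now`, so the effective bounds change every time the request is sent.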
This is exactly what I was going to suggest. The query cache is designed to help out with situations like this, where a single query/agg is repeatedly executed. If the response doesn't change, it can be cached.
But, as @lwintergerst points out, Kibana doesn't play very nicely with the feature yet because the time ranges slide and invalidate immediately. Using custom time ranges instead of now- would work, or something like the suggested "magnet" functionality.
In the future, we'd like to make the query cache "smarter". For example, if you are querying five days across five indices, even if you do now - 5d we know that the "interior" three indices are valid from the last query, and could cache those shard level results. That means only the two "edge" indices would need to be re-queried.
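As a sketch of that idea only (it does not describe how Elasticsearch works today), the following Python classifies the days touched by a time range, assuming one index per day named by date. The function name and the `%Y.%m.%d` naming are assumptions for illustration.

```python
from datetime import datetime, timedelta


def classify_daily_indices(start: datetime, end: datetime):
    """Split the days touched by [start, end) into interior and edge days."""
    interior, edge = [], []
    day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    while day < end:
        next_day = day + timedelta(days=1)
        # A day is "interior" when the range covers it completely.
        fully_covered = start <= day and next_day <= end
        (interior if fully_covered else edge).append(day.strftime("%Y.%m.%d"))
        day = next_day
    return interior, edge


interior, edge = classify_daily_indices(
    datetime(2015, 6, 1, 12, 0), datetime(2015, 6, 5, 12, 0)
)
print("reusable:", interior)
print("re-query:", edge)
```

In this example the range touches five daily indices. The three interior ones are covered in full no matter how the range slides within the edge days, so their shard-level results could in principle be reused, and only the two edge indices would need to be queried again.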
It's not possible today, but there is a lot of work going on to clean up the internal Query parsing to make stuff like this possible.