I want to use Elasticsearch as a time-series database in which I store an array of 850 sensor values every hour. The database will hold data for several years. I want to query all the documents within a given time range and downsample them using a max aggregation. My question is: how do I efficiently compute the maximum value for each array index?
For example, I have the arrays of five documents and want to aggregate them into an array of the same size containing the maximum values:
I currently use a scripted metric aggregation that iterates over each of the 850 values in every document, which results in quite poor performance. Can this be achieved in a more efficient way?
I'm not 100% sure about the document mappings or the aggregations you're trying to achieve here, but if you want to sum only the maximum values found in each Elasticsearch document, where each doc holds an array, then at present you will have to use a script. The need for a script would be removed if you held a "max" value on each doc, which would be trivial to compute at write time.
Alternatively, you could create a "rolled-up" index using some of the techniques I describe here in building entity-centric indexing: https://www.youtube.com/watch?v=yBf7oeJKH2Y (includes some scripts you can download)
It amounts to the same thing - maintaining a thinned version of events.
Thanks for your fast reply. I don't want to compute the maximum of a single array but the maximum for each index across several arrays.
In the example above I have five input arrays, where the numbers 1, 3, 1, 5 and 4 are the values of each array at index 0. Therefore the resulting array should store 5 at index 0.
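To make the intended result concrete, here is a minimal client-side sketch of the per-index maximum in Python. It runs outside Elasticsearch and only illustrates the expected output. The values at index 0 (1, 3, 1, 5, 4) are the ones from the question; the function name elementwise_max and the values at the other indices are made up for illustration.

```python
from typing import List, Sequence


def elementwise_max(arrays: Sequence[Sequence[float]]) -> List[float]:
    """Return the maximum at each index across equally sized arrays."""
    if not arrays:
        return []
    length = len(arrays[0])
    if any(len(a) != length for a in arrays):
        raise ValueError("all arrays must have the same length")
    # zip(*arrays) groups the values that share an index into one tuple.
    return [max(column) for column in zip(*arrays)]


# Five hypothetical documents, shortened to three values each.
# Only the first column (1, 3, 1, 5, 4) comes from the question.
docs = [
    [1, 7, 2],
    [3, 0, 2],
    [1, 4, 9],
    [5, 1, 3],
    [4, 6, 8],
]

result = elementwise_max(docs)
print(result)  # [5, 7, 9]
```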
If the elements at each index in the array each represent a different metric, I would store these metrics as separate fields so that you can query them separately.
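As a rough sketch of what that could look like, the Python snippet below builds a search request body with one max aggregation per field, restricted to a time range. The field names (sensor_0, sensor_1, ... and timestamp) and the helper build_max_query are assumptions for illustration and not taken from the actual mapping; the body only uses the standard range query and max aggregation.

```python
import json
from typing import Any, Dict


def build_max_query(
    start: str,
    end: str,
    first: int = 0,
    count: int = 850,
    time_field: str = "timestamp",   # assumed name of the date field
    prefix: str = "sensor_",         # assumed naming scheme for the metric fields
) -> Dict[str, Any]:
    """Build a request body with one max aggregation per sensor field."""
    aggs = {
        f"max_{i}": {"max": {"field": f"{prefix}{i}"}}
        for i in range(first, first + count)
    }
    return {
        "size": 0,  # only the aggregation results are needed, not the hits
        "query": {"range": {time_field: {"gte": start, "lte": end}}},
        "aggs": aggs,
    }


# Example: request the maxima of the first three sensors for one day.
body = build_max_query("2020-01-01T00:00:00Z", "2020-01-02T00:00:00Z", count=3)
print(json.dumps(body, indent=2))
```

Passing a smaller count (and a different first) requests only a subset of the metrics in a single query.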
You would have to test to find out if it would be faster than your script solution.
What information are you actually presenting to the user? Surely a user will have trouble processing 850 figures if you are displaying them all at the same time? Are you doing some post processing of these results before you display something to the user? There may be other ways of achieving what you are after, for example, if you are presenting these results in pages, this would give you the opportunity to query only one page (say 20) of the metrics at a time.
I just tried it out and it seems there is no difference. The data is presented to the user all at once as an image. One column in that image would correspond to one aggregated document.