Hourly to Daily Aggregations

Hi,
We have weekly indices that contain hourly aggregations. We want to move them to monthly indices with daily aggregations.
We are using 19 dimensions and 10 metrics. When I run a query on all of these metrics, ES fills all of its heap space, the query never returns, and ES itself gets stuck with an OutOfMemory exception.

What I thought of doing is running a tool like stream2es with the aggregation query and indexing the output to another index.
Can you please advise me on best practices for this process so I don't overload ES?

Thanks in advance,
Alex.

The 2 approaches for rolling up accurately would be as follows:

  1. Examine all the data in your client code
  2. Use the aggs framework to summarise the data for you.

Option 1 involves using the scan/scroll API to stream the data, sorted by your chosen summary dimensions (e.g. hour/website), reducing it in your client code, and then writing it to a new index with the bulk API (see [1]).
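
As a rough sketch only (using the Python client; the index names and the `website`/`timestamp`/`bytes` fields are placeholders for whatever is in your mapping, and I'm assuming a single summary dimension of website/day):

```python
from collections import defaultdict

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# Reduce on (website, day); each value holds the running daily totals.
daily = defaultdict(lambda: {"bytes_sum": 0.0, "doc_count": 0})

# Stream every hourly document out of the weekly source index via scan/scroll.
for hit in helpers.scan(es, index="metrics-weekly", query={"query": {"match_all": {}}}):
    src = hit["_source"]
    day = src["timestamp"][:10]        # assumes ISO timestamps, e.g. "2016-08-07T13:00:00"
    key = (src["website"], day)
    daily[key]["bytes_sum"] += src["bytes"]
    daily[key]["doc_count"] += 1

# Bulk-index the reduced daily documents into the monthly target index.
actions = (
    {
        "_index": "metrics-monthly",
        "_source": {"website": website, "day": day, **stats},
    }
    for (website, day), stats in daily.items()
)
helpers.bulk(es, actions)
```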
Option 2 involves you repeatedly calling the aggs framework for a subset of the data e.g.

for all websites:
    aggs call to get daily stats for website

This can be a lot of calls if your grouping field is high cardinality, so one way of breaking it up into a smaller set of single requests is to adopt the hash/modulo approach outlined in [2], as sketched below.
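
Very roughly, something like this (again with placeholder index/field names, assuming `website` is the high-cardinality grouping field; the script query just buckets websites by hashCode modulo N, and exact parameter names such as `interval` vary between ES versions):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

NUM_PARTITIONS = 20  # more partitions = smaller, cheaper aggs responses

actions = []
for partition in range(NUM_PARTITIONS):
    body = {
        "size": 0,
        # Keep only documents whose website hashes into this partition.
        "query": {
            "script": {
                "script": {
                    "source": "(doc['website'].value.hashCode() % params.n + params.n) % params.n == params.p",
                    "params": {"n": NUM_PARTITIONS, "p": partition},
                }
            }
        },
        "aggs": {
            "per_website": {
                "terms": {"field": "website", "size": 10000},
                "aggs": {
                    "per_day": {
                        "date_histogram": {"field": "timestamp", "interval": "day"},
                        "aggs": {"bytes_sum": {"sum": {"field": "bytes"}}},
                    }
                },
            }
        },
    }
    resp = es.search(index="metrics-weekly", body=body)

    # Flatten the website/day buckets into daily documents for the new index.
    for site in resp["aggregations"]["per_website"]["buckets"]:
        for day in site["per_day"]["buckets"]:
            actions.append({
                "_index": "metrics-monthly",
                "_source": {
                    "website": site["key"],
                    "day": day["key_as_string"],
                    "bytes_sum": day["bytes_sum"]["value"],
                    "doc_count": day["doc_count"],
                },
            })

helpers.bulk(es, actions)
```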

Cheers
Mark

[1] https://www.elastic.co/elasticon/2015/sf/building-entity-centric-indexes and http://bit.ly/entcent
[2] Getting accurate cardinality for a field in single shard index

Hi,
Thanks for the answer! :slight_smile:

I'm afraid I did not understand you properly. Let's say I have this query https://gist.github.com/Alexk-Ybrant/ecdce68d691e05ce22f699b7bfa42199, and all of the data is already stored in ES.
The client in this case is the same ES server. I only want to move the daily aggregated data to a new index.

About the second option, how can I query just a subset of the data? I want it to be aggregated on all of the fields.

Thanks,
Alex.

That's a crazy number of dimensions to pull out in a query :slight_smile:
I assume this is theoretical because the upper limit for the leaves in the resulting tree is

numTermsInDim1 x numTermsInDim2 x ... x numTermsInDim19

in other words, a very big number [1] (even with only 10 terms per dimension, that is 10^19 leaves). In my example I had assumed just one summary dimension was required (a summary per website), but it's not possible to preserve that many permutations of summaries for future analysis.

[1] Wheat and chessboard problem - Wikipedia