Hourly to Daily Aggregations

We have weekly indices that contain hourly aggregations. We want to move them to monthly indices with daily aggregations.
We are using 19 dimensions and 10 metrics. When I run a query across all of them, ES fills its entire heap, the query never returns, and ES itself gets stuck with an OutOfMemory exception.

What I thought of doing is running a tool like stream2es with the aggregation query and indexing the output into another index.
Can you please advise me on best practices for this process, so as not to overload ES?

Thanks in advance,

The 2 approaches for rolling up accurately would be as follows:

  1. Examine all the data in your client code
  2. Use the aggs framework to summarise the data for you.

Option 1 involves using the scan/scroll API to stream the data sorted by your chosen summary dimensions (e.g. hour/website), reducing it in your client code, and then writing the results to a new index with the bulk API (see [1]).
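The client-side reduce in option 1 might look something like this sketch (field names like `timestamp`, `website`, and `clicks` are made up for illustration; in practice the hourly docs would arrive from a scan/scroll over the source index and the rolled-up docs would be written back with the bulk API):

```python
from collections import defaultdict

def rollup_daily(hourly_docs, dims, metrics):
    """Collapse hourly docs into one doc per (day, *dims) with summed metrics."""
    buckets = defaultdict(lambda: defaultdict(float))
    for doc in hourly_docs:
        day = doc["timestamp"][:10]  # '2016-05-01T13:00' -> '2016-05-01'
        key = (day,) + tuple(doc[d] for d in dims)
        for m in metrics:
            buckets[key][m] += doc[m]
    for (day, *dim_values), sums in buckets.items():
        # One daily doc per unique (day, dimensions) combination
        yield {"timestamp": day, **dict(zip(dims, dim_values)), **sums}

# Tiny in-memory example standing in for the scroll results
docs = [
    {"timestamp": "2016-05-01T13:00", "website": "a.com", "clicks": 3},
    {"timestamp": "2016-05-01T14:00", "website": "a.com", "clicks": 2},
    {"timestamp": "2016-05-02T09:00", "website": "a.com", "clicks": 7},
]
daily = list(rollup_daily(docs, dims=["website"], metrics=["clicks"]))
```

Because the scroll is sorted by the summary dimensions, a real implementation can flush each bucket as soon as the key changes instead of holding them all in memory.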
Option 2 involves repeatedly calling the aggs framework for a subset of the data, e.g.

for all websites:
    aggs call to get daily stats for website
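One iteration of that loop could be sketched as building a request body like the following (field names are hypothetical): filter to a single website, then a daily `date_histogram` with a `sum` sub-aggregation per metric.

```python
def daily_stats_request(website, metrics):
    """Build an ES aggs request body for one website's daily rollup."""
    return {
        "size": 0,  # we only want the aggregations, not the hits
        "query": {"term": {"website": website}},
        "aggs": {
            "daily": {
                "date_histogram": {"field": "timestamp", "interval": "day"},
                "aggs": {m: {"sum": {"field": m}} for m in metrics},
            }
        },
    }

body = daily_stats_request("example.com", ["clicks", "impressions"])
```

Each response then stays small (one website's daily buckets), and the buckets can be indexed into the new monthly index with the bulk API.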

This can be a lot of calls if your grouping field has high cardinality, so one way of breaking the work into a smaller set of requests is to adopt the hash/modulo approach outlined in [2].
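The partitioning idea can be sketched in plain Python (the field name and partition count are made up): hash each term to a stable partition, then issue one aggs request per partition so no single response has to hold all the buckets.

```python
import hashlib

NUM_PARTITIONS = 4

def partition_of(term: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Use a stable hash (Python's built-in hash() is salted per process,
    # so it can't be reproduced on the query side)
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Stand-in for the full set of distinct grouping terms
websites = [f"site{i}.com" for i in range(100)]

partitions = {}
for w in websites:
    partitions.setdefault(partition_of(w), []).append(w)
# Each aggs call then restricts the query to one partition's terms
# (e.g. via a terms filter or a script applying the same hash/modulo),
# so the response tree stays small enough to fit in heap.
```

The same hash/modulo must be applied on the query side (for example in a script filter) so that every term consistently lands in exactly one request.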


[1] https://www.elastic.co/elasticon/2015/sf/building-entity-centric-indexes and http://bit.ly/entcent
[2] Getting accurate cardinality for a field in single shard index

Thanks for the answer! :slight_smile:

I'm afraid I did not understand you properly. Let's say I have this query https://gist.github.com/Alexk-Ybrant/ecdce68d691e05ce22f699b7bfa42199, and all of the data is already stored in ES.
The client in this case is the same ES server; I only want to move the daily aggregated data to a new index.

About the second option, how can I query just a subset of the data? I want it to be aggregated across all the fields.


That's a crazy number of dimensions to pull out in a query :slight_smile:
I assume this is theoretical because the upper limit for the leaves in the resulting tree is

numTermsInDim1 × numTermsInDim2 × … × numTermsInDim19

in other words, a very big number [1]. In my example I had assumed only one summary dimension was required (a summary per website); it's not practical to preserve so many possible permutations of summaries for future analysis.
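To put a number on it, a quick back-of-envelope calculation (with made-up cardinalities of just 10 distinct terms per dimension):

```python
# Even modest per-dimension cardinalities multiply into an enormous
# upper bound on the bucket count across 19 dimensions.
cardinalities = [10] * 19  # assume only 10 distinct terms per dimension

max_buckets = 1
for c in cardinalities:
    max_buckets *= c
# 10^19 potential leaves in the aggregation tree -- far more than any
# heap could ever hold, even though each individual dimension is tiny.
```

Real data is sparse, so not every combination occurs, but the worst case grows exponentially with the number of dimensions, which is exactly the wheat-and-chessboard effect.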

[1] Wheat and chessboard problem - Wikipedia