Best practice to save aggregated data to Elasticsearch for long-term storage?

Hi,

We have a lot of data in ES, with a retention time of 30 days during which all detail data is available.

We've imported the raw logs into ES (parsed into fields, etc., of course).
In our dashboards we do a lot of aggregations across all our big log sources.

Sometimes it is important for us to compare the current data with data that is several months old. Currently we do this in a rather ugly way: we "export" a month view of a bunch of dashboards as an HTML export or a screenshot.

Now I am thinking of a different approach and need some advice on the easiest and most practical way to get there:

  • I would like to create a batch job which queries ES each night (when load on the system is low).
  • This job will aggregate the data over a larger time interval (e.g. 1 or 3 hours).
  • The results of these aggregations then need to go back into ES, but into another index with a longer retention time. I would like to keep this archival index for a year or longer.

The storage needed for the archive will be much, much smaller: for a 6 GB log with 4 million log entries a day, I will keep only 24 entries a day, storing for example the following fields (see the query sketch after the list):

  • count
  • processing time avg
  • processing time min
  • processing time max
  • processing time percentiles 25, 50, 75, 90, 95
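
For illustration, here is roughly what that nightly query could look like. This is a minimal sketch via curl, assuming an index pattern `logs-*`, a `@timestamp` field, and a numeric field `processing_time` (all placeholder names to adapt):

```
# Aggregate yesterday's logs into 1-hour buckets with count, avg/min/max
# and the listed percentiles. Field and index names are placeholders.
curl -s -XPOST 'http://localhost:9200/logs-*/_search?pretty' \
  -H 'Content-Type: application/json' -d '{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "now-1d/d", "lt": "now/d" } }
  },
  "aggs": {
    "per_interval": {
      "date_histogram": { "field": "@timestamp", "interval": "1h" },
      "aggs": {
        "pt_stats": { "stats": { "field": "processing_time" } },
        "pt_percentiles": {
          "percentiles": {
            "field": "processing_time",
            "percents": [25, 50, 75, 90, 95]
          }
        }
      }
    }
  }
}'
```

The stats sub-aggregation returns count, min, max and avg (plus sum) in one go, so it covers the first four fields.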

So what is the easiest way to achieve that?
I have the following ideas in mind, but I am open to new ones :wink:

Idea 1:

  • generate the aggregation via curl and append to a file
  • ship this file to Logstash. Most of the parsing should be handled by the json codec; then I just change the target index and maybe (if needed) the type to prevent interference with the source data. A config sketch follows below.
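
For the Logstash side of idea 1, a minimal sketch of the pipeline config, written out as a heredoc; the paths, host, index name, and the bucket field `key_as_string` are assumptions to adapt:

```
# Hypothetical pipeline for shipping exported buckets to an archive index;
# assumes the batch job appends one JSON object (= one bucket) per line.
cat > /etc/logstash/conf.d/archive.conf <<'EOF'
input {
  file {
    path => "/var/log/es-archive/aggs-*.json"
    codec => "json"
    start_position => "beginning"
  }
}
filter {
  # set @timestamp from the bucket key (field name is an assumption)
  date { match => ["key_as_string", "ISO8601"] }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "archive-%{+YYYY.MM}"   # separate index, longer retention
    document_type => "aggregated"    # optional; types are phased out in 6.x/7.x
  }
}
EOF
```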

Idea 2:

  • create a shell script
  • do the aggregation
  • insert the results into Elasticsearch directly (see the script sketch below)
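
Idea 2 could look something like the following sketch, assuming jq is available for reshaping the response into bulk format; index, type, and field names are placeholders:

```
#!/bin/sh
# Nightly archival job (sketch): query, reshape with jq, bulk insert.
ES=http://localhost:9200

# 1) Run the aggregation (body as in the curl example above).
curl -s -XPOST "$ES/logs-*/_search" \
  -H 'Content-Type: application/json' \
  -d @aggregation-query.json > /tmp/agg-result.json

# 2) Emit an action line plus a document line per bucket (bulk format).
jq -c --arg idx "archive-$(date +%Y.%m)" '
  .aggregations.per_interval.buckets[]
  | {index: {_index: $idx, _type: "aggregated"}},
    {"@timestamp": .key_as_string,
     count: .doc_count,
     processing_time: .pt_stats,
     processing_time_percentiles: .pt_percentiles.values}
' /tmp/agg-result.json > /tmp/agg-bulk.json

# 3) Bulk insert into the archive index.
curl -s -XPOST "$ES/_bulk" \
  -H 'Content-Type: application/x-ndjson' \
  --data-binary @/tmp/agg-bulk.json
```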

Idea 3:

  • connect to ES via the Java API for querying and inserting.

Idea 4:

  • is there a way to do this inside ES itself?

Scheduling could be done via cron.
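
For example, a nightly crontab entry could look like this (the script path and log file are placeholders):

```
# m  h  dom mon dow  command
30 2 * * * /opt/es-archive/nightly_aggregate.sh >> /var/log/es-archive.log 2>&1
```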

My current favorite is idea 1, because I am familiar with doing aggregations via curl (stealing the query string from Kibana) and I am familiar with Logstash. Also, Logstash takes care of details like bulk inserting and so on.

But maybe there are some downsides or pitfalls I do not see yet.

Ah yes, we are currently on the 5.1.2 stack, but planning to upgrade to the latest 5.x or, better, 6.x. So I do not want to implement something that needs a lot of changes when upgrading to 6.x, or to 7.x when it becomes available.

Thanks a lot, Andreas
