I am trying to implement a way to automatically clean up our data stored in Elasticsearch. We want to keep only recent data, for example data less than 6 months old.
I have seen that the best practice for temporal data is rolling indices. But I wanted to know if anybody has another idea, as our indices are not structured for that and we can't change them for now (too many complex custom aggregation scripts).
Our data is organized like this:
One document per device, and each document contains an array of nested elements. Each nested element corresponds to one day of data.
Here are the possible solutions:
We have the possibility to recompute all the index content, so one solution is to recompute the content of the indices weekly or monthly into a new index up to the desired date, but it is very expensive (there is a computational phase to retrieve these data).
Another solution would be to do a kind of reindex in which we would exclude some of the nested elements from each document. Is it possible to do an Elasticsearch reindex using a custom script for that?
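For reference, the Reindex API does accept a script that can modify `ctx._source` for each document, so pruning old nested entries during a reindex looks feasible. Below is a rough sketch, assuming the nested field is called `days`, each entry has an ISO-8601 `date` string (so a lexicographic comparison works), and the index names are placeholders:

```
POST _reindex
{
  "source": { "index": "devices" },
  "dest":   { "index": "devices-pruned" },
  "script": {
    "lang": "painless",
    "params": { "cutoff": "2020-01-01" },
    "source": "ctx._source.days.removeIf(d -> d.date.compareTo(params.cutoff) < 0)"
  }
}
```

You would still swap the alias over to the new index once the reindex finishes, and the whole document is rewritten either way, so it does not avoid the cost of reindexing large nested documents.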
How much data is stored at the device level? How often is this updated? Have you considered simply denormalizing the data, which would allow you to use time-based indices?
Using nested documents this way will make any changes to documents expensive as the number of data points within grows. This applies to both adding and deleting items, as the entire document needs to be reindexed.
There are hundreds of thousands of devices. The total size of an index (including nested documents) is around 40 GB. It is updated around once per hour per device.
Denormalizing the data to use time-based indices has been considered but will not be done for now in this case. We are first doing this denormalization on a smaller use case with fewer aggregates, and we realized that using flat indices makes our aggregations much more complex to write. (But at some point we will probably switch, depending on the results we get.)
As any kind of change to a document requires all parts to be reindexed, I do not think there is any more efficient way to do it than you have already described. Processing will get more and more expensive the larger the nested documents get, so if you are expecting to store data points for a long time I would recommend reconsidering now rather than later.
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.