Manage old data based on time


#1

Hi guys,

We are working on a system that handles tons of data, but our users typically search for data from the last 3 days. Other typical options are the last 7, 14, or 30 days, and there is also an option for a custom time interval.

Is there any way to set up indices like the following:

  • having an index which contains data just from the last 3 days

  • having an index which contains data between 3-7 days

  • having an index which contains data between 7-14 days

  • having an index which contains data between 14-30 days

  • having a huge index which contains data older than 30 days.

  • if a document becomes old enough, it is automatically moved to the proper index based on a timestamp field?


(David Pilato) #2

I'd typically use aliases and still build one index per day.

Then just switch the aliases every day...
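
To make the daily alias switch concrete, here is a minimal sketch that builds the request body for the `_aliases` API. The `logs-YYYY.MM.DD` naming scheme, the alias names, and the helper functions are all assumptions for illustration, not something from the thread. It also assumes cumulative windows (the last 3, 7, 14, and 30 days), since that is what the users search for.

```python
from datetime import date, timedelta

# Hypothetical alias names mapped to their window length in days.
WINDOWS = {"last-3d": 3, "last-7d": 7, "last-14d": 14, "last-30d": 30}


def daily_index(day: date, prefix: str = "logs-") -> str:
    """Name of the daily index for a given day (assumed naming scheme)."""
    return f"{prefix}{day:%Y.%m.%d}"


def window(today: date, days: int) -> set:
    """Daily index names covering the last `days` days, including today."""
    return {daily_index(today - timedelta(days=age)) for age in range(days)}


def alias_actions(today: date) -> dict:
    """Body for POST /_aliases that moves every window forward by one day."""
    yesterday = today - timedelta(days=1)
    actions = []
    for alias, days in WINDOWS.items():
        before = window(yesterday, days)
        after = window(today, days)
        # The newest daily index enters the window...
        for index in sorted(after - before):
            actions.append({"add": {"index": index, "alias": alias}})
        # ...and the oldest one drops out of it.
        for index in sorted(before - after):
            actions.append({"remove": {"index": index, "alias": alias}})
    return {"actions": actions}


body = alias_actions(date(2024, 1, 31))
```

All actions in one `_aliases` request are applied atomically, so searches against an alias never see a half-switched window. The sketch assumes every daily index it names already exists.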

For older data (> 30 days), maybe it's better to reindex it into another index, but you have to think about removal at some point.
If you remove old data (after 3 months, for example), then just keep it in daily indices...

My 2 cents


#3

Thanks for your reply.
If I create daily indices, wouldn't there be a performance issue?
If I search for documents from the last month, I have to search 30 indices, which I guess can be really slow... can't it?

The other thing you mentioned, deleting data, is not an option: we have to store ALL data and keep it searchable, but performance is not a requirement in that case. Maybe I could have a dedicated cluster for old data with some synchronization?

And you're saying all these requirements can be met with our own solution?
Thanks


(David Pilato) #4

Elasticsearch scales out. So if you are hitting some limits, you can start new machines.
Of course each shard (so index) comes with a cost. You probably need to create one shard per index (but you have to test that - it depends on the document size and number of docs basically).

Well. It depends. Knowing that search is run in parallel on all shards, if you have enough physical resources, you might not notice a difference. Let me add that searching on one index with 30 shards and on 30 indices with one shard has the same cost.
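
For the custom time interval, a search can target several daily indices at once by listing them in the request path. Below is a minimal sketch of how that list could be built; the `logs-YYYY.MM.DD` naming scheme and both helper functions are assumptions for illustration.

```python
from datetime import date, timedelta


def indices_between(start: date, end: date, prefix: str = "logs-") -> list:
    """Daily index names for every day from `start` to `end`, inclusive."""
    days = (end - start).days
    return [f"{prefix}{start + timedelta(days=i):%Y.%m.%d}" for i in range(days + 1)]


def search_path(start: date, end: date) -> str:
    """Path of a search request that targets all indices in the interval."""
    return "/" + ",".join(indices_between(start, end)) + "/_search"


path = search_path(date(2024, 1, 30), date(2024, 2, 1))
```

For long intervals the path can get unwieldy, and a wildcard pattern such as `logs-2024.01.*` may be more practical than listing every index.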

So, even after 10 years, your data needs to be searchable at any time?
Then you'll have to pay a price (hardware and maybe reindexing).
In that case, you can reindex to reduce the number of indices, but keep in mind that you won't index 100 billion documents in one single index with one shard...
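
As a rough sketch of reindexing to reduce the number of indices, the snippet below groups daily indices by month and builds one `_reindex` request body per month. The `logs-YYYY.MM.DD` source names, the `logs-monthly-` destination prefix, and the helper function are assumptions for illustration.

```python
from collections import defaultdict


def monthly_reindex_bodies(daily_indices, prefix="logs-", dest_prefix="logs-monthly-"):
    """One POST /_reindex body per month, merging that month's daily indices."""
    by_month = defaultdict(list)
    for name in sorted(daily_indices):
        # "logs-2024.01.31" -> "2024.01"
        month = name[len(prefix):len(prefix) + 7]
        by_month[month].append(name)
    return [
        {"source": {"index": names}, "dest": {"index": f"{dest_prefix}{month}"}}
        for month, names in sorted(by_month.items())
    ]


bodies = monthly_reindex_bodies(
    ["logs-2024.02.01", "logs-2024.01.31", "logs-2024.01.30"]
)
```

The daily indices would only be deleted after checking that the merged index holds all the documents, and the merged index still needs enough shards for its size.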

I don't know your data, but if the amount of data is not that big, maybe you can index per week or per month in a single index?

In that case, use aliases again, but filtered aliases. Super handy.
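
A filtered alias is an alias with a query filter attached, so searching the alias only returns matching documents. Here is a minimal sketch of the `_aliases` body for one monthly index; the field name `timestamp`, the index name, the alias names, and the helper functions are assumptions for illustration.

```python
def filtered_alias_action(index: str, alias: str, field: str, days: int) -> dict:
    """An `add` action whose filter keeps only documents from the last `days` days."""
    return {
        "add": {
            "index": index,
            "alias": alias,
            # Date math: `days` days ago, rounded down to the start of that day.
            "filter": {"range": {field: {"gte": f"now-{days}d/d"}}},
        }
    }


def window_aliases(index: str, field: str = "timestamp") -> dict:
    """Body for POST /_aliases with one filtered alias per search window."""
    return {
        "actions": [
            filtered_alias_action(index, f"last-{days}d", field, days)
            for days in (3, 7, 14, 30)
        ]
    }


body = window_aliases("logs-2024.01")
```

Near a month boundary a window spans two monthly indices, so the same filtered alias would have to be added to both of them.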

Well, I wouldn't really use another cluster for that. But I'd probably use dedicated machines for hot indices and others for warm indices. You can do that out of the box with Elasticsearch just by defining some index and node settings.
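
One way to get the hot/warm split is shard allocation filtering: each node is tagged with a custom attribute in its configuration, and each index is told which attribute value it requires. The sketch below builds the index settings body; the attribute name `box_type`, the 30-day threshold, and the helper function are assumptions for illustration.

```python
from datetime import date

HOT_DAYS = 30  # hypothetical cut-off: younger indices stay on hot machines


def allocation_settings(index_day: date, today: date, attr: str = "box_type") -> dict:
    """Body for PUT /<index>/_settings that pins an index to hot or warm nodes."""
    age_days = (today - index_day).days
    tier = "hot" if age_days < HOT_DAYS else "warm"
    return {f"index.routing.allocation.require.{attr}": tier}


recent = allocation_settings(date(2024, 2, 20), date(2024, 3, 1))
old = allocation_settings(date(2024, 1, 1), date(2024, 3, 1))
```

When the setting on an existing index changes from hot to warm, Elasticsearch relocates its shards to nodes that carry the matching attribute value, so no reindexing is needed for the move.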


#5

David, thanks for your tips. It seems we have to build our system by continuously managing our indices and data, and we have to figure out (and test) the best amount and logical unit of data for one index.


(David Pilato) #6

Exactly! You have to know how much a single shard can hold at best, and then how many fully loaded shards a single machine can hold.

That will give you all the metrics you need to build your infrastructure and design your logical architecture (index period, number of shards per index...).
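
The arithmetic behind that sizing can be sketched as follows. Every number in the example is a made-up placeholder; the real shard capacity and shards-per-machine figures have to come from your own tests. The function and parameter names are hypothetical as well.

```python
import math


def plan(daily_gb, period_days, retention_days, max_shard_gb, shards_per_node, replicas=1):
    """Rough shard and machine counts for a given index period."""
    # Primary shards needed so that no shard exceeds the tested capacity.
    primaries = math.ceil(daily_gb * period_days / max_shard_gb)
    # Number of indices kept online for the whole retention period.
    indices = math.ceil(retention_days / period_days)
    # Each primary shard has `replicas` copies.
    total_shards = indices * primaries * (1 + replicas)
    nodes = math.ceil(total_shards / shards_per_node)
    return {"primaries_per_index": primaries, "indices": indices,
            "total_shards": total_shards, "nodes": nodes}


# Placeholder inputs: 20 GB/day, 1 year online, 40 GB per shard, 50 shards per machine.
daily = plan(daily_gb=20, period_days=1, retention_days=365, max_shard_gb=40, shards_per_node=50)
weekly = plan(daily_gb=20, period_days=7, retention_days=365, max_shard_gb=40, shards_per_node=50)
```

With these placeholder inputs, daily indices leave many half-empty shards while weekly indices fill them better, which is why the index period is worth testing alongside the shard size.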

Good luck! :slight_smile:

