Manage old data based on time

edeak · January 18, 2016, 1:03pm

Hi guys,

We are working on a system that handles tons of data, but our users typically search for data from the last 3 days. Other typical options are the last 7, 14, or 30 days, and we also have an option for a custom time interval.

Is there any way to set up indices as follows:

having an index which contains data just from the last 3 days
having an index which contains data between 3-7 days
having an index which contains data between 7-14 days
having an index which contains data between 14-30 days
having a huge index which contains data older than 30 days.
And when a document becomes old enough, can it be moved automatically to the proper index based on a timestamp field?

dadoonet · January 18, 2016, 1:50pm

I'd typically use aliases and still build one index per day.

Then just switch the aliases every day...
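
As a rough illustration of the daily alias switch, here is a minimal sketch that builds the request body for the Elasticsearch `_aliases` endpoint. The index prefix `data`, the alias name `last-3-days`, and the `data-YYYY.MM.DD` naming pattern are assumptions made for the example, not something stated in this thread. The code only builds the JSON body; sending it with a `POST` to `/_aliases` is left to whatever client you use.

```python
from datetime import date, timedelta


def daily_index_name(prefix: str, day: date) -> str:
    """Name of the daily index for a given day, e.g. data-2016.01.18 (assumed pattern)."""
    return f"{prefix}-{day:%Y.%m.%d}"


def alias_switch_actions(alias: str, prefix: str, today: date, window_days: int) -> dict:
    """Body for POST /_aliases: drop the day that left the window, add today's index.

    Assumes both indices already exist; Elasticsearch rejects actions that
    reference a missing index.
    """
    expired = today - timedelta(days=window_days)
    return {
        "actions": [
            {"remove": {"index": daily_index_name(prefix, expired), "alias": alias}},
            {"add": {"index": daily_index_name(prefix, today), "alias": alias}},
        ]
    }


# Hypothetical names: alias "last-3-days" over indices prefixed "data".
body = alias_switch_actions("last-3-days", "data", date(2016, 1, 18), 3)
```

Both actions go in one request, so the alias is switched atomically and searches against `last-3-days` never see a partial window.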

For older data (> 30 days), maybe it's better to reindex it into another index, but you have to think about removal at some point.
If you remove old data (after 3 months, for example), then just keep it by day...

My 2 cents

edeak · January 18, 2016, 2:20pm

Thanks for your reply.
If I create daily indices, wouldn't there be a performance issue?
If I search for documents from the last month, I have to search 30 indices, which I guess can be really slow... can't it?

The other thing you mentioned, deleting data, is not an option: we have to store ALL data and keep it searchable, although performance is not a requirement in that case. Maybe I could have a dedicated cluster for old data with some synchronization?

And are you saying that all these requirements have to be fulfilled by our own solution?
Thanks

dadoonet · January 18, 2016, 2:51pm

Elasticsearch scales out. So if you are hitting some limits, you can add new machines.
Of course, each shard (and therefore each index) comes with a cost. You probably need to create one shard per index (but you have to test that; it basically depends on the document size and the number of docs).

Well. It depends. Knowing that a search runs in parallel on all shards, if you have enough physical resources, you might not notice a difference. Let me add that searching one index with 30 shards and searching 30 indices with one shard each have the same cost.

So, even after 10 years, your data needs to be searchable at any time?
Then you'll have to pay a price (hardware and maybe reindexing).
In that case, you can reindex to reduce the number of indices, but keep in mind that you won't index 100 billion documents in one single index with one shard...

I don't know your data, but if the amount of data is not that big, maybe you can index per week or per month in a single index?

In that case, use aliases again, but filtered aliases. Super handy.
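
To make the filtered-alias idea concrete, here is a minimal sketch that builds one `_aliases` request defining several time-window aliases over a single monthly index. The index name `data-2016.01`, the field name `timestamp`, and the alias names are assumptions for illustration. The `filter` uses a `range` query with Elasticsearch date math (`now-3d/d`), so each alias exposes only documents newer than its window.

```python
def filtered_alias_action(alias: str, index: str, timestamp_field: str, max_age_days: int) -> dict:
    """One "add" action whose filter keeps only documents newer than max_age_days."""
    return {
        "add": {
            "index": index,
            "alias": alias,
            "filter": {
                # Date math: now minus N days, rounded down to the start of the day.
                "range": {timestamp_field: {"gte": f"now-{max_age_days}d/d"}}
            },
        }
    }


# Hypothetical alias names mapped to the windows mentioned in the question.
windows = {"last-3-days": 3, "last-7-days": 7, "last-14-days": 14, "last-30-days": 30}

# Body for POST /_aliases; "data-2016.01" and "timestamp" are assumed names.
body = {
    "actions": [
        filtered_alias_action(name, "data-2016.01", "timestamp", days)
        for name, days in windows.items()
    ]
}
```

Because the filter is evaluated at search time, these aliases do not need to be switched daily; a window that spans two monthly indices would need the same alias added to both.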

Well, I wouldn't really use another cluster for that. But I'd probably use dedicated machines for hot indices and others for warm indices. You can do that out of the box with Elasticsearch just by defining some index and node settings.
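
One way the hot/warm split could look, sketched under assumptions: nodes are tagged with a custom attribute (here called `box_type`, an arbitrary name; it is set in `elasticsearch.yml` as `node.attr.box_type` on recent versions and as `node.box_type` on older ones), and each index is pinned to a tier through the `index.routing.allocation.require.*` index setting. The 30-day threshold below is borrowed from the question and is only an example. The function just builds the body for `PUT /<index>/_settings`.

```python
from datetime import date

# Assumed cut-off: indices younger than this stay on "hot" machines.
HOT_DAYS = 30


def allocation_settings(index_day: date, today: date, hot_days: int = HOT_DAYS) -> dict:
    """Index settings that require the index's shards to live on hot or warm nodes.

    "box_type" is a custom node attribute chosen for this example.
    """
    age_days = (today - index_day).days
    tier = "hot" if age_days < hot_days else "warm"
    return {"index.routing.allocation.require.box_type": tier}


today = date(2016, 1, 18)
recent = allocation_settings(date(2016, 1, 10), today)  # 8 days old
old = allocation_settings(date(2015, 12, 1), today)     # 48 days old
```

When the setting changes from `hot` to `warm`, Elasticsearch relocates the shards to matching nodes on its own, which covers the "move old data automatically" part at index granularity rather than per document.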

edeak · January 18, 2016, 3:20pm

David, thanks for your tips. It seems that we have to build our system by continuously managing our indices and data, and we have to figure out (and test) the best amount of data and logical unit for one index.

dadoonet · January 18, 2016, 4:44pm

Exactly! You have to know how much a single shard can hold at best, and then how many fully loaded shards a single machine can hold.

That will then give you all the metrics you need to build your infrastructure and design your logical architecture (index period, number of shards per index...).
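
The arithmetic behind that sizing exercise can be sketched as follows. Every number here is a made-up placeholder: the per-shard and per-node limits have to come from your own benchmarks, as described above.

```python
import math


def size_cluster(docs_per_day: int, max_docs_per_shard: int, shards_per_node: int,
                 retention_days: int, replicas: int = 1) -> dict:
    """Back-of-the-envelope capacity plan from measured per-shard and per-node limits."""
    total_docs = docs_per_day * retention_days
    primaries = math.ceil(total_docs / max_docs_per_shard)
    total_shards = primaries * (1 + replicas)
    return {
        # How many days of data fit in one single-shard index.
        "days_per_single_shard_index": max_docs_per_shard // docs_per_day,
        "primary_shards": primaries,
        "total_shards": total_shards,
        "nodes": math.ceil(total_shards / shards_per_node),
    }


# Hypothetical figures: 10M docs/day, 300M docs per shard, 20 shards per node, 1 year kept.
plan = size_cluster(10_000_000, 300_000_000, 20, 365)
```

With these placeholder inputs, one shard holds about 30 days of data, which would suggest monthly single-shard indices rather than daily ones.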

Good luck!
