So I guess that begets another question: how do we determine that suitable target shard size?
Do we base it on available disk space? Or do we base some of it on how we plan to actually query the data? I can say that 50% of our searches will be on today's and yesterday's data, with the next 25% focused purely on the last few hours. Data that is 2+ days old will be rarely accessed.
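To make the question concrete, here's the kind of back-of-envelope arithmetic I'm imagining for sizing a daily index — the daily index size and target shard size below are placeholders, not real measurements:

```python
import math

# Back-of-envelope: how many primary shards per daily index?
# Both numbers are placeholders -- I don't have real figures yet.
daily_index_gb = 600    # assumed on-disk index size per day
target_shard_gb = 40    # assumed target size for a single shard

# Round up so no shard exceeds the target size.
primary_shards = math.ceil(daily_index_gb / target_shard_gb)
print(f"{primary_shards} primary shards of "
      f"~{daily_index_gb / primary_shards:.0f} GB each")
```

The real numbers obviously change the answer; it's the shape of the calculation I'm trying to pin down.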
It's hard for me to guess raw-data size vs. index size right now, because of the varying ingestion points. There are no compliance requirements driving retention for this data, so the goal is just to keep as much as we can and roll things off as needed.
Our current cluster is 40 bare-iron nodes, each with 500GB of SSD storage. That will double to 80 nodes here in a week or two, and will probably grow by another 20-40 nodes a few weeks after that.
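For the disk side of it, the same sort of placeholder math with the node counts above — the daily index size, replica count, and headroom fraction are guesses on my part:

```python
# Rough retention estimate from raw cluster capacity.
# Node count and per-node disk are our real numbers; the rest are guesses.
nodes = 80                  # after the upcoming expansion
disk_per_node_gb = 500      # SSD per node
usable_fraction = 0.80      # headroom for merges/recovery/watermarks (assumed)
replicas = 1                # one replica copy per primary (assumed)
daily_index_gb = 600        # placeholder until we have real ingest numbers

total_gb = nodes * disk_per_node_gb
usable_gb = total_gb * usable_fraction
per_day_gb = daily_index_gb * (1 + replicas)   # primaries + replica copies

print(f"Total disk: {total_gb / 1024:.1f} TB, usable: {usable_gb / 1024:.1f} TB")
print(f"~{usable_gb / per_day_gb:.0f} days of retention "
      f"at {per_day_gb} GB/day on disk")
```

If that shape of calculation is roughly right, then the retention window is really just a function of whatever daily index size we end up measuring.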