Sizing for time data flow

slushi · May 12, 2014, 5:25pm

(apologies in advance for yet another sizing post)

We are indexing documents of approximately 2KB each and ingesting about 50
million documents daily. The index size ends up being about 75GB per day for
the primary shards (with 1 replica, so 150GB/day in total). In our use case, after
1 month, we throw away 95% of the data but need to keep the rest
indefinitely. We are planning to use the "time data flow" mentioned in
Shay's presentations and are currently thinking about what time period to
use for each index. With a shorter period, the current month index may
behave better, but we'll end up accumulating lots of smaller indices after
the 1 month period.

We currently have a 4 node setup, each with 12 cores, 96GB of RAM and 2TB
of disk space over 4 disks. By my calculations, to hold one year of data
with r=1, we would need 150GB/day * 31 for the initial month, then
150GB/day * 31 * 0.05 for each of the 11 historical months = 4.65TB + 2.5TB =
7+TB for 1 year of data. This seems pretty tight to me against 8TB of total
disk, considering additional space may be needed for merges, etc.
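
The storage arithmetic above can be sketched as a short Python calculation. This is only a back-of-the-envelope check using the figures quoted in this post; the function and constant names are made up for illustration, 1TB is taken as 1000GB, and no headroom for merges or filesystem overhead is included.

```python
# Rough yearly storage estimate from the numbers in the post.
# All names are illustrative; nothing here is an Elasticsearch API.

DAILY_PRIMARY_GB = 75      # primary shard data indexed per day
REPLICAS = 1               # r=1 doubles the on-disk footprint
DAYS_PER_MONTH = 31        # worst-case month length used in the post
RETAINED_FRACTION = 0.05   # 95% of the data is discarded after 1 month
HISTORICAL_MONTHS = 11     # months kept in reduced form within 1 year
NODES = 4
DISK_PER_NODE_GB = 2000    # 2TB per node, assuming 1TB = 1000GB


def yearly_storage_gb(daily_primary_gb=DAILY_PRIMARY_GB,
                      replicas=REPLICAS,
                      days_per_month=DAYS_PER_MONTH,
                      retained_fraction=RETAINED_FRACTION,
                      historical_months=HISTORICAL_MONTHS):
    """Return (current_month_gb, historical_gb, total_gb) including replicas."""
    daily_total_gb = daily_primary_gb * (1 + replicas)
    current_month_gb = daily_total_gb * days_per_month
    historical_gb = current_month_gb * retained_fraction * historical_months
    return current_month_gb, historical_gb, current_month_gb + historical_gb


current, historical, total = yearly_storage_gb()
cluster_disk_gb = NODES * DISK_PER_NODE_GB

print(f"current month: {current / 1000:.2f} TB")
print(f"historical:    {historical / 1000:.2f} TB")
print(f"total:         {total / 1000:.2f} TB")
print(f"disk used:     {total / cluster_disk_gb:.0%} of {cluster_disk_gb / 1000:.0f} TB")
```

Changing `RETAINED_FRACTION` or `REPLICAS` shows quickly how sensitive the total is to the retention policy and the replica count.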

1. Is accumulating a lot of indices per node a concern here? If we did a
   daily index with 4 shards and r=1, that would be over 700 shards per node
   for 1 year. I know that there is a memory limitation on the number of
   shards that can be managed by a node.
2. If we did a monthly index, that would be better for the historical
   indices, but the current month index would be huge, over 2TB.
3. Is there any difference here between doing a daily index with fewer
   shards vs. a monthly index with more primary shards?
4. How would having this many shards affect query performance? I assume
   there is some sweet spot of shards per node that must be found empirically?
   I would guess it's somewhat related to the number of disks/cores per node?
5. I am also wondering about the RAM to data ratio and whether we'll get
   decent query performance. Due to our use case, we can't use routing. Is
   there any rule of thumb here?
6. Another option we are considering is to do a daily index for the
   first month, and then have periodic jobs to combine the historical daily
   indices into larger indices. So for example the first month = 31 daily
   indices and following months will get rolled up into 1 index per month. But
   we only want to do this extra work if it's needed.
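
To make the shard-count trade-off in the questions above concrete, here is a minimal Python sketch comparing the layouts. It only counts shards and shard sizes from the figures in this post and says nothing about query performance. The helper names are invented for illustration, and the 4 primary shards per monthly rollup index is a hypothetical choice, not something decided above.

```python
import math

# Shard bookkeeping for the layouts discussed in the post.
# Helper names are illustrative, not part of any Elasticsearch API.

NODES = 4
REPLICAS = 1
DAILY_PRIMARY_GB = 75


def shards_per_node(index_count, primary_shards, replicas=REPLICAS, nodes=NODES):
    """Average number of shard copies (primaries + replicas) held by each node."""
    return index_count * primary_shards * (1 + replicas) / nodes


def primary_shard_size_gb(index_primary_gb, primary_shards):
    """Size of one primary shard, assuming documents spread evenly."""
    return index_primary_gb / primary_shards


# Layout A: one index per day, 4 primaries each, kept for a full year.
daily_per_node = shards_per_node(index_count=365, primary_shards=4)
daily_shard_gb = primary_shard_size_gb(DAILY_PRIMARY_GB, 4)

# Layout B: one index for the current month. To keep shards the same size as
# in layout A, it needs as many primaries as 31 daily indices have in total.
monthly_primary_gb = DAILY_PRIMARY_GB * 31
monthly_primaries = math.ceil(monthly_primary_gb / daily_shard_gb)

# Layout C: 31 daily indices plus 11 monthly rollups.
# ASSUMPTION: 4 primaries per rollup index (hypothetical).
hybrid_per_node = shards_per_node(31, 4) + shards_per_node(11, 4)

print(f"daily, 1 year:     {daily_per_node:.0f} shards/node, "
      f"{daily_shard_gb:.2f} GB per primary")
print(f"monthly (current): {monthly_primaries} primaries for equal shard size")
print(f"daily + rollups:   {hybrid_per_node:.0f} shards/node")
```

For the current month, a monthly index with equally sized shards ends up with the same number of shards as 31 daily indices, so the difference between the layouts shows up mainly in the historical months, where the rollup approach keeps the per-node count far lower.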
