Sizing for time data flow

(apologies in advance for yet another sizing post)

We are indexing approximately 2KB documents and ingesting about 50 million
documents daily. The index size ends up being about 75GB per day for the
primary shards (150GB/day with 1 replica). In our use case, after
1 month, we throw away 95% of the data but need to keep the rest
indefinitely. We are planning to use the "time data flow" mentioned in
Shay's presentations and are currently thinking about what time period to
use for each index. With a shorter period, the current month index may
behave better, but we'll end up accumulating lots of smaller indices after
the 1 month period.

We currently have a 4 node setup, each node with 12 cores, 96GB of RAM and
2TB of disk space over 4 disks. By my calculations, to hold one year of data
with r=1, we would need 150GB/day * 31 = 4.65TB for the initial month, then
150GB/day * 31 * 0.05 = ~232GB for each historical month, or about 2.5TB
over 11 months. That is 4.65TB + 2.5TB = 7+TB for 1 year of data. This seems
pretty tight to me considering additional space may be needed for merges,
etc.
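
For anyone who wants to check or tweak the arithmetic above, here is a rough
Python sketch of the same estimate. The function and parameter names are made
up for illustration, and treating the year as 1 live month plus 11 historical
months of 31 days each is my simplification, not a measured figure.

```python
def yearly_storage_gb(daily_primary_gb=75, replicas=1, days_per_month=31,
                      retained_fraction=0.05, historical_months=11):
    """Estimate disk needed for one live month plus the retained history."""
    # Every primary shard is stored once more per replica.
    daily_total_gb = daily_primary_gb * (1 + replicas)
    # The current month is kept in full.
    current_month_gb = daily_total_gb * days_per_month
    # Older months keep only the retained fraction (5% in our case).
    historical_gb = current_month_gb * retained_fraction * historical_months
    return {
        "current_month_gb": current_month_gb,
        "historical_gb": historical_gb,
        "total_gb": current_month_gb + historical_gb,
    }


estimate = yearly_storage_gb()
# Assumption: 4 nodes with 2TB each, taking 1TB = 1000GB.
cluster_capacity_gb = 4 * 2000
print(f"current month: {estimate['current_month_gb'] / 1000:.2f} TB")
print(f"historical:    {estimate['historical_gb'] / 1000:.2f} TB")
print(f"total:         {estimate['total_gb'] / 1000:.1f} TB")
print(f"share of disk: {estimate['total_gb'] / cluster_capacity_gb:.0%}")
```

This ignores merge overhead and any free-space headroom, so the real
requirement would be somewhat higher than what it prints.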

  1. Is accumulating a lot of indexes per node a concern here? If we did a
    daily index with 4 shards and r=1, that would be over 700 shards per node
    for 1 year. I know that there is a memory limitation on the number of
    shards that can be managed by a node.
  2. If we did a monthly index, that would be better for the historical
    indices, but the current month index would be huge, over 2TB.
  3. Is there any difference here between doing a daily index with fewer
    shards vs. a monthly index with more primary shards?
  4. How would having this many shards affect query performance? I assume
    there is some sweet spot of shards per node that must be found empirically?
    I would guess it's somewhat related to the number of disks/cores per node?
  5. I am also wondering about the RAM to data ratio and whether we'll get
    decent query performance. Due to our use case, we can't use routing. Is
    there any rule of thumb here?
  6. Another option we are considering is to do a daily index for the
    first month, and then have periodic jobs to combine the historical daily
    indexes into larger indices. So for example the first month = 31 daily
    indices and following months will get rolled up into 1 index per month. But
    we only want to do this extra work if it's needed.
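
To make the trade-off between options 1, 2 and 6 concrete, here is a small
Python sketch that counts shards per node and primary shard sizes for each
layout. The helper names are hypothetical, and using 4 primary shards for the
monthly and rolled-up indices is only an assumption for comparison (we have
not picked a number there).

```python
def shards_per_node(num_indices, primaries_per_index, replicas=1, nodes=4):
    """Average number of shards (primaries plus replicas) on each node."""
    total_shards = num_indices * primaries_per_index * (1 + replicas)
    return total_shards / nodes


def primary_shard_gb(index_primary_gb, primaries_per_index):
    """Size of one primary shard if data spreads evenly across shards."""
    return index_primary_gb / primaries_per_index


DAILY_PRIMARY_GB = 75   # primary data indexed per day
RETAINED = 0.05         # fraction kept after the first month
PRIMARIES = 4           # assumed primary shards per index in every layout

layouts = {
    # Option 1: one index per day, kept for the whole year.
    "daily, 1 year": shards_per_node(365, PRIMARIES),
    # Option 2: one index per month.
    "monthly, 1 year": shards_per_node(12, PRIMARIES),
    # Option 6: 31 daily indices plus 11 rolled-up monthly indices.
    "daily + rollup": shards_per_node(31 + 11, PRIMARIES),
}
for name, count in layouts.items():
    print(f"{name}: {count:.0f} shards per node")

sizes = {
    "live daily shard": primary_shard_gb(DAILY_PRIMARY_GB, PRIMARIES),
    "historical daily shard": primary_shard_gb(
        DAILY_PRIMARY_GB * RETAINED, PRIMARIES),
    "live monthly shard (month end)": primary_shard_gb(
        DAILY_PRIMARY_GB * 31, PRIMARIES),
    "historical monthly shard": primary_shard_gb(
        DAILY_PRIMARY_GB * 31 * RETAINED, PRIMARIES),
}
for name, gb in sizes.items():
    print(f"{name}: {gb:.1f} GB")
```

The numbers only show how the shard count and shard size pull in opposite
directions; they say nothing about where the actual sweet spot is.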
