So I guess that begets another question: how do we determine that suitable target shard size?
Do we base it on available disk space? Or do we base some of it on how we plan to actually query the data? I can say that 50% of our searches will be on today's and yesterday's data, with the next 25% focused purely on the last few hours. Data that is 2+ days old will be rarely accessed.
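To make the question concrete, here's the kind of back-of-envelope arithmetic I'm imagining for sizing a daily index — the daily index size and target shard size below are placeholders, not real measurements:

```python
import math

# Back-of-envelope: how many primary shards per daily index?
# Both numbers are placeholders -- I don't have real figures yet.
daily_index_gb = 600    # assumed on-disk index size per day
target_shard_gb = 40    # assumed target size for a single shard

# Round up so no shard exceeds the target size.
primary_shards = math.ceil(daily_index_gb / target_shard_gb)
print(f"{primary_shards} primary shards of "
      f"~{daily_index_gb / primary_shards:.0f} GB each")
```

The real numbers obviously change the answer; it's the shape of the calculation I'm trying to pin down.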
It's hard for me to guess raw-data size vs. index size right now, because of the varying ingestion points. There are no compliance requirements driving retention for this data, so the goal is just to keep as much as we can and roll things off as needed.
Our current cluster is 40 bare-iron nodes, each with 500GB of SSD storage. That will double to 80 nodes here in a week or two, and will probably grow by another 20-40 nodes a few weeks after that.
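For the disk side of it, the same sort of placeholder math with the node counts above — the daily index size, replica count, and headroom fraction are guesses on my part:

```python
# Rough retention estimate from raw cluster capacity.
# Node count and per-node disk are our real numbers; the rest are guesses.
nodes = 80                  # after the upcoming expansion
disk_per_node_gb = 500      # SSD per node
usable_fraction = 0.80      # headroom for merges/recovery/watermarks (assumed)
replicas = 1                # one replica copy per primary (assumed)
daily_index_gb = 600        # placeholder until we have real ingest numbers

total_gb = nodes * disk_per_node_gb
usable_gb = total_gb * usable_fraction
per_day_gb = daily_index_gb * (1 + replicas)   # primaries + replica copies

print(f"Total disk: {total_gb / 1024:.1f} TB, usable: {usable_gb / 1024:.1f} TB")
print(f"~{usable_gb / per_day_gb:.0f} days of retention "
      f"at {per_day_gb} GB/day on disk")
```

If that shape of calculation is roughly right, then the retention window is really just a function of whatever daily index size we end up measuring.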