Have you determined a suitable target shard size based on your data and queries? What is the relationship between raw data size and the size it takes up once indexed on disk? What is the size and topology of your cluster?
So I guess that begets another question: how do we determine that suitable target shard size?
Do we base it on available disk space, or on how we plan to actually query the data? I can say that 50% of our searches will be on today's and yesterday's data, with the next 25% focused purely on the last few hours. Data that is 2+ days old will rarely be accessed.
It's hard for me to estimate raw-data size vs index size right now, because of the varying ingestion points. We don't have any compliance requirements for this data, so the goal is just to keep as much as we can and roll things off as needed.
Our current cluster is 40 bare-metal nodes, each with 500GB of SSD storage. That will double to 80 nodes in a week or two, and will probably grow by another 20-40 nodes a few weeks after that.
Shard size will affect the minimum query latency, and this depends on your data and mappings as well as your queries. This is discussed in this video. Given the number of nodes in your cluster, I would say you are better off going with a daily index with a larger number of shards rather than hourly indices, as this allows you to spread out the indexing load.
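As a rough sketch (the template name, index pattern, and shard count below are placeholders for illustration, not recommendations), an index template can give each new daily index a higher primary shard count so indexing spreads across more nodes:

```python
# Sketch: apply an index template so each daily index (e.g. logs-2017.01.15)
# is created with a higher primary shard count. Template name, index
# pattern, and shard counts are illustrative only.
import requests

template = {
    "template": "logs-*",          # matches daily indices like logs-2017.01.15
    "settings": {
        "number_of_shards": 20,    # example value to spread primaries across nodes
        "number_of_replicas": 1
    }
}

resp = requests.put("http://localhost:9200/_template/daily_logs", json=template)
print(resp.json())
```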
If you are on Elasticsearch 5.x, you can also overshard while indexing to spread out the indexing load, and then use the shrink API to reduce the shard count for longer-term storage once indexing into the index has completed.
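A minimal sketch of that shrink workflow, assuming a hypothetical daily index named logs-2017.01.15 and a node named node-1 (both placeholders):

```python
# Sketch of the shrink workflow on Elasticsearch 5.x. Index names, node
# names, and shard counts here are placeholders.
import requests

ES = "http://localhost:9200"

# 1. Block writes and require a copy of every shard on a single node,
#    which the shrink API needs before it can run.
requests.put(f"{ES}/logs-2017.01.15/_settings", json={
    "settings": {
        "index.routing.allocation.require._name": "node-1",
        "index.blocks.write": True
    }
})

# 2. Shrink into a new index with fewer primary shards
#    (the target shard count must be a factor of the source count).
requests.post(f"{ES}/logs-2017.01.15/_shrink/logs-2017.01.15-shrunk", json={
    "settings": {
        "index.number_of_shards": 1,
        "index.number_of_replicas": 1
    }
})
```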
There is also a rollover API that may be useful. This is also described in this blog post.
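A small sketch of calling the rollover API, assuming a hypothetical write alias named logs_write pointing at the current index; the alias name and conditions are placeholders:

```python
# Sketch: ask Elasticsearch to roll the alias over to a new index once
# either condition is met. Alias name and thresholds are illustrative.
import requests

resp = requests.post("http://localhost:9200/logs_write/_rollover", json={
    "conditions": {
        "max_age": "1d",           # roll over after a day
        "max_docs": 100000000      # or once the index holds ~100M documents
    }
})
print(resp.json())  # reports whether a new index was created and which conditions matched
```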