We will be indexing 400GB of data/day into daily indexes.
Elastic recommends a maximum shard size of 50GB to avoid recovery and reallocation issues. Putting that aside, how much longer would aggregation queries take on larger shards: 20% longer, 100% longer? Cutting the number of primaries from 8 to 4 may offset some of the performance loss.
The aggregation queries use a constant-score filter with term clauses on unanalyzed fields and date range clauses, with terms aggregations over time intervals.
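For readers who want to picture the query shape described above, here is a rough sketch of such a request body, built as a plain Python dict. The field names (`status`, `host`, `@timestamp`), the filter value, and the one-hour interval are all made up for illustration. `fixed_interval` is the parameter name in recent Elasticsearch releases; older releases use `interval` instead.

```python
import json


def build_agg_request(start: str, end: str) -> dict:
    """Build a search body: constant-score filter plus terms-per-time-bucket aggs.

    All field names and values are hypothetical placeholders.
    """
    return {
        "size": 0,  # only aggregation results are needed, no hits
        "query": {
            "constant_score": {
                "filter": {
                    "bool": {
                        "filter": [
                            # exact match on an unanalyzed (keyword) field
                            {"term": {"status": "error"}},
                            # date range clause
                            {"range": {"@timestamp": {"gte": start, "lt": end}}},
                        ]
                    }
                }
            }
        },
        "aggs": {
            "per_interval": {
                # older releases use "interval" rather than "fixed_interval"
                "date_histogram": {"field": "@timestamp", "fixed_interval": "1h"},
                "aggs": {
                    "top_hosts": {"terms": {"field": "host", "size": 10}}
                },
            }
        },
    }


if __name__ == "__main__":
    body = build_agg_request("2020-01-01T00:00:00Z", "2020-01-02T00:00:00Z")
    print(json.dumps(body, indent=2))
```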
Each aggregation runs single-threaded against each shard (multiple shards can be aggregated in parallel, and multiple aggregations can run in parallel against the same shard). As far as I have been able to tell from my benchmarking, the time an aggregation takes against a shard is basically proportional to the number of records aggregated over. If a shard is twice as large, I would expect the minimum processing time for the same query/aggregation to be twice as long, assuming the same proportion of records is aggregated. In a real-life scenario it is naturally not quite that simple, as multiple shards can be queried in parallel and you generally have other processing competing for resources.
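To make the proportionality argument concrete, here is a back-of-the-envelope sketch. It is only a toy model of the reasoning above, not a measurement: the per-document cost (`secs_per_million_docs`), the matched fraction, and the document counts are invented numbers that you would have to replace with figures from your own benchmark.

```python
def shard_size_gb(daily_gb: float, primaries: int) -> float:
    """Average primary shard size for a daily index."""
    return daily_gb / primaries


def min_shard_agg_seconds(docs_in_shard: int,
                          matched_fraction: float,
                          secs_per_million_docs: float) -> float:
    """Toy model: per-shard aggregation time grows linearly with the
    number of records aggregated over (single thread per shard)."""
    matched_docs = docs_in_shard * matched_fraction
    return matched_docs / 1_000_000 * secs_per_million_docs


if __name__ == "__main__":
    # 400GB/day from the question, with 8 versus 4 primaries
    print(shard_size_gb(400, 8), shard_size_gb(400, 4))

    # Hypothetical shard contents and per-document cost
    base = min_shard_agg_seconds(100_000_000, 0.5, 2.0)
    doubled = min_shard_agg_seconds(200_000_000, 0.5, 2.0)
    print(base, doubled)
```

Under this model, halving the primaries from 8 to 4 doubles the average shard size and therefore the minimum per-shard aggregation time, before any effects of parallelism or resource contention are taken into account.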
Thank you so much, Christian. That was my experience with plain queries: time scales linearly with doc count. For this cold cluster, it seems that when searching the same doc count, 1,500 primary shards would outperform 3,000 primaries because half as many search requests would be added to the search thread pool. With 5GB shards, I've seen Elasticsearch perform well with 1,500 shards and very poorly with 3,000 shards.
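The thread pool argument can be sketched the same way. The model below is deliberately crude and rests on several assumptions that are not from this thread: the whole cluster is treated as one pool of `search_threads` workers, every shard-level request pays a fixed `overhead_secs_per_shard`, and requests are processed in full "waves". All numbers are placeholders.

```python
import math


def fan_out_seconds(total_docs: int,
                    shards: int,
                    search_threads: int,
                    secs_per_million_docs: float,
                    overhead_secs_per_shard: float) -> float:
    """Toy estimate of query time when one search fans out to many shards.

    Each shard request costs a fixed overhead plus time proportional to the
    documents in that shard; requests beyond the available threads queue up
    and are served in later waves.
    """
    docs_per_shard = total_docs / shards
    per_shard = overhead_secs_per_shard + docs_per_shard / 1_000_000 * secs_per_million_docs
    waves = math.ceil(shards / search_threads)
    return waves * per_shard


if __name__ == "__main__":
    # Same total doc count, split over 1,500 versus 3,000 shards
    for shards in (1500, 3000):
        print(shards, fan_out_seconds(3_000_000_000, shards, 100, 0.5, 0.25))
```

With the same total doc count, the document-proportional part of the work is identical in both cases; only the fixed per-request overhead doubles with 3,000 shards. How large that overhead really is depends on the cluster, which is something only a benchmark can tell you.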
With the algorithms we want to encode into Elasticsearch queries, it is not possible to partition the data.
It is expensive to test big data in AWS. Your advice shows me it is worth a POC.
Yes, it is always about finding a good balance, which is why we recommend benchmarking. Querying a large number of small shards also has overhead, but how much naturally depends on how many nodes you distribute them across. Going extremely small or extremely big is unfortunately rarely, if ever, the correct answer.