Max Shard Size: 50GB versus 100GB?

geena.rollins · March 31, 2017, 5:07pm

We will be indexing 400GB of data/day into daily indexes.

Elastic recommends a maximum shard size of 50GB to avoid recovery/reallocation issues. Putting that aside, how much longer would aggregation queries take against 100GB shards? 20% longer, 100% longer? Cutting the number of primaries from 8 to 4 may offset some of the performance loss.
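For reference, the shard sizes behind the 8-versus-4 question can be worked out with a small sketch. It assumes the 400GB/day figure is the on-disk size of the primaries and that documents are routed evenly across shards; neither is guaranteed, and `shard_size_gb` is just an illustrative helper name.

```python
def shard_size_gb(daily_volume_gb: float, primaries: int) -> float:
    """Average primary shard size for one daily index.

    Assumes even routing and that daily_volume_gb is the on-disk primary size
    (the indexed size can differ from the raw data size).
    """
    return daily_volume_gb / primaries


for primaries in (8, 4):
    print(f"{primaries} primaries -> {shard_size_gb(400, primaries):.0f}GB per shard")
```

Under those assumptions, 8 primaries give 50GB shards and 4 primaries give 100GB shards.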

The aggregation queries use a constant-score filter with term clauses on unanalyzed fields and date range clauses, plus terms aggregations over time intervals.
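A request of roughly that shape might look like the sketch below, written as a Python dict that could be sent as the search body. The field names (`status`, `host`, `@timestamp`), the values, and the one-hour bucket are made up for illustration. The `interval` parameter is the one used by the 5.x releases current when this was posted; later releases use `fixed_interval` or `calendar_interval` instead.

```python
import json

# Hypothetical fields: "status" and "host" are unanalyzed (not_analyzed/keyword)
# fields and "@timestamp" is a date field.
body = {
    "size": 0,  # only aggregation results are wanted, no hits
    "query": {
        "constant_score": {
            "filter": {
                "bool": {
                    "filter": [
                        {"term": {"status": "error"}},
                        {
                            "range": {
                                "@timestamp": {
                                    "gte": "2017-03-31T00:00:00Z",
                                    "lt": "2017-04-01T00:00:00Z",
                                }
                            }
                        },
                    ]
                }
            }
        }
    },
    "aggs": {
        "per_hour": {
            # "interval" is the 5.x parameter name
            "date_histogram": {"field": "@timestamp", "interval": "1h"},
            "aggs": {"by_host": {"terms": {"field": "host", "size": 10}}},
        }
    },
}

print(json.dumps(body, indent=2))
```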

Christian_Dahlqvist · March 31, 2017, 5:40pm

Each aggregation runs single-threaded against each shard (multiple shards can be aggregated in parallel, and multiple aggregations can run in parallel against the same shard). As far as I have been able to tell from my benchmarking, the time an aggregation takes against a shard is basically proportional to the number of records aggregated over. If a shard is twice as large, I would expect the minimum processing time for the same query/aggregation to double, assuming the same proportion of records is aggregated. In a real-life scenario it is naturally not quite that simple, as multiple shards can be queried in parallel and you generally have other processing competing for resources.
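As a back-of-the-envelope illustration of that reasoning, here is a toy model. It is not a measurement: the throughput figure, the document count, and the number of free threads are all invented, and it assumes per-shard time is exactly linear in document count.

```python
import math


def shard_agg_time_s(docs_in_shard: float, docs_per_second: float) -> float:
    """Single-threaded time for one shard, assumed linear in documents aggregated."""
    return docs_in_shard / docs_per_second


def index_agg_time_s(total_docs: float, primaries: int,
                     docs_per_second: float, free_threads: int) -> float:
    """Time for one aggregation over all primaries of an index.

    Shards are processed in parallel, in waves limited by free search threads.
    """
    per_shard = shard_agg_time_s(total_docs / primaries, docs_per_second)
    waves = math.ceil(primaries / free_threads)
    return per_shard * waves


TOTAL_DOCS = 800_000_000  # hypothetical documents per daily index
RATE = 10_000_000         # hypothetical docs/second aggregated by one thread

for free_threads in (8, 4):
    for primaries in (8, 4):
        t = index_agg_time_s(TOTAL_DOCS, primaries, RATE, free_threads)
        print(f"{free_threads} free threads, {primaries} primaries: {t:.0f}s")
```

With these made-up numbers, halving the primaries doubles the time when there are enough idle threads for every shard, but makes no difference when only 4 threads are free, which is the sense in which fewer, larger shards can offset some of the loss on a busy cluster.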

To be sure, the best way is to benchmark it.

geena.rollins · April 1, 2017, 3:40am

Thank you so much, Christian. That was my experience with plain queries: query time scales linearly with doc count. For this cold cluster, it seems that when searching the same doc count, 1,500 primary shards would outperform 3,000 primaries because half as many search requests would be added to the search thread pool. With 5GB shards, I've seen Elasticsearch perform well with 1,500 shards and very poorly with 3,000 shards.
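A rough sketch of that fan-out, under stated assumptions: one shard-level request per primary, shards spread evenly over the nodes, replicas ignored, and a search thread pool sized by the default formula as I recall it from the 5.x docs, `int((processors * 3) / 2) + 1` (worth verifying for your version). The node count and core count are hypothetical.

```python
import math


def search_pool_size(allocated_processors: int) -> int:
    """Default search thread pool size as recalled from the 5.x docs (an assumption)."""
    return (allocated_processors * 3) // 2 + 1


def fan_out(shards: int, nodes: int, processors_per_node: int):
    """Shard-level requests per node and how many 'waves' the pool needs for them."""
    requests_per_node = math.ceil(shards / nodes)
    threads = search_pool_size(processors_per_node)
    waves = math.ceil(requests_per_node / threads)
    return requests_per_node, threads, waves


# Hypothetical cluster: 10 data nodes with 8 cores each.
for shards in (1500, 3000):
    per_node, threads, waves = fan_out(shards, nodes=10, processors_per_node=8)
    print(f"{shards} shards: {per_node} requests/node, {threads} threads, {waves} waves")
```

On this hypothetical cluster a single search over 3,000 shards puts twice as many requests on each node's pool as one over 1,500 shards, so it needs about twice as many waves before the last shard is even started.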

With the algorithms we want to encode into Elasticsearch queries, it is not possible to partition the data.

It is expensive to test big data in AWS. Your advice shows me it is worth a POC.

Christian_Dahlqvist · April 1, 2017, 7:46am

Yes, it is always about finding a good balance, which is why we recommend benchmarking. Querying a large number of small shards also has overhead, though how much naturally depends on how many nodes you distribute them across. Going extremely small or extremely big is unfortunately rarely, if ever, the correct answer.

