Optimal shards: 1 or number of nodes? Considerations

The question of the optimal number of shards per index has been asked before, and the answer is almost always "it depends on the use case". However, I am wondering what some of the considerations should be.

Imagine the following setup:

  • Elasticsearch 6
  • Monolithic per-customer indices, not time-based
  • Cross-index search never happens
  • No shard routing possible due to the use case and data model
  • Hard to predict growth per index
  • About 200-300 indices
  • About 1 TB of data without replicas
  • Complex mappings: think 100-200 fields, many of them nested
  • Complex queries (1,000+ lines of formatted JSON)
  • A cluster of 8 data nodes

Now, when I look at SO answers like this or blog posts like this, the advice seems to be to have just enough shards to keep your queries working; anything else is unnecessary overhead. If possible, you should try to stick with 1 shard per index.

However, this way you won't be able to leverage the full power of the 8 nodes. If you had 8 shards, spread evenly, you would divide the work for a single query over all nodes. Is the overhead of reducing the per-shard results on the coordinating node so great that this is not recommended? What would influence this decision or recommendation? Data model complexity, query complexity? When would one be preferable over the other?
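
One way to see how a single query fans out is the _search_shards API, which lists the shard copies (and the nodes holding them) that a search against a given index would hit. A minimal check, using a hypothetical customer index name:

    # which shards/nodes would a search on this index hit?
    GET /customer-a/_search_shards

With 8 primaries spread evenly, the response lists one shard copy per node; the per-shard results are then reduced on the coordinating node that received the request.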

What about more primary shards per node? Will that leverage multiple CPU cores efficiently, or will it always cause more overhead than it's worth?

Finally, how do replicas influence performance? If I understand correctly, exactly one copy of each shard (primary or replica) needs to be queried. Having more replicas of each primary simply means the per-shard work can be executed on a node which is less busy. In a cluster whose load is evenly balanced it will not affect performance.
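
As a side note, the replica count (unlike the primary shard count) can be changed on a live index at any time, which makes it easy to experiment with. A sketch, using a hypothetical index name:

    # add a second replica to an existing index
    PUT /customer-a/_settings
    {
      "index": {
        "number_of_replicas": 2
      }
    }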

So to sum up: what should be considered when deciding on the number of shards? Should the goal always be = 1? = number_of_nodes? Or even > number_of_nodes?

To get the most from a cluster we often recommend aiming for a shard size between 20 GB and 50 GB, because having lots of very small shards is inefficient. In your case I would recommend keeping a single primary shard per index. If this results in shards larger than the range mentioned above for some customers, I would consider giving just those customers a slightly higher number of shards.
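
As a sketch of what that could look like (the index names and the exact shard count for the large customer are just illustrations), the shard count has to be chosen at index creation time, since number_of_shards cannot be changed on an existing index:

    # typical customer: one primary shard
    PUT /customer-small
    {
      "settings": {
        "index": { "number_of_shards": 1 }
      }
    }

    # outlier customer that would otherwise blow past ~50 GB per shard
    PUT /customer-large
    {
      "settings": {
        "index": { "number_of_shards": 3 }
      }
    }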

Rather than trying to get all nodes involved in each request, I would expect traffic from different customers to even out the load across the cluster.

Can you think of any factors that might affect that recommended shard size? For example, we have been running into circuit breaker exceptions on default 5-shard indices where the average shard had grown to more than 30 GB.

This was back on ES 2.x, with even relatively simple queries that sort on a timestamp field. Not sure if anything has changed between ES 2 and 6 that would affect that recommendation.

It seems that with our setup an average shard size of 20-50 GB would be quite high, since things already start breaking while we are still at the lower end of that range.

Can you think of anything that might impact this recommendation for our setup? Could it be the complex mapping? Relatively large documents? Or anything else?

That shard size is a general guideline and can vary from use case to use case. Circuit breaker exceptions may indicate that your heap usage is too high for the current cluster configuration rather than that your shards are too big. Lots of small shards can also lead to increased heap usage. The best way to find the ideal setting for your use case is to test and benchmark.
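
To tell those apart, the node stats API exposes both heap usage and per-breaker trip counts; for example (filter_path just trims the response):

    GET /_nodes/stats/jvm,breaker?filter_path=nodes.*.jvm.mem.heap_used_percent,nodes.*.breakers

A non-zero tripped count shows which breaker is actually firing, and heap_used_percent shows how close the nodes are to the limits the breakers are protecting.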

I recently did some benchmarking with very big queries that yield 0 results and with empty queries, and it gave some pretty weird results. I was running against copies of the same index with different numbers of shards, using Rally with 8 simultaneous clients and different query setups. It seemed most logical to track the ops/s metric as a measure of cluster throughput.
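
For reference, a stripped-down version of the kind of Rally track I used, run against the existing cluster with the benchmark-only pipeline (the exact track schema differs a bit between Rally versions, and the index name and query body here are placeholders for our real ones):

    {
      "indices": [
        { "name": "customer-a-20shards" }
      ],
      "operations": [
        {
          "name": "complex-query",
          "operation-type": "search",
          "index": "customer-a-20shards",
          "body": { "query": { "match_all": {} } }
        }
      ],
      "challenges": [
        {
          "name": "query-only",
          "default": true,
          "schedule": [
            {
              "operation": "complex-query",
              "clients": 8,
              "warmup-iterations": 250,
              "iterations": 250
            }
          ]
        }
      ]
    }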

For those complex queries that do not return any results, it seems like the fewer shards, the more ops/s the cluster can provide. However, for the empty queries the result seems to be the inverse: any query that yields results seems to perform better on indices with more shards.

This seems to be in direct contrast to most recommendations, which say to keep the number of shards to a minimum.

I tested with an index of about 1.5M documents / 10 GB, in 20-, 5- and 1-shard configurations.

If you have a small data set and most data is held in the file system cache, you may be limited by CPU. If you have 5 shards, 5 threads can work on a single query in parallel, while only one thread can work on an index with a single shard. All concurrent queries will, however, execute in parallel, so I would expect the answer to depend on how many threads you have available and how many concurrent queries you are targeting.
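
How many search threads each node has, and whether they are saturated or queueing during a run, can be checked with the cat thread pool API, e.g.:

    GET /_cat/thread_pool/search?v&h=node_name,name,size,active,queue,rejected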

If your data set is in reality larger and/or you will have a higher number of concurrent queries, this may very well change the results, so always make sure you benchmark as realistic a scenario as you can.

The data set is a realistic representation and the requests are indeed CPU bound. However, with 8 clients in parallel, the benchmark on 20 shards seems to consume nearly all the CPU of the entire cluster. Having a smaller number of shards seems to increase latency and decrease ops/s, but at least the cluster won't be fully overloaded while handling this batch.

Would it be fair to conclude that fewer shards allow you to service more concurrent requests at once without running the risk of overloading the cluster?

We are currently on 20 shards for all 250 indices (≈14K shards including replicas), which is causing exactly this: one heavy user will push the entire cluster into critical load, causing timeouts for everyone else.

I am looking into bringing the number of shards down to more sensible values (leaning towards 4 for small and 8 for big (>20 GB) indices), but I am having a tough time benchmarking correctly to arrive at these values, especially since ops/s actually seems to go down with fewer shards.
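
Since the primary shard count of an existing index cannot be changed in place, one way to do that migration is to create a new index with the target shard count and reindex into it, roughly like this (index names are placeholders):

    PUT /customer-a-v2
    {
      "settings": {
        "index": {
          "number_of_shards": 4,
          "number_of_replicas": 2
        }
      }
    }

    POST /_reindex
    {
      "source": { "index": "customer-a" },
      "dest": { "index": "customer-a-v2" }
    }

The _shrink API is an alternative for the 20 → 4 case (4 is a factor of 20), but it cannot produce 8 shards from a 20-shard index, so reindexing is the more general route.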

I am using Rally to fire (just) queries directly at our production cluster (it is not feasible to set up a realistic copy of this environment for lab tests).

That sounds like an excessive number of shards. You should aim to have at most hundreds of shards per node, not thousands. Make sure you do not have too many small shards, as this can increase overhead.

Did you have the full data set and shard count in the cluster at the time of the benchmark? If so, what did heap usage and GC look like?

If you were running the benchmark on a node with considerably less data than is expected in production, I would be sceptical about drawing conclusions from the results, as more data may fit in the page cache and heap pressure may be lower than what you would see in production.
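
Heap and GC behaviour during a run can be pulled from the node stats, for example:

    GET /_nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc.collectors

Comparing the old-generation collection counts and times before and after a benchmark run gives a rough idea of how much GC pressure the run caused.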

That's exactly what I thought, but while running my benchmarks to prove that fewer shards are better, it actually seems that this high number of shards increases throughput rather than decreasing it due to overhead!

The index I tested against was a real production index on our production cluster. At about 15 GB in total it is about average; indices range between empty and ~150 GB in primaries. All are currently on 20 shards. I made copies of this index in 20-, 5- and 1-shard configurations (on the same cluster). The total shard count is stable at roughly 14K across 8 data nodes, and the total cluster size is about 2 TB including 2 replicas.
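
A quick way to get that size distribution (primary store size per index, sorted) is the cat indices API:

    GET /_cat/indices?v&h=index,pri,rep,docs.count,pri.store.size&s=pri.store.size:desc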

Each data node has a 12 GB heap out of 35 GB total memory, and 10 CPU cores.

During my benchmarks I watched the system load live via Cerebro. Heap usage seems to remain stable around 50%, fluctuating between 30% and 70%, but nothing extreme. Not sure about the GC patterns. When firing the benchmark against the 20-shard index I see higher ops/s, but the CPU of all nodes also goes to 100%. When firing the same benchmark against the 5-shard index I see lower ops/s, but also less load on the system.

All benchmarks are run with 8 clients and both warmup-iterations and iterations set to 250.

Am I looking at the wrong metrics? So far everything points to more shards being better, which is in direct contrast to what all the other guides are saying.

When you run the benchmark, do the following:

  • Make sure the node has a mix of indices and shards that is similar to what would be the case in production.
  • Make sure the total data volume is the same as you expect to have in production.
  • Do not send all the queries towards a single index unless that is what you expect in production. Make sure that a realistic number of indices are accessed/queried, as this is likely to have an impact on page cache usage as well as disk I/O patterns (see the sketch after this list).
  • Run with the number of concurrent queries you expect to see in production.
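
For the multi-index point above, Rally can run tasks against several indices at once via a parallel element in the schedule; a rough sketch of just the schedule portion of a track (placeholder operation names, and the exact schema depends on the Rally version):

    "schedule": [
      {
        "parallel": {
          "tasks": [
            { "operation": "query-customer-a", "clients": 4, "warmup-iterations": 250, "iterations": 250 },
            { "operation": "query-customer-b", "clients": 4, "warmup-iterations": 250, "iterations": 250 }
          ]
        }
      }
    ]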
