The data set is a realistic representation and the requests are indeed CPU bound. However, with 8 clients in parallel, the benchmark against 20 shards seems to consume nearly all of the CPU in the entire cluster. A smaller number of shards per index seems to increase latency and decrease ops/s, but at least the cluster isn't fully overloaded while handling the batch.
Would it be a fair conclusion that fewer shards let you service more concurrent requests without running the risk of overloading the cluster?
We are currently on 20 shards for all 250 indices (=14K shards), which is causing exactly this: a single heavy user brings the entire cluster into critical load, causing timeouts for everyone else.
I am looking into bringing the number of shards down to more sensible values (leaning towards 4 for small indices and 8 for big (>20 GB) ones), but I am having a tough time benchmarking my way to those values, especially since ops/s actually seems to go down with fewer shards.
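For reference, this is roughly what I have in mind as a composable index template (ES 7.8+); the template name, index pattern, and replica count below are placeholders for our actual naming scheme and setup:

```sh
# Hypothetical template: new indices matching "small-*" get 4 primary shards.
curl -X PUT "http://localhost:9200/_index_template/small-indices" \
  -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["small-*"],
  "priority": 100,
  "template": {
    "settings": {
      "index.number_of_shards": 4,
      "index.number_of_replicas": 1
    }
  }
}'
```

Since number_of_shards is fixed at index creation time, this only affects newly created indices; the existing ones would need the _shrink API or a reindex to pick up the lower shard count.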
I am using Rally to fire (just) queries directly at our production cluster (it is not feasible to set up a realistic copy of this environment for lab tests).
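In case the setup matters: the runs use the benchmark-only pipeline against the existing cluster, roughly like this (host, track path, and track parameters are placeholders; the `race` subcommand assumes a reasonably recent Rally 2.x, older versions invoke `esrally` directly):

```sh
# Query-only run against an existing cluster; --pipeline=benchmark-only skips
# provisioning and just drives load at the target hosts.
esrally race --pipeline=benchmark-only \
  --target-hosts=es-prod-1:9200 \
  --track-path=/path/to/query-only-track \
  --track-params="search_clients:8" \
  --client-options="timeout:60"
```

The 8 parallel clients mentioned above come from the clients setting in the track's schedule (exposed here through the hypothetical search_clients track parameter).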