Shard size vs. query performance where all shard have the same terms

IanRC72 · August 2, 2019, 3:12pm

As I understand it, ES's query speed is a function of the term dictionary size, not the document count.

In my scenario, I have a large set of data that can be distributed across any arbitrary number of shards. The terms would be basically identical for each shard. The only variable would, of course, be which documents match which term.

As I see it, the more shards I have, the more duplicate term dictionaries I'll have. More disk, more RAM usage. Assuming I have 8 x 16 shards, I'll hit the vCPU limit (8) and cluster node limit (16), thus achieving maximum parallelism.

Is my logic correct? Or is there an advantage I'm missing to having more, smaller shard (document-count-wise) but with duplicate term dictionaries in each?

Christian_Dahlqvist · August 2, 2019, 3:41pm

How many concurrent queries are you expecting to handle? If you optimize for latency (and assume disk I/O is not limiting performance) for a single concurrent query you probably want a reasonably large number of shards, especially if your querying is CPU intensive. If you are expecting a larger number of concurrent queries you the optimal shard count is likely different.

It is hard to reason analytically about this as there are a number of factors that affect latency, including data and queries, so the best best is to benchmark with realistic data, queries and query concurrency. If you index or update data you should also include this in the benchmark as it can affect caching efficiency.

IanRC72 · August 2, 2019, 4:23pm

Thanks @Christian_Dahlqvist. If disk isn't a limiting factor, wouldn't 1 shard per query thread be optimal? I.e. int((# of available_processors * 3) / 2) + 1

But I get that this is a benchmark point. Nevertheless, it seems highly redundant to me to repeat the same terms across a multitude of dictionaries beyond the thread count * node count.

Christian_Dahlqvist · August 3, 2019, 5:28am

One shard per query thread assumes processing will be evenly distributed across the nodes, which may not necessarily be the case. I would probably use less, especially if you expect more than one concurrent query.

IanRC72 · August 3, 2019, 9:52am

I'm not sure I'm understanding you. Can more than 1 thread read the same shard simultaneously? (I don't know ES's locking mechanisms)

Regardless, are you saying to assign less than 1 shard per query thread?

My setup: multiple shards with virtually identical terms dictionaries, heavy concurrent queries, queries are routable to 1 shard.

If we had 1 node with the 8 vCPUs max you indicated in another post, the query thread pool would be 13 × 1. What shard number would you suggest given the data I provided?

I was going with 52, which allows me to start with a 6MiB RAM server and scale to 4 nodes before needing to reindex.

Christian_Dahlqvist · August 3, 2019, 10:35am

Yes, multiple threads can read the same shard in parallel, which is why a smaller shard count makes sense for higher query concurrency. As I said earlier I recommend you benchmark to find the optimum for your workload.

system · August 31, 2019, 10:35am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
One query thread per Shard? Elasticsearch	6	7164	February 13, 2017
Performance searching single index vs multiple indices Elasticsearch	9	18171	July 27, 2018
Signs of too few shards? Number of documents per shard? Elasticsearch	3	539	July 23, 2021
Importance of shard sizing for search performance Elasticsearch	3	416	July 7, 2020
Concurrent Search in elasticsearch Elasticsearch	7	2156	July 5, 2017

Shard size vs. query performance where all shard have the same terms

Related topics