As I understand it, ES's query speed is a function of the term dictionary size, not the document count.
In my scenario, I have a large set of data that can be distributed across any arbitrary number of shards. The terms would be basically identical for each shard. The only variable would, of course, be which documents match which term.
As I see it, the more shards I have, the more duplicate term dictionaries I'll have. More disk, more RAM usage. Assuming I have 8 x 16 shards, I'll hit the vCPU limit (8) and cluster node limit (16), thus achieving maximum parallelism.
Is my logic correct? Or is there an advantage I'm missing to having more, smaller shard (document-count-wise) but with duplicate term dictionaries in each?
How many concurrent queries are you expecting to handle? If you optimize for latency (and assume disk I/O is not limiting performance) for a single concurrent query you probably want a reasonably large number of shards, especially if your querying is CPU intensive. If you are expecting a larger number of concurrent queries you the optimal shard count is likely different.
It is hard to reason analytically about this as there are a number of factors that affect latency, including data and queries, so the best best is to benchmark with realistic data, queries and query concurrency. If you index or update data you should also include this in the benchmark as it can affect caching efficiency.
Thanks @Christian_Dahlqvist. If disk isn't a limiting factor, wouldn't 1 shard per query thread be optimal? I.e. int((# of available_processors * 3) / 2) + 1
But I get that this is a benchmark point. Nevertheless, it seems highly redundant to me to repeat the same terms across a multitude of dictionaries beyond the thread count * node count.
One shard per query thread assumes processing will be evenly distributed across the nodes, which may not necessarily be the case. I would probably use less, especially if you expect more than one concurrent query.
I'm not sure I'm understanding you. Can more than 1 thread read the same shard simultaneously? (I don't know ES's locking mechanisms)
Regardless, are you saying to assign less than 1 shard per query thread?
My setup: multiple shards with virtually identical terms dictionaries, heavy concurrent queries, queries are routable to 1 shard.
If we had 1 node with the 8 vCPUs max you indicated in another post, the query thread pool would be 13 × 1. What shard number would you suggest given the data I provided?
I was going with 52, which allows me to start with a 6MiB RAM server and scale to 4 nodes before needing to reindex.
Yes, multiple threads can read the same shard in parallel, which is why a smaller shard count makes sense for higher query concurrency. As I said earlier I recommend you benchmark to find the optimum for your workload.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.