I'm experimenting with custom routing to bring down the latency of search requests in my cluster. The routing_ids are assigned in groups of N. For example, a key with 1000 associated documents will be mapped to 10 different routing_ids (key-0, key-1, ..., key-9) when indexing. When searching, all of the routing_ids are used. This works well for keys with a relatively small number of docs. But for keys with 500K docs, the number of routing_ids is large (e.g. 4500), and when they are all passed as a parameter I get an HTTP 414 URI Too Long error. I believe this is because of the 8KB header size limit.
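In code, the grouping looks roughly like this (a minimal sketch; the group size of 100 and the helper name are just illustrative):

```python
# Sketch of the routing_id grouping described above; GROUP_SIZE = 100 is an assumption.
GROUP_SIZE = 100

def routing_id(key: str, doc_index: int) -> str:
    """Map the i-th document of a key to its routing group: key-0, key-1, ..."""
    return f"{key}-{doc_index // GROUP_SIZE}"

# 1000 documents for "mykey" end up in 10 routing groups: mykey-0 .. mykey-9
groups = {routing_id("mykey", i) for i in range(1000)}
print(sorted(groups))
```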
I found an old SO post about using wildcards for routing. In 2015, it wasn't possible. I was wondering whether routing with wildcards is possible in 2023.
Do you have any other suggestions to tackle this problem?
Please let me know if you need more info from my side.
Thanks
The way you describe your use of custom routing does not make sense to me, so I am not sure if I have missed something.
Which version of Elasticsearch are you using?
How many primary shards does the index you are indexing into have?
How do you envision custom routing improving performance in your use case?
The point of custom routing is that you provide one routing ID and this is used to determine which shard the document will be sent to at indexing time. This allows related data to be sent to the same shard. A common use case is to use a tenant ID as the routing key when many tenants share a large index with many primary shards.
When you query data related to just one routing ID, you supply it with the query, which allows Elasticsearch to identify the single shard that holds this data and query only that. This reduces the number of shards queried, which is what can speed up queries, especially if your index has many primary shards. Although you can provide multiple routing IDs at query time, providing a large number of custom routing IDs will likely mean that all shards are queried, which negates the benefit of using custom routing and offers no performance benefit.
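As a rough sketch of this standard pattern (using the Python client purely for illustration; the index, tenant and field names are made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a document for tenant "acme"; the tenant ID is the routing value,
# so all of acme's documents land on the same shard.
es.index(
    index="shared-index",
    routing="acme",
    document={"tenant": "acme", "title": "Q1 report"},
)

# Query only acme's data: supplying the same routing value lets Elasticsearch
# hit just the one shard that holds acme's documents instead of all of them.
resp = es.search(
    index="shared-index",
    routing="acme",
    query={"match": {"title": "report"}},
)
print(resp["hits"]["total"])
```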
I have 459 primary shards in the cluster for now. I expect this routing to help with keys that have a small number of associated documents. For example, for a key with 1000 associated documents, they will be directed to at most 10 shards. Searching all 459 shards is wasteful in such cases (> 90% of the time).
Correct. For keys with a large number of documents, search performance with and without routing is going to be the same. However, for those cases I don't have a way to tell with certainty that the documents will be distributed to all shards. If I assume each routing_id takes 16 bytes (16 chars), I can fit 448 routing_ids plus 448 chars for commas. This means keys with up to 448 groups of documents (44,800 docs if I group in 100s) can be searched with routing_ids without issues.
If I have a key with 60K documents and 600 (considering data overheads) primary shards, searching through all shards may be wasteful because, at most, they will all be placed in 448 shards.
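The back-of-the-envelope check behind the 448 figure (the 8KB limit, 16-char IDs and group size are the assumptions stated above):

```python
# How long does the comma-separated routing parameter get for N 16-char routing_ids?
HEADER_LIMIT = 8 * 1024   # assumed 8KB limit
ID_LENGTH = 16            # assumed length of each routing_id

n = 448
param_length = n * ID_LENGTH + (n - 1)              # IDs plus the commas between them
print(param_length, param_length <= HEADER_LIMIT)   # 7615 True -> fits, with some headroom
```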
Do you have a single index with 459 primary shards? That seems like a very odd number.
I still do not understand how you use multiple routing IDs in this way. Can you show an example of how you set routing ID when indexing a document and how you use the same routing IDs when querying?
Yes, it's an odd number, but that's the baseline configuration I'm testing with.
When indexing 1000 documents for a key mykey, the first 100 docs will be indexed with mykey-0 as the routing_id, the second 100 docs with mykey-1, and so on until the last 100 docs with mykey-9 as the routing_id.
When searching, I pass "mykey-0,mykey-1,...,mykey-9" as the value for the routing param.
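Roughly, in code (a sketch with the Python client; the index name, documents and group size are just illustrative):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

GROUP_SIZE = 100
key = "mykey"
docs = [{"key": key, "value": i} for i in range(1000)]  # placeholder documents

# Indexing: the i-th document goes to routing group key-(i // 100), i.e. mykey-0 .. mykey-9.
for i, doc in enumerate(docs):
    es.index(index="my-index", routing=f"{key}-{i // GROUP_SIZE}", document=doc)

# Searching: all routing groups for the key are passed as one comma-separated routing value.
routing_ids = ",".join(f"{key}-{g}" for g in range(len(docs) // GROUP_SIZE))
resp = es.search(index="my-index", routing=routing_ids, query={"term": {"key": key}})
```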
One solution is to increase the header size limit, if that is doable. Is that possible?
Can the cluster pick the query params from the body instead of from the request header?
Are these 1000 documents always queried together? What links the first 100 documents and separates them from the next 100? Why not index all 1000 with a single routing ID?
I perform exact KNN search with script_score, so all the documents of a given key are considered. The key used to pre-filter the documents is the link between the different groups. The reason not to index all 1000 in one shard is to keep the number of docs per shard somewhat evenly distributed.
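For reference, the query looks roughly like this (a sketch; the embedding field name, query vector and routing values are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

query_vector = [0.1] * 128  # placeholder query embedding

resp = es.search(
    index="my-index",
    routing="mykey-0,mykey-1,mykey-2",  # all routing groups for the key (truncated here)
    query={
        "script_score": {
            # Pre-filter to the key so only its documents are scored.
            "query": {"term": {"key": "mykey"}},
            "script": {
                # Exact (brute-force) scoring over the pre-filtered docs;
                # 'embedding' is an assumed dense_vector field name.
                "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                "params": {"query_vector": query_vector},
            },
        }
    },
    size=10,
)
```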
I think it is worth taking a step back and looking at the problem you are trying to solve. I am not sure using routing in this unusual way is necessarily the way to go.
What is the size of the index (primary shards only)? What is the average shard size in GB?
How is the size of the data set expected to grow over time?
What is the performance problem you are trying to solve this way?
What is the size and specification of your cluster?
What is the number of concurrent queries you need to support?
What query latencies are you seeing? What is the target latency?