Hello,
I use Elasticsearch as a vector DB for 1024-dim vectors. My initial setup would be around 20 million vectors, but it may increase to 100 million vectors over time. However, I always pre-filter the vector search to a specific client_id - I have around 10k clients, each with between 100 and 10k+ documents. The daily volume is about 300k writes and searches. I am wondering whether it is better to use well-known routing with a static index of 20-30 shards, or would you suggest implementing an ILM policy with single-shard indices and rollover after 30-40 GB? To sum up, my concern is whether to implement ILM for a growing index or routing to eliminate redundant shard searches. Thanks in advance for your help!
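For reference, here is roughly what my filtered kNN search looks like with the Python client (index, field and client names are just placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Filtered kNN search: the term filter restricts candidates to one client's
# documents, which is the pre-filtering mentioned above.
resp = es.search(
    index="documents",
    knn={
        "field": "embedding",                 # dense_vector, 1024 dims
        "query_vector": [0.1] * 1024,         # placeholder query vector
        "k": 10,
        "num_candidates": 100,
        "filter": {"term": {"client_id": "client-42"}},
    },
    source=False,
)
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```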
Do you just add new documents or do you also perform updates and/or deletes?
Do you have a specified retention period for your data set?
I add new documents on a daily basis, and I also update and delete them based on user actions in the system. Retention period is not an issue, because this is only a database for ML.
In that case I think ILM and time-based indices are a bad fit, as they complicate updates and deletes and you do not want to use them to manage retention (which is their main purpose). I would go with a reasonably large number of primary shards together with routing based on the client ID.
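Something along these lines, sketched with the Python client (index name, mapping details and shard count are only placeholders, not a recommendation):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One static index with a fixed number of primary shards.
es.indices.create(
    index="documents",
    settings={"number_of_shards": 24, "number_of_replicas": 1},
    mappings={
        "properties": {
            "client_id": {"type": "keyword"},
            "embedding": {
                "type": "dense_vector",
                "dims": 1024,
                "index": True,
                "similarity": "cosine",
            },
        }
    },
)

# Writes use routing=client_id so each client's documents land on one shard.
es.index(
    index="documents",
    id="doc-1",
    routing="client-42",
    document={"client_id": "client-42", "embedding": [0.0] * 1024},
)

# Searches pass the same routing value, so only that one shard is queried.
# Keep the client_id filter as well: other clients can hash to the same shard.
es.search(
    index="documents",
    routing="client-42",
    knn={
        "field": "embedding",
        "query_vector": [0.0] * 1024,
        "k": 10,
        "num_candidates": 100,
        "filter": {"term": {"client_id": "client-42"}},
    },
)
```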
Okay, thank you for your opinion. I would probably go with around 20 shards, so they start at around 5 GB each and reach at most 30 GB.
As your clients differ in size and routing places all of a client's documents on a single shard, it is possible you will get an uneven shard size distribution, so it may be worthwhile going a bit higher on the shard count for that reason. At least test it if you can.
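One quick way to keep an eye on that is to watch per-shard store sizes, for example (index name again a placeholder):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# List primary shard sizes to spot skew caused by large clients being
# routed onto single shards.
for row in es.cat.shards(index="documents", h="shard,prirep,docs,store", format="json"):
    if row["prirep"] == "p":  # primaries only
        print(f"shard {row['shard']}: {row['docs']} docs, {row['store']}")
```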