Hello there.
I use OpenSearch on AWS to store embeddings generated by an ML model. We need to use it for similarity search.
I use 5 nodes with 4 CPUs and 8 GB of RAM each, 1 shard per index, and 5 replicas. The index currently holds 10 to 20 GB of data at most.
Basically our documents look like:
{_id: 12312312, ..., _source: {image_id: 12312312, product_id: 12112, embedding: [vector: 2048]}}
Everything is working fine, except for one of the simplest things: verifying that the ids we give in our queries also exist in Elasticsearch.
I tried to find a way to tell Elastic to raise an error when some of the queried ids don't exist in the index, but there seems to be no way to do that. This behaviour is absolutely needed for my application.
So the only possible way is to retrieve all of the ids from elastic and then do a comparison in the code.
The way I use mget is the following:
client.mget({"ids": all_ids[0:size]}, index=index, _source=False)
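A minimal sketch of the comparison step, assuming the mget response contains a `docs` list in which each entry has an `_id` and a `found` flag. The helper names and the `batch_size` parameter are illustrative only, and splitting the ids into batches is an assumption on my side, not something I have benchmarked:

```python
from typing import Any, Dict, Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `ids` with at most `size` elements each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def missing_from_response(response: Dict[str, Any]) -> List[str]:
    """Return the ids that an mget response reports as not found."""
    return [
        doc["_id"]
        for doc in response.get("docs", [])
        if not doc.get("found", False)
    ]


def find_missing_ids(client: Any, index: str, ids: Sequence[str],
                     batch_size: int = 1000) -> List[str]:
    """Run mget in batches and collect every id that does not exist.

    `client` is expected to expose the same `mget` call used above;
    `batch_size` is a hypothetical tuning knob.
    """
    missing: List[str] = []
    for batch in chunked(ids, batch_size):
        # _source=False keeps the embeddings out of the response.
        response = client.mget(body={"ids": list(batch)}, index=index,
                               _source=False)
        missing.extend(missing_from_response(response))
    return missing
```

The application can then raise its own error whenever `find_missing_ids(...)` returns a non-empty list.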
I benchmarked different values of size locally on my laptop, so there is probably some latency overhead, but actual production measurements showed 800 ms for 17000 docs, which is a lot.
took 2.1086799999999997s for 10000
took 2.4583180999999996s for 20000
took 3.088185199999998s for 30000
took 4.2098391000000035s for 40000
took 7.152762199999998s for 50000
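To make the scaling easier to read, the timings above can be converted into ids per second and milliseconds per id. This is only a small sketch over the numbers listed; the variable and function names are illustrative:

```python
from typing import Dict

# Batch size -> elapsed seconds, copied from the measurements above.
timings: Dict[int, float] = {
    10000: 2.1086799999999997,
    20000: 2.4583180999999996,
    30000: 3.088185199999998,
    40000: 4.2098391000000035,
    50000: 7.152762199999998,
}


def throughput(measurements: Dict[int, float]) -> Dict[int, float]:
    """Return ids processed per second for each batch size."""
    return {size: size / seconds for size, seconds in measurements.items()}


rates = throughput(timings)
for size, rate in rates.items():
    ms_per_id = 1000 * timings[size] / size
    print(f"{size} ids: {rate:.0f} ids/s, {ms_per_id:.3f} ms per id")
```

A roughly constant rate would suggest a per-id cost, while a rate that changes with the batch size would point towards fixed overhead per request or pressure at larger batches.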
These execution times are too high, and I am sure there is a huge improvement to be made. I am not an expert in Elastic. As I see it, the cause could be:
- the instances
- the number of nodes
- the number of shards
- the number of replicas
- the number of threads
- the JVM heap size
- the query actual parameters
- etc, etc, etc, ...
I have been working with Elastic for a few months now and have read a lot of posts, but I still feel like I am not learning anything and have no real control over this platform.
If you could help me with my use case and also point me in the right direction to gain more control over Elastic, I'd be really happy.
I am the only one responsible for Elastic in my company, so I'd like your point of view on whether this is feasible or whether Elastic needs to be handled by a team of experts.
Thank you