Mget too slow for a large number of documents

Hello There.

I use OpenSearch on AWS to store embeddings generated by an ML model. We need to use it for similarity search.
I use 5 nodes with 4 CPUs and 8 GB of RAM each, 1 primary shard per index, and 5 replicas. The data we currently have in the index is 10-20 GB at most.

Basically our documents look like:
{"_id": 12312312, ..., "_source": {"image_id": 12312312, "product_id": 12112, "embedding": [<2048-dimensional vector>]}}

Everything is working fine except for one of the simplest things: verifying that the IDs we pass in our queries actually exist in Elasticsearch.

I tried to find a way to tell Elasticsearch to raise an error when some of the query IDs don't exist, but there seems to be no way to do that. This behaviour is absolutely needed for my application.

So the only possible approach is to retrieve all of the IDs from Elasticsearch and then do the comparison in our own code.

The way I use mget is the following:

client.mget(body={"ids": all_ids[:size]}, index=index, _source=False)
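
For completeness, here is a minimal sketch of the batched, parallel variant I am considering (the batch size, worker count, and helper name are my own arbitrary choices):

from concurrent.futures import ThreadPoolExecutor

def find_missing_ids(client, index, ids, batch_size=1000, workers=8):
    # Split the ID list into batches so no single mget has to carry
    # tens of thousands of entries.
    batches = [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]

    def check(batch):
        # _source=False: we only care about the 'found' flag, not the
        # 2048-float embedding payload.
        resp = client.mget(body={"ids": batch}, index=index, _source=False)
        return [doc["_id"] for doc in resp["docs"] if not doc["found"]]

    missing = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch_missing in pool.map(check, batches):
            missing.extend(batch_missing)
    return missing

My thinking is that several smaller concurrent batches might beat one huge mget, since each batch can be served in parallel across replicas instead of being serialized through a single request.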

I benchmarked different values of size locally on my laptop, so there is probably some network-latency overhead, but actual production measurements showed 800 ms for 17,000 docs, which is a lot.

took 2.109 s for 10,000
took 2.458 s for 20,000
took 3.088 s for 30,000
took 4.210 s for 40,000
took 7.153 s for 50,000

These execution times are too high, and I am sure there is room for huge improvement. I am not an expert in Elasticsearch. To me, the cause could be:

  • the instances
  • the number of nodes
  • the number of shards
  • the number of replicas
  • the number of threads
  • the JVM heap size
  • the query actual parameters
  • etc, etc, etc, ...

I have been working with Elasticsearch for a few months now and have read a lot of posts, but I still feel like I am not learning anything, nor do I have any real control over the platform.

If you could help me with my use case and also point me in the right direction so I can get more control over Elasticsearch, I'd be really happy.
I am the only one responsible for Elasticsearch at my company, so I'd like your point of view: is this feasible for one person, or does Elasticsearch need to be handled by a team of experts?

Thank you

Welcome!

We don't support OpenSearch here. It's another project which is diverging from Elasticsearch.

We can only help if you are using Elasticsearch.


Thank you for your answer.

The fact that OpenSearch is used here is not the main concern.

It is still based on Elasticsearch. It just has a k-nearest-neighbour plugin that gives you advanced functionality.

My question concerns mget, which is a standard Elasticsearch command.

Hence, I need to understand whether the overall Elasticsearch installation is the issue here or not. It could be the shards, the replicas, or even the instances, as I said earlier.

Thanks

To give any advice (assuming it still behaves like Elasticsearch), I think more information is needed. It would help if you could answer the following questions (see the sketch after this list for one way to pull several of these numbers from the cluster APIs):

  • How many indices do you have in the cluster?
  • What is the total data volume stored per node?
  • What type of storage are you using? What does iowait look like? Does it look like disk performance might be the bottleneck?
  • What does CPU usage look like?
  • Is there any indication of frequent or slow GC in the logs?
  • How many of these mget queries are you issuing to the cluster per second at peak load?
  • What other load is the cluster under? Do you update or index data frequently?
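
Assuming the standard stats endpoints are still exposed (the AWS managed service restricts some APIs, so this may or may not work there), something like this answers most of the questions above in one pass. The endpoint URL is a placeholder and authentication is omitted:

import requests

BASE = "https://your-cluster-endpoint"  # placeholder

# Per-node store size, heap usage, GC totals, and CPU in one call.
stats = requests.get(f"{BASE}/_nodes/stats/indices,jvm,os").json()
for node in stats["nodes"].values():
    print(node["name"],
          "store_bytes:", node["indices"]["store"]["size_in_bytes"],
          "heap_used_%:", node["jvm"]["mem"]["heap_used_percent"],
          "young_gc_ms:", node["jvm"]["gc"]["collectors"]["young"]["collection_time_in_millis"],
          "cpu_%:", node["os"]["cpu"]["percent"])

# Number of indices and their sizes.
print(requests.get(f"{BASE}/_cat/indices?v&bytes=gb").text)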

Hello Christian and thank you for your response.
To answer your questions:

  • We have 3 indices in the cluster (embeddings, embeddings-qa, embeddings-dev)
  • 4.1 GB per node
  • As we use AWS to spawn the OpenSearch cluster, we have to use EBS SSD storage. There are not many options when creating a cluster (actually, it's the only option). There seems to be no way to see disk performance, since it's a fully managed cluster.
  • CPU usage looks normal; we have never had peaks.
  • Regarding GC, I have 2 graphs: JVMGCYoungCollectionCount & JVMGCYoungCollectionTime. Something interesting here: over the last 2 weeks, both the count and the time rose significantly. The count went from 2,000 to 6,000 and the time from 25,000 to 150,000+ ms.
  • Currently we don't have production traffic hitting our cluster, since we're still in the QA phase.
  • The cluster isn't under any specific load. We just use it to insert our vectors from time to time and run our searches.

Besides the disk-performance question, I hope I answered your questions and that you have a better picture now.

OK, I assume that limits the analysis. What type of EBS storage is it, and what is the size of the volume? Try to avoid gp2 EBS, as it has limited IOPS at small volume sizes.
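
To make that concrete: gp2 baseline performance scales with volume size, at 3 IOPS per GiB with a floor of 100 IOPS and a cap of 16,000, so a small volume sized for a few GB of data per node gets very little sustained I/O once its burst credits run out. A quick back-of-the-envelope check:

def gp2_baseline_iops(size_gib):
    # gp2: 3 IOPS per GiB, floor of 100 IOPS, cap of 16,000 IOPS.
    return min(max(3 * size_gib, 100), 16_000)

print(gp2_baseline_iops(10))   # 100  -> the floor dominates small volumes
print(gp2_baseline_iops(500))  # 1500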

If it turns out storage is limiting performance, you might be able to increase the amount of RAM so that your entire data set fits in the OS page cache. If you self-hosted your cluster you could probably keep your current heap size (if it is sufficient), but as this is a hosted solution the heap size is likely to grow with RAM.

Apart from that there is not much I can help with. I would recommend reaching out to the OpenSearch community and/or AWS support.

Thank you for the response! Regarding the EBS type: as I said, in the cluster configuration we can only choose EBS for most instances, and for I/O-optimized instances there seems to be a more efficient storage option. The only type of EBS we can choose is SSD, with few options. We cannot specify the IOPS number, even though it is mentioned in their documentation...

Regarding heap size, I have 16 GB max for each node, and after checking, not all of it is used, so I assume it is already memory-optimized.

What should I do about the GC?

Apart from that, I don't really see anything in my configuration that might hurt performance.

Thank you for your help

I am not familiar with the AWS OpenSearch service or the monitoring it offers. You need to identify what is limiting performance, which can be difficult without proper monitoring. Is it storage performance? One way to check would be to scale up the cluster until you know all your data can be cached, and then run your query a few times. If it speeds up after a few runs, once everything is cached, that would point to storage.
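
A crude way to run that experiment (client, index, and sample_ids here stand for whatever you already use):

import time

# Repeat the identical request; if the first run is much slower than the
# rest, the difference is roughly what you were paying for disk reads.
for run in range(5):
    start = time.perf_counter()
    client.mget(body={"ids": sample_ids}, index=index, _source=False)
    print(f"run {run}: {time.perf_counter() - start:.3f}s")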

Apart from that I do not have any suggestions.

Thank you, Christian, for the precious help.
