Hi there! I'm looking for guidance on the best way to retrieve a large number of documents (>= 1000) when you know their IDs. I'd like it to be fast, yet efficient, and not put unnecessary strain on ES, since this lookup routine will run very often.
The mget documentation doesn't give any pointers as to how it's implemented or how it differs from a _search with a terms query. Are there use cases where one is a better fit than the other? It's not clear.
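To make the comparison concrete, here is a minimal sketch of the two request bodies being compared. The helper names `build_mget_body` and `build_search_body` are made up for illustration, and nothing is sent to a cluster; the sketch only builds the JSON you would POST to `/<index>/_mget` and `/<index>/_search`. Note that `_search` returns only 10 hits by default, so `size` has to be set explicitly (and it is capped by `index.max_result_window`, 10000 by default).

```python
import json


def build_mget_body(ids):
    """Body for POST /<index>/_mget: a plain list of document IDs."""
    return {"ids": list(ids)}


def build_search_body(ids):
    """Body for POST /<index>/_search: a terms query on the _id field."""
    ids = list(ids)
    return {
        # _search defaults to 10 hits, so ask for one hit per requested ID.
        "size": len(ids),
        "query": {"terms": {"_id": ids}},
    }


if __name__ == "__main__":
    doc_ids = ["1", "2", "3"]  # hypothetical IDs
    print(json.dumps(build_mget_body(doc_ids)))
    print(json.dumps(build_search_body(doc_ids)))
```

One practical difference that is visible from the responses alone: mget returns an entry for every requested ID, including misses (`"found": false`), while the search only returns the documents that matched.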
Purely on speed, both approaches seem fast and work fine with a large number of documents, but reading through this thread made me think that mget does N parallel individual "get" operations, which seems unnecessary and inefficient.
For retrieving N documents (a large number), I'd imagine a batched approach would be a much better fit: instead of running N parallel gets, you run J batches of retrieval, J being much smaller than N. With that in mind, I'm guessing a _search terms query takes this "batched" approach and is leaner, but what do I know! This is pure speculation, so I'd like to ask for some guidance on what each one does.
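The "J batches instead of N gets" idea can be sketched on the client side, independently of what ES does internally. This is only an illustration of the batching arithmetic; `chunked` is a hypothetical helper and the batch size of 250 is an arbitrary assumption, not a recommended value.

```python
def chunked(ids, batch_size):
    """Yield consecutive slices of `ids`, each with at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    ids = list(ids)
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


if __name__ == "__main__":
    doc_ids = [str(i) for i in range(1000)]  # N = 1000 hypothetical IDs
    batches = list(chunked(doc_ids, 250))    # J = 4 batches
    print(len(batches), [len(b) for b in batches])
```

Each batch could then be sent as one mget or one terms query, so the number of round trips is J rather than N.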
Great - thanks for the input. Do you know if the stress/load of the two is in the same ballpark?
I'm asking because speed is not always synonymous with efficiency. For example, one process might gain speed by opening 1000 concurrent threads, while another is slower but uses a more conservative number, e.g. 5 threads. In this scenario I'd consider the first process more "stressful" to a database. Not sure if that thinking applies here.
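The "5 threads instead of 1000" scenario can be sketched with a bounded worker pool. This only shows how a client would cap its own concurrency; it says nothing about how ES schedules work internally. `fetch_batch` is a hypothetical stand-in that returns fake documents instead of calling a cluster, and `max_workers=5` simply mirrors the number used in the example above.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_batch(batch):
    """Hypothetical stand-in for one retrieval request (mget or terms query)."""
    return [{"_id": doc_id} for doc_id in batch]


def fetch_all(batches, max_workers=5):
    """Run fetch_batch over all batches with at most `max_workers` in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input batches.
        results = pool.map(fetch_batch, batches)
        return [doc for batch_docs in results for doc in batch_docs]


if __name__ == "__main__":
    docs = fetch_all([["1", "2"], ["3", "4"], ["5"]], max_workers=2)
    print([d["_id"] for d in docs])
```

With this shape, the pool size (not the number of documents) bounds how many requests hit the cluster at once.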
I have asked internally about this question because I honestly don't know.
My gut feeling is that if the number of items to retrieve is low (let's say 100), then mget might be better.
But if you compare a 10000 mget vs a 10000 search on a single shard, I'd expect search to be more efficient because of the Lucene optimizations behind the scenes...
But I'm waiting for an answer to this very good question.