Hello,
First post here, sorry if this question is out of place!
We are using the Multisearch API to try to limit the number of requests we send to our Elasticsearch cluster. The gist of the process is as follows:
- read a large data source (the target is ~70 million lines, but we are testing with 100k at the moment)
- as we read, build a multisearch request, up to the biggest acceptable size (so as to minimize the number of requests, as explained above)
- when the multisearch reaches said size, flush the request and process the results
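To make the steps above concrete, here is a minimal sketch of the loop in Python (not our actual code). The index name `my-index`, the `match` query on a field called `field`, and the `send_msearch` callable are all placeholders invented for illustration; `send_msearch` stands in for whatever client call posts the body to `_msearch` and returns one response per search.

```python
import json
from typing import Callable, Iterable, Iterator, List


def build_msearch_body(queries: List[dict], index: str) -> str:
    """Build the newline-delimited body: one header line, then one query line, per search."""
    lines = []
    for query in queries:
        lines.append(json.dumps({"index": index}))  # header line
        lines.append(json.dumps(query))             # search body line
    return "\n".join(lines) + "\n"                  # body must end with a newline


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield lists of at most `size` items, flushing the remainder at the end."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def run(lines: Iterable[str], batch_size: int,
        send_msearch: Callable[[str], list], index: str = "my-index") -> int:
    """Read lines, flush one multisearch per full batch, return the number of responses."""
    processed = 0
    for batch in batched(lines, batch_size):
        # Hypothetical query shape: one match query per input line.
        queries = [{"query": {"match": {"field": line}}} for line in batch]
        responses = send_msearch(build_msearch_body(queries, index))
        processed += len(responses)  # "process the results" is reduced to counting here
    return processed
```

With `batch_size=200` and 100k input lines, this sends 500 multisearch requests, one after the other.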
Now my question is about the "biggest acceptable size". When we tried running our program, we ran into this error:
EsRejectedExecutionException: rejected execution (queue capacity 1000)
After searching a bit, it appears that we set our dream number too high for Elasticsearch to gulp everything down. But after some trial and error on my PC (2 cores), I had to settle for a batch size of 200; at 250 I get the exception. Given that the defaults for the search thread pool are 3*#cores threads with a queue size of 1000, how come I saturate the queue with a batch size of 250? Is something amiss in our process, or does ES translate the 250 searches into something else internally?
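Here is the back-of-envelope arithmetic behind that question, as a small Python sketch. The 3*#cores threads and the queue of 1000 are the defaults quoted above. Everything else is a guess on my part that I have not confirmed: that each search in the batch becomes one queued task per shard, and that the index has 5 shards.

```python
def shard_tasks(batch_size: int, shards_per_search: int) -> int:
    """Tasks created if every search fans out to one task per shard (unconfirmed guess)."""
    return batch_size * shards_per_search


def fits_in_pool(batch_size: int, shards_per_search: int,
                 cores: int = 2, queue_size: int = 1000) -> bool:
    """True if all tasks fit in the running threads plus the queue."""
    threads = 3 * cores              # default search pool size quoted above
    capacity = threads + queue_size  # tasks being executed + tasks waiting
    return shard_tasks(batch_size, shards_per_search) <= capacity


if __name__ == "__main__":
    for batch_size in (200, 250):
        for shards in (1, 5):        # 5 shards is an assumption, not a measured value
            print(batch_size, shards,
                  shard_tasks(batch_size, shards),
                  fits_in_pool(batch_size, shards))
```

If one search really were one task, 250 searches would sit comfortably in a queue of 1000. Under the 5-shard guess, 200 searches give 1000 tasks and 250 give 1250, which would line up with what I observe, but I do not know whether that is what actually happens.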
At any rate, a batch size of 200 probably won't cut it speed-wise if we are to run through 70 million lines, and it doesn't seem right to mess with the thread pool settings; I'm not even sure it would fix things. I was naively hoping we could use a large batch size, similar to the bulk indexing we already do, but it seems multisearch is a different beast altogether.
We do not have much leeway with the cluster. Read: we won't be able to add more nodes, because we are only "experimenting" with ES. As such, is there anything in the Elasticsearch toolbox that could be useful for our use case?
Thank you!