Question concerning multisearch request size and thread pool


#1

Hello,

First post here, sorry if this question is out of place!

We are using the Multisearch API to limit the number of requests we send to our Elasticsearch cluster. The gist of the process is as follows:

  • read a large data source (target is ~70 million lines, but we are testing with 100k at the moment)

  • as we read, we build a multisearch request, up to the biggest acceptable size (so as to minimize the number of requests, as explained above)

  • when the multisearch request reaches said size, we flush the request and process the results

Now my question is about the "biggest acceptable size". When we ran our program, we hit this error:
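
For reference, the read/batch/flush loop described above can be sketched roughly as follows. This is only a minimal illustration, not our actual code: `batched`, `build_msearch_body`, `run`, the index name and the `flush` callable are all hypothetical names. The only part taken from Elasticsearch is the newline-delimited multisearch body (one header line and one body line per search).

```python
import json
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def build_msearch_body(queries: List[dict], index: str) -> str:
    """Serialize queries into the newline-delimited msearch format:
    a header line and a body line per search, ending with a newline."""
    lines = []
    for query in queries:
        lines.append(json.dumps({"index": index}))  # header line
        lines.append(json.dumps(query))             # search body line
    return "\n".join(lines) + "\n"


def run(queries: Iterable[dict], index: str, batch_size: int,
        flush: Callable[[str], None]) -> None:
    """Read queries lazily and hand one msearch body per batch to `flush`.
    `flush` is a hypothetical callable that would send the body to the
    cluster and process the results."""
    for batch in batched(queries, batch_size):
        flush(build_msearch_body(batch, index))
```

Because `batched` consumes its input lazily, the full data source never has to fit in memory; only one batch is held at a time.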

EsRejectedExecutionException: rejected execution (queue capacity 1000)

After some searching, it appears that we set our dream number too high for Elasticsearch to gulp everything down. But after some trial and error on my PC (2 cores), I had to settle for a batch size of 200; at 250 I get the exception. Given that the defaults for the search thread pool are 3 × (number of cores) threads with a queue size of 1000, how come I saturate the queue with a batch size of 250? Is something amiss in our process, or does ES translate the 200 requests into something else internally?
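
One possible explanation, as a back-of-the-envelope check: each search in a multisearch is executed once per shard of the target index, so the search thread pool sees shard-level tasks rather than whole searches. The sketch below assumes an index with 5 primary shards (the default shard count in Elasticsearch versions of that era) all sitting on a single node. Both are assumptions about the setup rather than something stated here, and the capacity model (threads plus queue) is deliberately rough. The function names are made up for the illustration.

```python
def shard_tasks(batch_size: int, shards_per_index: int) -> int:
    """Shard-level search tasks produced by one multisearch request,
    assuming every search hits every shard of a single index."""
    return batch_size * shards_per_index


def pool_capacity(cores: int, queue_size: int = 1000) -> int:
    """Rough capacity of the search thread pool on one node:
    3 * cores running tasks plus `queue_size` waiting tasks."""
    return 3 * cores + queue_size


def fits(batch_size: int, shards_per_index: int, cores: int,
         queue_size: int = 1000) -> bool:
    """True if one batch can be accepted without rejections under this model."""
    return shard_tasks(batch_size, shards_per_index) <= pool_capacity(cores, queue_size)


# Assumed setup: 5 shards (assumption), 2 cores, queue of 1000 (from the post).
print(shard_tasks(200, 5), fits(200, 5, 2))  # 1000 True
print(shard_tasks(250, 5), fits(250, 5, 2))  # 1250 False
```

Under these assumptions, 200 searches produce 1000 shard tasks, which just fits, while 250 produce 1250, which does not. That would match the threshold observed between 200 and 250.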

At any rate, a batch size of 200 probably won't cut it speed-wise if we are to run through 70 million lines, and it doesn't seem right to mess with the thread pool settings. I'm not even sure it would fix things. I was naively hoping we could have a large batch size, similar to the bulk indexing we use, but it seems multisearch is a different beast altogether.

We do not have much leeway with the cluster. Read: we won't be able to put in more nodes, because we are only "experimenting" with ES. As such, is there anything in the Elasticsearch toolbox that could be useful for our use case?

Thank you!


(Mark Walkom) #2

You might be better off using the bulk API rather than multi search. I'd definitely try it either way.

Increasing the thread pools will only push the problem out. Ultimately, given you have limited resources, there's probably not a lot you can do.
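
If the cluster can't grow, one client-side option is to back off and resend only the searches that failed, instead of enlarging the queue. The following is a hedged sketch, not a tested recipe: `send_with_backoff` and `send` are hypothetical names, and it assumes each failed search comes back as a response item containing an `error` field, which it then treats as retryable without inspecting the cause.

```python
import time
from typing import Callable, List, Optional


def send_with_backoff(searches: List,
                      send: Callable[[List], List[dict]],
                      max_attempts: int = 5,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> List[dict]:
    """Send searches via `send` (hypothetical: takes a list of searches and
    returns one response dict per search, in order). Searches whose response
    contains an "error" key are resent after an exponentially growing delay."""
    results: List[Optional[dict]] = [None] * len(searches)
    pending = list(range(len(searches)))  # indices still waiting for a result

    for attempt in range(max_attempts):
        responses = send([searches[i] for i in pending])
        still_pending = []
        for i, resp in zip(pending, responses):
            if "error" in resp:
                # Assumption: any error is worth retrying (e.g. a rejection).
                still_pending.append(i)
            else:
                results[i] = resp
        pending = still_pending
        if not pending:
            return results
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...

    raise RuntimeError(
        f"{len(pending)} searches still failing after {max_attempts} attempts")
```

This only smooths out bursts. If the node is saturated all the time, throughput is still bounded by what the hardware can search.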


#3

I am off work at the moment, but I will definitely try this next Monday. I thought the bulk API was only meant for create/index/update/delete; at least that's what I understood after reading these pages:

If we can indeed use any kind of request (or at least search requests) with the bulk API, I'm confident it would solve our problem. Thank you for the heads-up! And on a side note, if this does work, I think the documentation could be rephrased so people do not make my mistake.


(Mark Walkom) #4

Uh yeah, sorry, I had a bit of a lapse there. You can't use bulk for this!

