Bulk index API back pressure

Hi there,

We're using the bulk index REST API to insert and index new documents. Our configuration is 3 ES instances running as one cluster, with n clients sending bulk index requests (where n can be, say, 10). Each client sends only one request at a time, batching until it has either 1000 documents or 1 second has passed, whichever comes first. No further request is sent from a client until it receives a successful response to its bulk index request from ES. Each client also spreads its requests across the nodes on a round-robin basis. The clients know nothing of each other's activities.
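For illustration, each client's loop looks roughly like this (a simplified Python sketch using the `requests` library; index and type names here are made up, and `_type` is only needed on older ES versions):

```python
import itertools
import json
import time

import requests

NODES = ["http://es1:9200", "http://es2:9200", "http://es3:9200"]
MAX_DOCS = 1000   # flush at 1000 documents...
MAX_WAIT = 1.0    # ...or after 1 second, whichever comes first

node_cycle = itertools.cycle(NODES)  # round-robin across the cluster

def flush(buffer):
    """Send one bulk request and block until ES responds; nothing else
    is in flight until this returns successfully."""
    body = "".join(
        json.dumps({"index": {"_index": "logs", "_type": "event"}}) + "\n" +
        json.dumps(doc) + "\n"
        for doc in buffer
    )
    resp = requests.post(
        next(node_cycle) + "/_bulk",
        data=body,
        headers={"Content-Type": "application/x-ndjson"},
    )
    resp.raise_for_status()

def run(doc_source):
    buffer, last_flush = [], time.time()
    for doc in doc_source:
        buffer.append(doc)
        if len(buffer) >= MAX_DOCS or time.time() - last_flush >= MAX_WAIT:
            flush(buffer)
            buffer, last_flush = [], time.time()
    # a real client would also flush on a timer when the source goes quiet
```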

Our assumption is that ES will not respond successfully if it cannot handle the request, i.e. we're assuming that none of the documents will be dropped on the floor. A successful response therefore acts as our flow-control signal.

Is this assumption correct?

Thanks in advance.

Kind regards,
Christopher


Not entirely correct. Look into threadpools to get a better idea of how this works.

Also, you shouldn't be using a hard 1K limit; the best batch size is dependent on your document type and size as well as your node resources. Finding it requires testing.
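For example, you can keep an eye on the bulk thread pool while you test different batch sizes (a minimal Python sketch using `requests`; the exact columns vary by ES version):

```python
import requests

# The cat thread_pool API shows per-node pool stats; watch the bulk
# pool's active/queue/rejected columns while you load test.
print(requests.get("http://localhost:9200/_cat/thread_pool?v").text)
```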

Thanks Mark.

So if we set the bulk queue size to 0 then I think we'll get the back pressure we're looking for. We retry requests if they fail. Does that make sense? In essence, I don't want anything queued... if ES cannot handle the load being thrown at it, I want it to tell me...

On the 1K hard limit: we have no idea what size the incoming documents will be... Should we instead look at a threshold of, say, 5MB at a time?
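Roughly what I have in mind (a Python sketch using `requests`; names are hypothetical, and I'm assuming a rejected bulk shows up as a non-2xx response or as per-item 429s):

```python
import time

import requests

MAX_BYTES = 5 * 1024 * 1024   # flush at ~5MB buffered...
MAX_WAIT = 1.0                # ...or after 1 second, whichever comes first

def should_flush(buffer_bytes, last_flush):
    return buffer_bytes >= MAX_BYTES or time.time() - last_flush >= MAX_WAIT

def send_with_retry(node, body, max_attempts=5):
    for attempt in range(max_attempts):
        resp = requests.post(
            node + "/_bulk", data=body,
            headers={"Content-Type": "application/x-ndjson"},
        )
        # Note: a 200 can still carry per-item rejections, so check the
        # top-level "errors" flag as well as the HTTP status.
        if resp.ok and not resp.json().get("errors"):
            return resp
        time.sleep(2 ** attempt)  # back off before retrying
    raise RuntimeError("bulk request kept being rejected")
```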

Cheers,
-C

You can also use a size threshold; 5MB is a sane place to start. But again, I'd test a few different sizes to find the best performance.

In your use case, setting the threadpool queue to 0 may make sense, but it's not anything we'd generally recommend doing.

Great. Thanks.

Given a client that will retry, and has a buffering strategy, why wouldn't you recommend this?

What if the client's resend doesn't happen until N seconds or minutes after the last batch has finished? With threadpool queues there is a backlog that fills that potential resource gap.
What if your app gets a massive influx of events? Do you have a queuing system there to deal with that?

Basically they exist to provide a balance.

After 5MB is buffered or 1 second has passed since the last send, the client will attempt to send again... in many cases, then, this means the client will post again immediately.

Given our client-side buffering, round-robin and a cluster size of 3 it'll be interesting to see how we go. I'll report my findings if you like.

We're collecting log events - think something like Logstash. However, we have a "log collapsing buffer" strategy where, if certain client-side buffer thresholds are exceeded, we start rolling the oldest entries up into one, in reverse order of severity.
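Very roughly, the collapsing idea looks like this (a hypothetical Python sketch; our real severity ordering and summary format differ):

```python
def collapse(buffer, max_len):
    """When the buffer exceeds max_len, roll the oldest entries up into
    a single summary entry, folding them in reverse order of severity."""
    if len(buffer) <= max_len:
        return buffer
    overflow = len(buffer) - max_len + 1
    oldest, rest = buffer[:overflow], buffer[overflow:]
    rolled = sorted(oldest, key=lambda e: e["severity"], reverse=True)
    summary = {
        "severity": rolled[0]["severity"],  # keep the worst severity
        "message": " | ".join(e["message"] for e in rolled),
        "collapsed": len(rolled),
    }
    return [summary] + rest
```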

There is indeed the possibility that log events will get rolled up, but in practice we don't see that often.

Thanks for listening. This dialogue has been very useful for me.

No worries 🙂

I'm getting the feeling that the queue size has some other influence here.

Say we have one index with one shard and no replicas (dev mode). Given a 4-core machine, that means a bulk thread pool size of 4 by default. We set the queue size to 0, so I'd expect 4 bulk requests to be handled concurrently and anything else to be rejected.

What we're seeing, though, is that the majority of bulk requests are being rejected; in fact it looks as though only the first 4 get through and then the remainder are rejected.
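A minimal way to reproduce what we're seeing (a Python sketch using `requests`; endpoint and names are placeholders). It fires N bulk requests concurrently and tallies how they fail: whole-request errors vs per-item 429s inside a 200 response.

```python
import json
from concurrent.futures import ThreadPoolExecutor

import requests

ACTION = json.dumps({"index": {"_index": "test", "_type": "doc"}})
BODY = (ACTION + "\n" + json.dumps({"field": "value"}) + "\n") * 500

def probe(_):
    resp = requests.post(
        "http://localhost:9200/_bulk", data=BODY,
        headers={"Content-Type": "application/x-ndjson"},
    )
    if not resp.ok:
        return "rejected request"
    return "item rejections" if resp.json().get("errors") else "accepted"

with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(probe, range(20)))
print({r: results.count(r) for r in set(results)})
```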

Any further thoughts on this? Again, our objective is for ES to push back and reject requests unless it is in a position to process them immediately.