Bulk EsRejectedExecutionException question - What's the real bottleneck?

(Chris Fraschetti) #1

I get the general concept of why this exception happens, and I catch it on the client so I can resubmit the rejected documents. But...

In a multi-node cluster, what specifically does this exception imply?

Let's assume there's a hypothetical cluster with nodes A, B, and C with a single index with 6 shards (2 shards per node).

When a bulk request to node A results in an EsRejectedExecutionException...

  • Does this imply node A is simply unable to distribute the documents from the bulk request fast enough to the appropriate nodes for indexing?

If so, I would think a solution would be smaller batch sizes, so that each batch sent to node A is partially handled by node A and partially by another node (B or C).

I also suspect this would imply that the documents rejected in the bulk request against node A could immediately be reissued via another bulk request to node B or C.


  • Does this imply node A itself may have an indexing backlog that's impacting node A's ability to service bulk requests?

If so, perhaps either the resources on the node are insufficient, or the indices being written to have too few shards (and we're not parallelizing the work sufficiently)?

I have good reason, based on various monitoring points, to believe my cluster has sufficient resources (I'm not pushing any particular memory, CPU, or disk IOPS limits), so I suspect the issue must lie in my index configuration (shard count) or batch size (perhaps simply too large for a single node to process at once?).
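For what it's worth, the per-item rejection handling described above can be sketched generically. This is a minimal sketch, not a library API: it assumes a bulk response shaped like the ES `_bulk` API's (a list of per-action `items`), and `is_rejected` / `split_rejections` are hypothetical helper names.

```python
def is_rejected(item):
    """Return True if a bulk response item failed with a thread-pool
    rejection (EsRejectedExecutionException), as opposed to a mapping
    error or other permanent failure that a retry won't fix."""
    # Each bulk item is keyed by its action type: "index", "create", etc.
    action = next(iter(item.values()))
    error = action.get("error", "")
    # In ES 1.x the error is a string; later versions return an object
    # with a "type" field. Handle both shapes.
    if isinstance(error, dict):
        return error.get("type") == "es_rejected_execution_exception"
    return "EsRejectedExecutionException" in str(error)


def split_rejections(bulk_response, docs):
    """Pair each submitted doc with its response item and separate the
    rejected ones so they can be resubmitted in a follow-up bulk call.
    Assumes docs are in the same order as bulk_response["items"]."""
    done, retry = [], []
    for doc, item in zip(docs, bulk_response["items"]):
        (retry if is_rejected(item) else done).append(doc)
    return done, retry
```

Only the docs in `retry` go back on the wire; permanently failed items should be logged rather than resubmitted forever.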

My clusters are currently still back on ES 1.7.1, although I suspect that may not be relevant to this topic.

Any input on the above would be greatly appreciated.

(Mark Walkom) #2

The answer is a bit of both.
It means that node A may need more resources, or that your bulk requests are too large. And sending the request to another node may indeed get it processed.

There are always improvements in performance in newer versions, definitely upgrade :slight_smile:

(Chris Fraschetti) #3

As always, thanks for the feedback.

Thanks, that helps. My current logic for handling the rejections has a delay/sleep interval, which it sounds like I can eliminate as long as my next bulk attempt is bound for another node with sufficient space in its bulk queue (which will happen, given my use of the Java Jest client, which round-robins requests among the configured ES hosts).
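That retry-without-sleep approach could look roughly like the sketch below. This is not Jest itself: `send_bulk` is a hypothetical transport function supplied by the caller that returns whichever docs a host rejected, and the round-robin that Jest does internally is modeled with `itertools.cycle`.

```python
from itertools import cycle


def index_with_retries(hosts, docs, send_bulk, max_rounds=5):
    """Resubmit rejected docs immediately, without sleeping: each attempt
    goes to the next host in round-robin order, so a node with a full
    bulk queue is not immediately re-hit.

    send_bulk(host, docs) -> list of docs that host rejected
    (hypothetical transport function; in Jest this would be a Bulk
    execute call whose failed items you filter for rejections).
    """
    host_iter = cycle(hosts)
    pending = docs
    for _ in range(max_rounds):
        if not pending:
            break
        pending = send_bulk(next(host_iter), pending)
    return pending  # anything still rejected after max_rounds
```

One caveat: if every node's bulk queue is full (cluster-wide backpressure rather than one hot node), immediate retries just burn requests, so a capped round count or a fallback backoff is still worth keeping.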

I'll also try to play with my bulk sizes to see if that helps.

It's on the list; luckily I have a rolling window of time-based indices, which will let me transition to 2.x and then eventually 5.x as my old indices roll off.

(Mark Walkom) #4

Definitely :smiley:
We suggest keeping them to about 5MB per request, irrespective of document count. That's a really rough guide, though; you can and should test different sizes, as your data and cluster are unique.
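A size-based batcher following that rough 5MB guide might look like this sketch. The helper name and the JSON serialization are assumptions for illustration; a real `_bulk` payload also includes an action metadata line per document, which this ignores.

```python
import json


def batches_by_size(docs, max_bytes=5 * 1024 * 1024):
    """Yield batches whose serialized payload stays under roughly
    max_bytes (default ~5MB), regardless of document count."""
    batch, size = [], 0
    for doc in docs:
        doc_bytes = len(json.dumps(doc).encode("utf-8"))
        if batch and size + doc_bytes > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += doc_bytes
    if batch:
        yield batch
```

Capping by bytes rather than by count keeps batches of small documents from being needlessly tiny and batches of large documents from blowing past the node's comfortable request size.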

(system) #5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.