Best practices for bulk indexing/retry handling?

We're using BulkProcessor for indexing and stumbled upon a problem when doing rolling restarts of our Elasticsearch cluster consisting of three nodes.

It appears that once the master node goes down, we get
`ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]`
and obviously fail to index bulk requests.

Two questions here:

  1. Does this need to be handled on the client side or is there any way to avoid this via cluster configuration?
  2. If we need to handle this in code, are there any best practices or suggested ways of handling failed bulks? We want to retry the failed data, but that would probably involve some non-trivial implementation: re-queueing the data, delaying or applying exponential backoff, merging retried data with newly arriving index requests, etc.

Thanks

  1. You need to handle this yourself. The cluster is not aware that there are clients that should be suspended.

  2. You also have to handle failed bulks yourself. It should be fairly easy to add suspend/resume logic on the client side (a rough sketch follows below). If not client side, a server-side plugin that does book-keeping of client IDs would be an option: suspend = tell all clients to write to a local file, resume = tell all clients they can replay from that file. That is a sort of client translog, neglecting the edge case of full disks; in that case you'd be better off halting all your clients before the rolling restart.
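Here is a minimal sketch of the client-side part against the 2.x transport-client API. The class name, the in-memory queue, and the `replayInto` helper are just illustrations and not part of the BulkProcessor API; a real implementation would persist the queue to a local file so it survives client crashes.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

import org.elasticsearch.action.bulk.BulkItemResponse;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;

/**
 * Buffers failed index requests so they can be replayed once the
 * cluster is reachable again (illustrative, not production code).
 */
public class RequeueingListener implements BulkProcessor.Listener {

    // Failed requests waiting to be replayed. A real "client translog"
    // would write these to a local file instead of keeping them in memory.
    private final Queue<IndexRequest> pending = new ConcurrentLinkedQueue<>();

    @Override
    public void beforeBulk(long executionId, BulkRequest request) {
        // nothing to do
    }

    @Override
    public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
        if (!response.hasFailures()) {
            return;
        }
        // The bulk went through but some items failed: re-queue only those.
        for (BulkItemResponse item : response.getItems()) {
            if (item.isFailed()) {
                Object original = request.requests().get(item.getItemId());
                if (original instanceof IndexRequest) {
                    pending.add((IndexRequest) original);
                }
            }
        }
    }

    @Override
    public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
        // The whole bulk failed, e.g. ClusterBlockException while there is no
        // master: buffer everything for a later replay ("suspend").
        for (Object original : request.requests()) {
            if (original instanceof IndexRequest) {
                pending.add((IndexRequest) original);
            }
        }
    }

    /** "Resume": feed the buffered requests back into a BulkProcessor. */
    public void replayInto(BulkProcessor processor) {
        IndexRequest next;
        while ((next = pending.poll()) != null) {
            processor.add(next);
        }
    }
}
```

You would call `replayInto(...)` from whatever code decides the cluster is healthy again, e.g. after a successful cluster health check.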

The BulkProcessor in ES 2.2+ has new backoff handling for rejection exceptions, enabled by default, but it assumes a cluster that can still react properly, so it's of little use in situations where the cluster is degrading.
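For reference, that backoff policy can be tuned when building the processor. This is only a sketch: the `buildProcessor` wrapper and the concrete values are arbitrary, `client` and `listener` come from your own setup, and the retry only kicks in for rejected executions, not for cluster blocks like the one above.

```java
import org.elasticsearch.action.bulk.BackoffPolicy;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;

class BulkProcessorFactory {

    // Wrapper and values are arbitrary; client and listener are assumed to exist.
    static BulkProcessor buildProcessor(Client client, BulkProcessor.Listener listener) {
        return BulkProcessor.builder(client, listener)
                .setBulkActions(1000)
                .setConcurrentRequests(1)
                // Retry bulks rejected by a busy node with exponential backoff:
                // 100ms, 200ms, 400ms, ... for at most 5 retries.
                .setBackoffPolicy(BackoffPolicy.exponentialBackoff(TimeValue.timeValueMillis(100), 5))
                .build();
    }
}
```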
