Bulk master failure behaviour

Hi,

We are using a data generator to send records to a three-node Elasticsearch cluster. We use the Java API to perform bulk insert operations of 1000 records each. We previously used 2.1.1 and are now using 2.2.0. We are currently doing some failure testing in which we kill a master or a standard node to see the impact (if any) and log it. We note down any operation failures along with the operations that succeeded, and confirm the total number of written records using kopf.

We noticed previously in 2.1.1 that when we killed a master, some records were missing after a bulk insert. The shortfall was always less than 1000, and we presumed this was because only part of the bulk operation had been written, so successful confirmation could not be given. This resulted in some missing records in our counts at the end. With 2.2.0, we found that extra records had been recorded instead. This was because, although the bulk operation returned a failure, a number of its records had actually been written.

Is this expected behaviour for both versions?

Hi Robert,

You mention that you're using the Java API, and I assume you bulk index via the BulkProcessor API. If that's the case, we changed one thing between 2.1.x and 2.2: we introduced an automatic backoff for when we get an EsRejectedExecutionException. To be honest, I'd be a bit surprised if this explained the behavior you're describing, but you can try adding the following line when creating the BulkProcessor (provided you even use it):

builder.setBackoffPolicy(BackoffPolicy.noBackoff())

This will restore the previous behavior of BulkProcessor.

Anyway, is there any chance you can show (a potentially stripped-down version of) the source code of your test client? Then it might be easier to spot / reproduce the problem you're describing.

Daniel

Do you use random doc IDs with more than one concurrent (async) bulk request and with the default quorum consistency?

In that case, duplicate documents can occur, because the new 2.2 back-off feature assumes that all bulk items with EsRejectedExecutionException can be safely repeated, assuming they never reached a shard.

If you want to avoid duplicate documents, you should either use non-random doc IDs or disable the back-off feature. I am not fond of the new feature, because it hides errors and adds complexity to the client.
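
To illustrate the non-random ID suggestion, here is a minimal sketch in plain Java. It does not use the Elasticsearch client: a HashMap stands in for the index, and the class and method names (DeterministicIds, deterministicId, indexWithRetry) are made up for this example. It assumes each record's content is unique; if the generator can emit identical records, a sequence number would need to be part of the ID input. The idea is only that a repeated write with the same ID overwrites, while a repeated write with a fresh random ID adds a second document.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

public class DeterministicIds {

    // Derive a stable ID from the record content (name-based UUID).
    // Assumption: record content is unique per record.
    public static String deterministicId(String record) {
        return UUID.nameUUIDFromBytes(record.getBytes(StandardCharsets.UTF_8)).toString();
    }

    // Simulate sending the same batch twice (first attempt plus one retry)
    // into a map that stands in for the index, and return the document count.
    public static int indexWithRetry(List<String> records, boolean randomIds) {
        Map<String, String> index = new HashMap<String, String>();
        for (int attempt = 0; attempt < 2; attempt++) {
            for (String record : records) {
                String id = randomIds
                        ? UUID.randomUUID().toString()   // new ID on every attempt
                        : deterministicId(record);       // same ID on every attempt
                index.put(id, record);                   // same ID overwrites
            }
        }
        return index.size();
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("record-1", "record-2", "record-3");
        System.out.println("random IDs:        " + indexWithRetry(records, true));
        System.out.println("deterministic IDs: " + indexWithRetry(records, false));
    }
}
```

With three records sent twice, the random-ID run ends up with six documents and the deterministic-ID run with three, which matches the "extra records" symptom described above.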

Hi All,

Many thanks for your responses and apologies for not getting back to you sooner.

I will try setting the backoff policy to see if I can reproduce both behaviours again.

Just to let you know, we do use random doc IDs along with concurrent bulk requests, so it sounds like what you describe is happening!

It goes a long way to explaining it and allows me to provide some answers when asked about this in the future.

Thanks again,

Robert