[SPARK] es.batch.write.retry.count negative value is ignored

First off, thanks for the well organized and formatted post - made reading it a breeze.

Your understanding is correct; when too many tasks from Hadoop/Spark are hitting Elasticsearch, the latter starts rejecting documents, which the connector then retries based on the given configuration.

  1. Note that this is per index, not per cluster. That is, through the connector one writes to one index (or several) but not to the entire cluster. The index might be spread across all Elasticsearch nodes, but more often than not that is not the case.
    As such, you won't see the load on all nodes. With Marvel you should be able to pinpoint which nodes are used. If you reduced your executors to 1, then you really have one Spark task hitting one node in Elasticsearch, which takes all the load.
    However, it should be able to cope with that without any issues.

  2. es.batch.write.retry.count should work. Note that the connector has two types of retries:

  • one for network hiccups, in which case the request is retried after a timeout against a different node (since the connection to the initial node has proven unreliable)
  • one for rejected documents. The connection works, however the ES node keeps rejecting documents, so the connector retries them. This is what the es.batch.write.retry settings configure - a logical retry that relies on a wait interval and a number of attempts (see the sketch below).
    A negative value should indeed retry over and over again, waiting 100ms between attempts. It's best to specify the time unit explicitly (e.g. 100ms rather than just 100) since the default unit might not always be obvious.

More about them here.
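
For reference, here's a minimal sketch of how these settings are typically passed to the connector from Spark. The node address, index name and values are made up for illustration; es.http.retries covers the network-level retries while the es.batch.write.retry settings cover rejected documents:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs to RDDs

val conf = new SparkConf()
  .setAppName("es-retry-sketch")
  .set("es.nodes", "es-host:9200")           // hypothetical ES node
  // network-level retries (connection hiccups / unreliable node)
  .set("es.http.retries", "3")
  // rejected-document retries: how many attempts and how long to wait between them
  .set("es.batch.write.retry.count", "-1")   // negative = keep retrying, as described above
  .set("es.batch.write.retry.wait", "100ms") // spell out the time unit

val sc = new SparkContext(conf)
val docs = Seq(Map("id" -> 1, "msg" -> "hello"), Map("id" -> 2, "msg" -> "world"))
sc.makeRDD(docs).saveToEs("my-index/my-type") // writes target one index, not the whole cluster
```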

  1. Indeed, it should keep on retrying. It's not clear what causes the timeout after 10 minutes. Either way, this is not expected behavior and I would consider it a bug (even if the retries do apply and something else leads to the disconnect, that should be properly documented).

  2. See above. Make sure you look at the node that is actually being used rather than at the whole cluster. Even then, ES is designed to cope with load: rather than becoming completely overloaded and unable to take any type of request, it gives itself enough breathing room to serve other requests and thus limits heavy indexing/reading.

  3. I would say it is. A pull architecture running inside Elasticsearch is not suitable, since it affects the stability of the cluster - which is why rivers were deprecated. The connector acts as a liaison between the two parties, and its pushback (retry, stop reading) should be good enough going forward.
    And if that's not the case, it should be addressed.

In other words, I think your post, understanding, and expectations are valid; I'm not sure what is causing the current behavior, but I'd consider it a bug that should be fixed.

The 10 min timeout indicates there is a timeout/deadline somewhere that is being hit. I'm wondering if that's something controlled through Spark - namely, if a task has not completed in 10 minutes and there is no report of it being alive, it gets killed.
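
If it helps the investigation, here's a hedged sketch of the Spark-side deadlines I'd double-check first - this is purely a guess that the 10 min limit originates there, and the values below are illustrative rather than recommendations:

```scala
import org.apache.spark.SparkConf

// Spark-side timeouts that can kill a task which looks silent for too long.
// Worth checking whether any of these (or the cluster manager's equivalents) amount to ~10 minutes.
val sparkConf = new SparkConf()
  .set("spark.network.timeout", "600s")            // umbrella timeout for most network interactions
  .set("spark.executor.heartbeatInterval", "30s")  // how often executors report liveness to the driver
```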

Can you please turn on logging for the org.elasticsearch.hadoop.rest package, all the way to TRACE, and post the logs as a gist somewhere? Turning up the Spark logs, or at least keeping an eye on them, might also provide some useful info.
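
If it's easier, here's a quick sketch for bumping that package to TRACE programmatically from the driver (note this affects only the driver JVM; for executor-side logs the usual route is adjusting their log4j configuration, e.g. a log4j.logger.org.elasticsearch.hadoop.rest=TRACE entry):

```scala
import org.apache.log4j.{Level, Logger}

// Raise the connector's REST layer to TRACE so every request/response round-trip gets logged.
Logger.getLogger("org.elasticsearch.hadoop.rest").setLevel(Level.TRACE)
```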

P.S. I'm currently on leave so I might not reply as fast as I'd like however I'll be checking this thread until my return.