Network related issue retry from spark to ES

Hello All.

I am using the Spark native library to connect to Elasticsearch. For non-network issues we have the batch retry config, which I read about in the post: [SPARK] es.batch.write.retry.count negative value is ignored

I am deliberately giving an invalid ES IP, and Spark errors out with the trace below. Since we are running in multi cluster mode, catching the exception is not feasible (I have tried), because the failure happens inside its own executor. Is there a config to set a retry count for network-related failures? FYI, I am using Elasticsearch 2.3.2.

Any inputs related to this would be very helpful. :slight_smile:

Exception stack trace:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 1 times, most recent failure: Lost task 2.0 in stage 1.0 (TID 3, localhost): org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:190)
at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:379)
at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:40)
at org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
at org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[127.0.0.11:9200]]
at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:142)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:434)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:414)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:418)
at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:122)
at org.elasticsearch.hadoop.rest.RestClient.esVersion(RestClient.java:564)
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:178)
... 10 more

The es.batch.write.retry.count setting is only respected when the server responds with an HTTP 429 or HTTP 503 status code, which usually means the server is too busy and has chosen to reject the request in order to exert backpressure on the writers. All other HTTP failure responses are treated as if they will not succeed no matter how many times the request is retried.
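
To make that rule concrete, here is a minimal Java sketch of the decision as described above. It is not the connector's actual source: the class name `BulkRetryPolicy`, the `shouldRetry` method, and the `attempt` counter are hypothetical names chosen for illustration, and only the handling of 429 and 503 comes from the explanation above.

```java
// Illustrative sketch only; not taken from the elasticsearch-hadoop code base.
final class BulkRetryPolicy {

    // Plays the role of es.batch.write.retry.count (assumed non-negative here).
    private final int maxRetries;

    BulkRetryPolicy(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    // httpStatus: status code returned by the server for the bulk request.
    // attempt: number of retries already performed (0 after the first failure).
    boolean shouldRetry(int httpStatus, int attempt) {
        // Only "server is busy" responses are considered worth retrying.
        boolean serverBusy = httpStatus == 429 || httpStatus == 503;
        return serverBusy && attempt < maxRetries;
    }
}
```

With `new BulkRetryPolicy(3)`, a 429 or 503 is retried until three retries have been used, while any other failure status is never retried. A connection error like the one in the trace never produces an HTTP status at all, so this rule does not come into play for it.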

The retry policy that is configured into the HTTP client library can be found on GitHub, if you are interested in exactly how that configuration setting is used.
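
For connection-level failures like the one in the trace, the settings I would look at are sketched below. As far as I know, `es.http.retries` and `es.http.timeout` are the connector's documented HTTP-layer settings, and `spark.task.maxFailures` is Spark's own task retry limit, but please check them against the documentation for your connector and Spark versions. The class name `EsNetworkSettings`, the `build` helper, and all values are hypothetical placeholders. Note that no retry count will help when the configured address is simply wrong.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects the settings in one place so that each entry
// can be passed to SparkConf.set(key, value) when the job is configured.
final class EsNetworkSettings {

    static Map<String, String> build(String esNodes) {
        Map<String, String> settings = new LinkedHashMap<>();
        // Address of the Elasticsearch node(s) the connector should contact.
        settings.put("es.nodes", esNodes);
        // Retries for establishing a (broken) HTTP connection to a node (placeholder value).
        settings.put("es.http.retries", "5");
        // Timeout for HTTP connections to Elasticsearch (placeholder value).
        settings.put("es.http.timeout", "2m");
        // Bulk retries; per the explanation above, only applied to 429/503 responses.
        settings.put("es.batch.write.retry.count", "3");
        // Spark-level limit on failures of a single task before the job is aborted.
        settings.put("spark.task.maxFailures", "4");
        return settings;
    }
}
```

Each entry from `EsNetworkSettings.build("your-es-host:9200")` would then be applied to the job's `SparkConf` before the context is created.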
