Spark save to Elasticsearch fails due to connection error

I'm trying to index about 77M documents, each with 150 fields, into Elasticsearch.
We don't have much compute/memory available, so our cluster has 3 nodes (48 GB RAM, 24 CPUs and 6 TB of storage).

I'm sending the data from a Spark cluster in another virtual network, but the two networks are peered and I can ping all the Elasticsearch nodes from the Spark cluster nodes.

The problem I'm facing is that once a certain number of documents have been indexed (about 8M), Spark can no longer connect to Elasticsearch and throws the following error:

Job aborted due to stage failure: Task 173 in stage 9.0 failed 4 times, most recent failure: Lost task 173.3 in stage 9.0 (TID 17160, wn21-swspar.of12wietsveu3a3voc5bflf1pa.ax.internal.cloudapp.net, executor 3): org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[10.0.0.12:9200]] 
at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:149)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:466)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:450)
at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:186)
at org.elasticsearch.hadoop.rest.RestRepository.tryFlush(RestRepository.java:248)
at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:270)
at org.elasticsearch.hadoop.rest.RestRepository.doWriteToIndex(RestRepository.java:210)
at org.elasticsearch.hadoop.rest.RestRepository.writeToIndex(RestRepository.java:187)
at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:67)
at org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:101)
at org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
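For context, the write goes through EsSparkSQL.saveToEs. A simplified sketch of what the job does is below; the app name, source path and index name are placeholders, not my exact values:

```scala
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._   // adds saveToEs to DataFrames

val spark = SparkSession.builder()
  .appName("index-to-es")              // placeholder app name
  .config("es.nodes", "10.0.0.12")     // the node shown in the error above
  .config("es.port", "9200")
  .getOrCreate()

// roughly 77M rows with ~150 columns; the source path is a placeholder
val df = spark.read.parquet("/path/to/source")

// bulk-write into the target index/type (ES 6.x still uses a mapping type)
df.saveToEs("myindex/doc")
```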

I don't know what could be causing this. Is the cluster size (RAM/CPU) not enough, or is there a special configuration needed for indices with a huge amount of data?
What I'm sure about is that it's not a network problem.
Elasticsearch version: 6.2.4
Thank you.

I notice that there's only one node in the list of nodes to try. With a cluster of 3 nodes, this can only mean that you are running a single shard, as the connector filters out all nodes that aren't hosting primary shards while it writes. I would check that your index is running closer to three primary shards so that the connector has other nodes to fall back to.
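As a rough, untested sketch of what that check could look like (the index name myindex is a placeholder, 10.0.0.12:9200 is just the node from your error, and this uses plain JDK HTTP rather than any particular client):

```scala
import java.net.{HttpURLConnection, URL}
import scala.io.Source

// Minimal helper: send an HTTP request to the cluster and return the response body.
def request(method: String, path: String, body: Option[String] = None): String = {
  val conn = new URL(s"http://10.0.0.12:9200$path").openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod(method)
  body.foreach { b =>
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(b.getBytes("UTF-8"))
  }
  val stream = if (conn.getResponseCode < 400) conn.getInputStream else conn.getErrorStream
  try Source.fromInputStream(stream).mkString finally conn.disconnect()
}

// See how many primary shards the index has and which nodes they are allocated on.
println(request("GET", "/_cat/shards/myindex?v"))

// number_of_shards cannot be changed on an existing 6.x index, so a single-shard index
// would have to be rewritten into a new index created with three primaries, e.g.:
println(request("PUT", "/myindex", Some(
  """{ "settings": { "number_of_shards": 3, "number_of_replicas": 1 } }"""
)))
```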

Additionally, the logs should contain extra information, at error level and also at trace level, about why the connection failed. Can you check above this exception to see if there's anything there?
