Connection error (check network and/or proxy settings) - all nodes failed

Subject: Connection error "all nodes failed" writing Hive to ES 7.7.1 via Spark UPSERT

Body:

Hi Elastic community,

We're encountering a persistent "Connection error (check network and/or proxy settings) - all nodes failed" when UPSERTing Hive data to Elasticsearch through Spark. Both clusters are verified operational.

Environment Details:

  • Elasticsearch: v7.7.1 (27 data nodes + 3 dedicated masters)

  • Network: Corporate internal network with DNS resolution to the 3 master nodes

  • Write Mode: Spark UPSERT operations

  • Data Volume: Batch size = 10k records/write

Current Configuration:

# Connectivity
target.es.nodes.wan.only = true  # Access via domain

# Write Tuning
target.es.batch.write.refresh = false
batchSize = 10000

# Spark Resources
ndi.spark.spark-argument.executor-cores = 2
ndi.spark.spark-argument.num-executors = 2
ndi.spark.spark-conf.spark.dynamicAllocation.maxExecutors = 4
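
For what it's worth, the target.* and ndi.* prefixes above look like they come from an internal ingestion wrapper rather than from elasticsearch-hadoop itself. Assuming they map one-to-one onto the standard connector keys, the plain es.* equivalents would be roughly the following sketch (the hostname is a placeholder):

// Hypothetical translation of the wrapper settings into the standard
// elasticsearch-hadoop option names (the connector reads plain es.* keys).
val esOptions = Map(
  "es.nodes"               -> "es.internal.example.com:9200", // placeholder for your DNS name
  "es.nodes.wan.only"      -> "true",   // use only the listed nodes, no cluster discovery
  "es.batch.size.entries"  -> "10000",  // flush the bulk request after 10k docs per task
  "es.batch.write.refresh" -> "false"   // do not force an index refresh after each bulk
)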

Troubleshooting Done:

✓ Validated cluster health (Green status)

✓ Confirmed DNS resolution to master nodes

✓ Tested basic curl connectivity to ES masters

✓ Reduced batch size & executors to limit load

Critical Questions:

  1. Domain Configuration:

    • Is target.es.nodes.wan.only=true sufficient when using DNS resolution?

    • Should we explicitly specify target.es.nodes with all master IPs?

  2. UPSERT-Specific Issues:

    • Could document version conflicts during UPSERT cause node-wide failures?

    • Is additional setup (e.g. es.mapping.id) required for UPSERTs vs plain inserts? (See the sketch after this list.)

  3. Node Failure Diagnostics:

    • Where to find connection refusal details in ES 7.7.1 logs?

    • Recommended net.tcp settings for heavy batch UPSERTs?

  4. Proxy Pitfalls:

    • How to verify if outbound proxy interferes with ES-Hadoop?

    • Required http.proxy* parameters if an internal proxy exists?
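
To make question 2 concrete, here is a minimal sketch of a Hive-to-Elasticsearch upsert using elasticsearch-hadoop's Spark SQL support. Table, index, column and host names are placeholders; the relevant parts are es.write.operation=upsert and es.mapping.id, which tells the connector which column carries the document _id to update:

import org.apache.spark.sql.SparkSession

// Minimal sketch (placeholder names throughout): Hive table -> Elasticsearch upsert.
val spark = SparkSession.builder()
  .appName("hive-to-es-upsert")
  .enableHiveSupport()
  .getOrCreate()

spark.table("my_db.my_hive_table")          // placeholder Hive source table
  .write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "es-data-01:9200,es-data-02:9200")  // placeholder data-node addresses
  .option("es.nodes.wan.only", "true")
  .option("es.write.operation", "upsert")   // upsert instead of the default "index"
  .option("es.mapping.id", "doc_id")        // placeholder column holding the document _id
  .option("es.batch.size.entries", "10000")
  .option("es.batch.write.refresh", "false")
  .mode("append")
  .save("my_index")                         // placeholder index name

On question 4, if an outbound proxy really does sit between the executors and the cluster, the connector's es.net.proxy.http.host and es.net.proxy.http.port settings can be passed as options in the same way.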

Thanks for your expertise!

It would help if you formatted the post better, as it is hard to read.

This is a very old version that has been EOL for a long time. I would recommend upgrading at least to the latest 7.17 release, but ideally to a version that is still supported.

Dedicated master nodes should NOT serve traffic. In a cluster like this all traffic should go directly to the data nodes, not the master nodes.
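
In practice that means pointing the connector's es.nodes at the data nodes (or a subset of them) rather than at the DNS name that resolves to the masters, for example (hostnames are placeholders):

// Placeholder hostnames: list data nodes, not the dedicated masters.
val esNodesOption = Map("es.nodes" -> "es-data-01:9200,es-data-02:9200,es-data-03:9200")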

In addition to this it would be great if you could answer the following questions:

  • What is the hardware specification of the different nodes in the cluster? What type of storage are you using?
  • What is the average document size?
  • Do you have any complex or advanced mapping features in use that could affect performance or resource usage, e.g. parent-child, nested documents or complex analysers?
  • How many indices and shards are you actively indexing into? Are these evenly spread across the cluster?

The answer is likely in your Spark executor log. But as the first reply said, it would be really helpful if you could format your post to make it more readable. And it would be wise to upgrade Elasticsearch.


Is there any way to fix this without upgrading? This problem only started happening in August and never occurred before. Is there any parameter that can solve this problem?

We don’t know what the problem is. You need to look in the Spark executor log. If there is nothing there, look in the Spark task logs. If there is nothing there, look in the Elasticsearch logs.
