All Nodes Failed Exception

NLSVTN · June 10, 2021, 9:40pm

Hi,

We are getting the following error:

Error summary: EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[https://search-seqr-gris-prod-65wdlm6cncfxo5d326vkd4z6be.us-east-1.es.amazonaws.com:443]]

Previously we were able to solve it with the AWS support team (we use AWS EMR) by just increasing 'es.batch.size.entries' from '5000' to '50000'. This time we tried the same thing and increased it from '50000' to '500000', but it didn't work.

To read some more details on what we tried before when we faced the same issue:

Not sure if it's related, but we need to have 'es.nodes.wan.only' set to 'True' to allow HTTPS communication. Supposedly it significantly affects ES performance, so I suspect it may be related here.

I also wonder (although it's a separate question) why Elasticsearch was not designed for HTTPS (this situation suggests that it was not). Shouldn't security be a priority?
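For readers following along, the settings mentioned so far can be collected in one place. This is a minimal sketch, not the poster's actual configuration: `build_es_config` is a hypothetical helper, the default values are illustrative, and it assumes the resulting dictionary is handed to the `config` argument of Hail's `export_elasticsearch`. The setting names themselves (`es.batch.size.entries`, `es.batch.size.bytes`, `es.nodes.wan.only`, `es.net.ssl`) are elasticsearch-hadoop options.

```python
def build_es_config(batch_entries=5000, batch_bytes="1mb", use_https=True):
    """Return elasticsearch-hadoop settings as a dict of strings.

    Hypothetical helper for illustration; the values are examples only.
    """
    config = {
        # Maximum number of documents buffered per bulk request, per task.
        "es.batch.size.entries": str(batch_entries),
        # Maximum payload size per bulk request, per task.
        "es.batch.size.bytes": batch_bytes,
        # Disable node discovery and talk only to the declared endpoint,
        # which is what a hosted domain behind a single URL requires.
        "es.nodes.wan.only": "true",
    }
    if use_https:
        config["es.net.ssl"] = "true"
    return config


if __name__ == "__main__":
    # Assumed usage: pass the dict as the `config` argument of
    # hail.methods.export_elasticsearch (not executed here).
    for key, value in sorted(build_es_config(batch_entries=50000).items()):
        print(key, "=", value)
```

Keeping the values as strings mirrors how the connector reads them from a Hadoop or Spark configuration.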

DavidTurner · June 11, 2021, 6:40am

Could you share the output of GET / from your cluster?
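One way to collect that output from a script is sketched below. It assumes the endpoint answers an unauthenticated `GET /`; a domain that requires signed requests or credentials would need extra handling that is not shown. The function names and the injectable `opener` parameter are illustrative, not part of any Elasticsearch client.

```python
import json
import urllib.request


def fetch_root_info(base_url, opener=urllib.request.urlopen):
    """Issue GET / against the cluster and return the parsed JSON body."""
    with opener(base_url.rstrip("/") + "/") as response:
        return json.loads(response.read().decode("utf-8"))


def version_summary(info):
    """Pick out the fields that identify the exact build."""
    version = info.get("version", {})
    keys = ("number", "build_flavor", "build_type", "build_hash", "build_date")
    return {key: version.get(key) for key in keys}
```

The `opener` argument exists only so the function can be exercised without a live cluster.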

NLSVTN · June 11, 2021, 1:35pm

You mean this:

"version" : {
  "number" : "6.8.0",
  "build_flavor" : "oss",
  "build_type" : "zip",
  "build_hash" : "build_hash",
  "build_date" : "2021-04-21...",
  "build_snapshot" : false,
  "lucene_version" : "7.7.0",
  "minimum_wire_compatibility_version" : "5.6.0",
  "minimum_index_compatibility_version" : "5.0.0"
},

?

DavidTurner · June 11, 2021, 2:16pm

Yeah, but without the redactions please.

NLSVTN · June 11, 2021, 2:27pm

I can't provide hashes and cluster names for security reasons, why do you need this info?

Christian_Dahlqvist · June 11, 2021, 2:29pm

Is this related to an AWS Elasticsearch service cluster? If not, how is the cluster secured?

NLSVTN · June 11, 2021, 2:45pm

Yeah, same issue. Last time we ultimately solved it by increasing 'es.batch.size.entries', but now we have hit the same issue again.

Christian_Dahlqvist · June 11, 2021, 2:59pm

Increasing the batch size will at some point start creating more problems than it solves, and it seems like you have passed that point. The AWS Elasticsearch service is running a fork with its own plugins for security. I have no experience with these, how they behave or how they use resources under load.
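To put rough numbers on that point: elasticsearch-hadoop sends a bulk request once either `es.batch.size.entries` or `es.batch.size.bytes` is reached, and each Spark task keeps its own buffer. The sketch below estimates the payload per request and the total in flight. The average document size and the number of concurrent tasks are hypothetical inputs that would have to be measured for the real job, and the result is an approximation rather than what the connector computes internally.

```python
def parse_size(text):
    """Convert a size such as '1mb' or '512kb' to bytes."""
    units = {"kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}
    text = text.strip().lower()
    for suffix, factor in units.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    if text.endswith("b"):
        text = text[:-1]
    return int(text)


def bulk_estimate(entries, batch_bytes, avg_doc_bytes, concurrent_tasks):
    """Estimate bulk payload per request and across all concurrent tasks."""
    limit = parse_size(batch_bytes)
    by_entries = entries * avg_doc_bytes
    # Whichever threshold is reached first triggers the flush.
    per_request = min(by_entries, limit)
    return {
        "per_request_bytes": per_request,
        "limited_by": "entries" if by_entries <= limit else "bytes",
        "in_flight_bytes": per_request * concurrent_tasks,
    }
```

If the byte limit is the one being hit, raising the entry count further changes nothing; if the entry limit is the one being hit, every increase multiplies the data each task sends at once.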

DavidTurner · June 11, 2021, 5:31pm

The build hash and build date aren't security-sensitive, but they are important to being able to offer you help. The cluster name isn't especially important, I'm just trying to determine the specific version that you're running. Assuming it is 6.8.0 the build hash and date should match this:

{
  "name" : "hPjRkGk",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "mJY4-VYjS-6off9dFlScZw",
  "version" : {
    "number" : "6.8.0",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "65b6179",
    "build_date" : "2019-05-15T20:06:13.172855Z",
    "build_snapshot" : false,
    "lucene_version" : "7.7.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

Do they? The truncated build date you shared seems wildly wrong. If they don't match then that indicates you're not running a real Elasticsearch build. Maybe you're looking for AWS OpenSearch help instead? OpenSearch shares history with Elasticsearch, but it's quite different these days so we can't really help much with these issues.

Elasticsearch comes with HTTPS support out-of-the-box these days, but if you're not actually running Elasticsearch then that doesn't seem relevant.

NLSVTN · June 11, 2021, 8:15pm

I see. hash is '8169a24', date is '2021-04-21T19:26:55.782637Z'.
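The comparison requested earlier in the thread can be written down directly. This sketch uses the reference hash and date quoted for 6.8.0 and the values reported in this post; the helper name is illustrative.

```python
# Reference values for the 6.8.0 build, as quoted earlier in the thread.
REFERENCE_6_8_0 = {
    "build_hash": "65b6179",
    "build_date": "2019-05-15T20:06:13.172855Z",
}


def build_mismatches(version, reference=REFERENCE_6_8_0):
    """Return {field: (reported, expected)} for every field that differs."""
    return {
        key: (version.get(key), expected)
        for key, expected in reference.items()
        if version.get(key) != expected
    }


if __name__ == "__main__":
    reported = {
        "number": "6.8.0",
        "build_hash": "8169a24",
        "build_date": "2021-04-21T19:26:55.782637Z",
    }
    for field, (got, expected) in build_mismatches(reported).items():
        print(f"{field}: reported {got}, expected {expected}")
```

Both fields differ, even though the reported version number is the same.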

NLSVTN · June 11, 2021, 8:19pm

I am not sure what you mean by 'not actually running Elasticsearch'. We have a Hail python function that does the writing operation to AWS ES cluster:

https://hail.is/docs/0.2/methods/impex.html#hail.methods.export_elasticsearch

Which ultimately comes down to: elasticsearch-hadoop/EsRDDWriter.scala at master · elastic/elasticsearch-hadoop · GitHub

Christian_Dahlqvist · June 11, 2021, 8:45pm

You are running quite an old version, and it is Elasticsearch, but a forked version (note the different build hash) that AWS runs with its own plugins, which a lot of people hanging out here are not necessarily familiar with.

To know why it is failing it would be useful to see the full output of the cluster stats API and the Elasticsearch logs, but even then it would be hard to tell what impact custom plugins and changes may have.

NLSVTN · June 11, 2021, 8:59pm

Here it is:

{
  "_nodes" : {
    "total" : 23,
    "successful" : 23,
    "failed" : 0
  },
  "cluster_name" : "cluster_name",
  "cluster_uuid" : "cluster_uuid",
  "timestamp" : 1623445052654,
  "status" : "green",
  "indices" : {
    "count" : 12,
    "shards" : {
      "total" : 79,
      "primaries" : 57,
      "replication" : 0.38596491228070173,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 21,
          "avg" : 6.583333333333333
        },
        "primaries" : {
          "min" : 1,
          "max" : 21,
          "avg" : 4.75
        },
        "replication" : {
          "min" : 0.0,
          "max" : 19.0,
          "avg" : 1.8333333333333333
        }
      }
    },
    "docs" : {
      "count" : 68183560784,
      "deleted" : 1508157701
    },
    "store" : {
      "size_in_bytes" : 4474843386856
    },
    "fielddata" : {
      "memory_size_in_bytes" : 8745456,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 38650249280,
      "total_count" : 123179758,
      "hit_count" : 303656,
      "miss_count" : 122876102,
      "cache_size" : 13866,
      "cache_count" : 13953,
      "evictions" : 87
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 4554,
      "memory_in_bytes" : 951712008,
      "terms_memory_in_bytes" : 102031744,
      "stored_fields_memory_in_bytes" : 771786760,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 8960,
      "points_memory_in_bytes" : 0,
      "doc_values_memory_in_bytes" : 77884544,
      "index_writer_memory_in_bytes" : 0,
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set_memory_in_bytes" : 521170320,
      "max_unsafe_auto_id_timestamp" : -1,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 23,
      "data" : 20,
      "coordinating_only" : 0,
      "master" : 3,
      "ingest" : 20
    },
    "versions" : [ "6.8.0" ],
    "os" : {
      "available_processors" : 326,
      "allocated_processors" : 326,
      "names" : [ {
        "count" : 23
      } ],
      "pretty_names" : [ {
        "count" : 23
      } ],
      "mem" : {
        "total_in_bytes" : 2588452057088,
        "free_in_bytes" : 219945406464,
        "used_in_bytes" : 2368506650624,
        "free_percent" : 8,
        "used_percent" : 92
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 0
      },
      "open_file_descriptors" : {
        "min" : 1804,
        "max" : 2151,
        "avg" : 2084
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 1090016953,
      "mem" : {
        "heap_used_in_bytes" : 255663998232,
        "heap_max_in_bytes" : 639540002816
      },
      "threads" : 4996
    },
    "fs" : {
      "total_in_bytes" : 74829120610304,
      "free_in_bytes" : 70334216101888,
      "available_in_bytes" : 70333830225920
    },
    "network_types" : {
      "transport_types" : {
        "com.amazon.opendistroforelasticsearch.security.ssl.http.netty.OpenDistroSecuritySSLNettyTransport" : 23
      },
      "http_types" : {
        "filter-jetty" : 23
      }
    }
  }
}

Also full crash log:

hail.utils.java.FatalError: EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[https://search-seqr-gris-prod-65wdlm6cncfxo5d326vkd4z6be.us-east-1.es.amazonaws.com:443]]

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 221 in stage 13.0 failed 3 times, most recent failure: Lost task 221.2 in stage 13.0 (TID 18601, ip-172-23-79-169.ec2.internal, executor 30): is.hail.relocated.org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[https://search-seqr-gris-prod-65wdlm6cncfxo5d326vkd4z6be.us-east-1.es.amazonaws.com:443]]
	at is.hail.relocated.org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:149)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:380)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:364)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:216)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.bulk.BulkProcessor.tryFlush(BulkProcessor.java:185)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.bulk.BulkProcessor.flush(BulkProcessor.java:460)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.bulk.BulkProcessor.add(BulkProcessor.java:106)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.RestRepository.doWriteToIndex(RestRepository.java:187)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.RestRepository.writeToIndex(RestRepository.java:168)
	at is.hail.relocated.org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:67)
	at is.hail.relocated.org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:101)
	at is.hail.relocated.org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:101)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2043)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2031)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2030)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2030)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:967)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:967)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:967)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2264)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2213)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2202)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:778)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
	at is.hail.relocated.org.elasticsearch.spark.sql.EsSparkSQL$.saveToEs(EsSparkSQL.scala:101)
	at is.hail.relocated.org.elasticsearch.spark.sql.EsSparkSQL$.saveToEs(EsSparkSQL.scala:83)
	at is.hail.relocated.org.elasticsearch.spark.sql.package$SparkDataFrameFunctions.saveToEs(package.scala:49)
	at is.hail.io.ElasticsearchConnector$.export(ElasticsearchConnector.scala:44)
	at is.hail.io.ElasticsearchConnector$.export(ElasticsearchConnector.scala:20)
	at is.hail.io.ElasticsearchConnector.export(ElasticsearchConnector.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Christian_Dahlqvist · June 11, 2021, 9:07pm

NLSVTN:

"network_types" : {
      "transport_types" : {
        "com.amazon.opendistroforelasticsearch.security.ssl.http.netty.OpenDistroSecuritySSLNettyTransport" : 23
      },
      "http_types" : {
        "filter-jetty" : 23
      }
    }

These are, as far as I can tell, specific to the AWS Elasticsearch service and could very well be affecting the behaviour you are seeing. When I asked for logs I meant Elasticsearch server logs, not client logs.

Apart from this the statistics look fine as far as I can tell.
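A few ratios derived from the cluster stats posted above make that reading easier to check. The sketch below copies only the relevant fields from that output; the function name and the choice of ratios are illustrative, and none of them says anything about what the AWS-specific plugins do.

```python
# Fields copied from the cluster stats output posted earlier in the thread.
SAMPLE_STATS = {
    "indices": {
        "docs": {"count": 68183560784, "deleted": 1508157701},
        "query_cache": {"total_count": 123179758, "hit_count": 303656},
    },
    "nodes": {
        "jvm": {"mem": {"heap_used_in_bytes": 255663998232,
                        "heap_max_in_bytes": 639540002816}},
        "fs": {"total_in_bytes": 74829120610304,
               "free_in_bytes": 70334216101888},
    },
}


def stats_ratios(stats):
    """Derive a few health ratios from a GET /_cluster/stats response."""
    heap = stats["nodes"]["jvm"]["mem"]
    fs = stats["nodes"]["fs"]
    docs = stats["indices"]["docs"]
    cache = stats["indices"]["query_cache"]
    return {
        "heap_used": heap["heap_used_in_bytes"] / heap["heap_max_in_bytes"],
        "disk_free": fs["free_in_bytes"] / fs["total_in_bytes"],
        "deleted_docs": docs["deleted"] / (docs["count"] + docs["deleted"]),
        "query_cache_hit": cache["hit_count"] / cache["total_count"],
    }


if __name__ == "__main__":
    for name, value in stats_ratios(SAMPLE_STATS).items():
        print(f"{name}: {value:.2%}")
```

With the posted numbers, about 40% of the heap is in use and about 94% of disk is free, which is consistent with the statistics themselves not pointing to an obvious resource problem.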

DavidTurner · June 11, 2021, 9:19pm

Yes, "AWS Elasticsearch" isn't really Elasticsearch; it has a number of private AWS-specific patches that change its behaviour and performance in ways we cannot see. It's pretty confusing, but at least it's being renamed to AWS OpenSearch, which should make the distinction a bit clearer.

Christian_Dahlqvist · June 12, 2021, 5:43am

As I mentioned in the previous thread, the easiest, and possibly only, way to determine the impact of the AWS changes to their fork and plugins would be to set up a temporary cluster based on the default distribution and test the workload against it. This would give full access to logs and allow us to help you analyze the situation and tune the cluster, as well as let you determine the potential impact of coordinating-only nodes. If you do not want to roll it yourself, you could also set up a cluster on Elastic Cloud and test against that. If your workload is update heavy, you may also benefit from upgrading to the latest version of Elasticsearch, as I believe a number of improvements have been made since the version you are using.

It is a lot of work, but barring this I suspect your best bet is AWS Support, as we do not have enough information or knowledge of the code to help. The OpenDistro community might also be able to help and be aware of potential issues or limitations with the version you are using.

system · July 10, 2021, 5:44am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
