All Nodes Failed Exception

Hi,

We are getting the following error:

Error summary: EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[https://search-seqr-gris-prod-65wdlm6cncfxo5d326vkd4z6be.us-east-1.es.amazonaws.com:443]]

Previously we were able to solve it with the AWS support team (we use AWS EMR) by just increasing 'es.batch.size.entries' from '5000' to '50 000'. This time we tried the same thing and increased it from '50 000' to '500 000', but it didn't work.

To read some more details on what we tried before when we faced the same issue:

Not sure if it's related, but we need to have 'es.nodes.wan.only' set to 'True' to allow HTTPS communication. Supposedly it significantly affects ES performance, so I suspect it may be related here.

I also wonder (although it's a separate question) why Elasticsearch was not designed for HTTPS (this situation suggests it was not). Shouldn't security be a priority?

Could you share the output of GET / from your cluster?

You mean this:

"version" : {
"number" : "6.8.0",
"build_flavor" : "oss",
"build_type" : "zip",
"build_hash" : "build_hash",
"build_date" : "2021-04-21...",
"build_snapshot" : false,
"lucene_version" : "7.7.0",
"minimum_wire_compatibility_version" : "5.6.0",
"minimum_index_compatibility_version" : "5.0.0"
},

?

Yeah, but without the redactions please.

I can't provide hashes and cluster names for security reasons, why do you need this info?

Is this related to an AWS Elasticsearch service cluster? If not, how is the cluster secured?

Yeah, same issue. Last time we ultimately solved it by increasing 'es.batch.size.entries', but we have hit the same issue again now.

Increasing the batch size will at some point start creating more problems than it solves, and it seems like you have passed that point. AWS Elasticsearch service is running a fork with its own plugins for security. I have no experience with these, or with how they behave or use resources under load.
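To illustrate going the other way, here is a minimal sketch of a more conservative write configuration. The keys are standard elasticsearch-hadoop settings, but the values are untested guesses, and `build_es_config` is a hypothetical helper, not part of Hail or elasticsearch-hadoop. As far as I know, a bulk request is flushed as soon as either the entry limit or the byte limit is reached, so raising only the entry count may have little effect.

```python
def build_es_config(batch_entries=1000, batch_bytes="1mb",
                    retry_count=10, retry_wait="30s"):
    """Build a dict of elasticsearch-hadoop settings (all values as strings).

    The defaults here are illustrative assumptions, not tuned recommendations.
    """
    if batch_entries <= 0:
        raise ValueError("batch_entries must be positive")
    return {
        # Needed in this thread to reach the cluster over HTTPS.
        "es.nodes.wan.only": "true",
        # Smaller bulk requests put less pressure on the cluster per request.
        "es.batch.size.entries": str(batch_entries),
        "es.batch.size.bytes": batch_bytes,
        # Retry rejected bulk requests more often, waiting longer in between.
        "es.batch.write.retry.count": str(retry_count),
        "es.batch.write.retry.wait": retry_wait,
    }


config = build_es_config()
```

The resulting dict could presumably be passed as the `config` argument of `hl.export_elasticsearch`, since that is how the settings above reach elasticsearch-hadoop from Hail.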

The build hash and build date aren't security-sensitive, but they are important to being able to offer you help. The cluster name isn't especially important, I'm just trying to determine the specific version that you're running. Assuming it is 6.8.0 the build hash and date should match this:

{
  "name" : "hPjRkGk",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "mJY4-VYjS-6off9dFlScZw",
  "version" : {
    "number" : "6.8.0",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "65b6179",
    "build_date" : "2019-05-15T20:06:13.172855Z",
    "build_snapshot" : false,
    "lucene_version" : "7.7.0",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

Do they? The truncated build date you shared seems wildly wrong. If they don't match then that indicates you're not running a real Elasticsearch build. Maybe you're looking for AWS Opensearch help instead? Opensearch shares history with Elasticsearch but it's quite different these days so we can't really help much with these issues.
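The comparison I have in mind can be sketched as below. The expected values are the ones from the official 6.8.0 response above; `matches_official_build` is a hypothetical helper name, and the sample response simply reuses the redacted values you posted.

```python
# Build hash and date of the official Elasticsearch 6.8.0 release,
# taken from the GET / output quoted above.
OFFICIAL_6_8_0 = {
    "build_hash": "65b6179",
    "build_date": "2019-05-15T20:06:13.172855Z",
}


def matches_official_build(root_response, expected=OFFICIAL_6_8_0):
    """Return True if the `version` block of a GET / response matches `expected`."""
    version = root_response.get("version", {})
    return all(version.get(key) == value for key, value in expected.items())


# The redacted response from earlier in the thread.
reported = {
    "version": {
        "number": "6.8.0",
        "build_hash": "build_hash",
        "build_date": "2021-04-21...",
    }
}
is_official = matches_official_build(reported)
```

A mismatch here only says the build is not the official release; it does not say what was changed in it.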

Elasticsearch comes with HTTPS support out-of-the-box these days, but if you're not actually running Elasticsearch then that doesn't seem relevant.

I see. The hash is '8169a24' and the date is '2021-04-21T19:26:55.782637Z'.

I am not sure what you mean by 'not actually running Elasticsearch'. We have a Hail Python function that does the write operation to the AWS ES cluster:

https://hail.is/docs/0.2/methods/impex.html#hail.methods.export_elasticsearch

Which ultimately comes down to: elasticsearch-hadoop/EsRDDWriter.scala at master · elastic/elasticsearch-hadoop · GitHub

As you are running quite an old version, you are using Elasticsearch, but a forked version (note the different build hash) that AWS runs with its own plugins, which a lot of people hanging out here are not necessarily familiar with.

To know why it is failing it would be useful to see the full output of the cluster stats API and the Elasticsearch logs, but even then it would be hard to tell what impact custom plugins and changes may have.

Here it is:

{
  "_nodes" : {
    "total" : 23,
    "successful" : 23,
    "failed" : 0
  },
  "cluster_name" : "cluster_name",
  "cluster_uuid" : "cluster_uuid",
  "timestamp" : 1623445052654,
  "status" : "green",
  "indices" : {
    "count" : 12,
    "shards" : {
      "total" : 79,
      "primaries" : 57,
      "replication" : 0.38596491228070173,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 21,
          "avg" : 6.583333333333333
        },
        "primaries" : {
          "min" : 1,
          "max" : 21,
          "avg" : 4.75
        },
        "replication" : {
          "min" : 0.0,
          "max" : 19.0,
          "avg" : 1.8333333333333333
        }
      }
    },
    "docs" : {
      "count" : 68183560784,
      "deleted" : 1508157701
    },
    "store" : {
      "size_in_bytes" : 4474843386856
    },
    "fielddata" : {
      "memory_size_in_bytes" : 8745456,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 38650249280,
      "total_count" : 123179758,
      "hit_count" : 303656,
      "miss_count" : 122876102,
      "cache_size" : 13866,
      "cache_count" : 13953,
      "evictions" : 87
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 4554,
      "memory_in_bytes" : 951712008,
      "terms_memory_in_bytes" : 102031744,
      "stored_fields_memory_in_bytes" : 771786760,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 8960,
      "points_memory_in_bytes" : 0,
      "doc_values_memory_in_bytes" : 77884544,
      "index_writer_memory_in_bytes" : 0,
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set_memory_in_bytes" : 521170320,
      "max_unsafe_auto_id_timestamp" : -1,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 23,
      "data" : 20,
      "coordinating_only" : 0,
      "master" : 3,
      "ingest" : 20
    },
    "versions" : [ "6.8.0" ],
    "os" : {
      "available_processors" : 326,
      "allocated_processors" : 326,
      "names" : [ {
        "count" : 23
      } ],
      "pretty_names" : [ {
        "count" : 23
      } ],
      "mem" : {
        "total_in_bytes" : 2588452057088,
        "free_in_bytes" : 219945406464,
        "used_in_bytes" : 2368506650624,
        "free_percent" : 8,
        "used_percent" : 92
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 0
      },
      "open_file_descriptors" : {
        "min" : 1804,
        "max" : 2151,
        "avg" : 2084
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 1090016953,
      "mem" : {
        "heap_used_in_bytes" : 255663998232,
        "heap_max_in_bytes" : 639540002816
      },
      "threads" : 4996
    },
    "fs" : {
      "total_in_bytes" : 74829120610304,
      "free_in_bytes" : 70334216101888,
      "available_in_bytes" : 70333830225920
    },
    "network_types" : {
      "transport_types" : {
        "com.amazon.opendistroforelasticsearch.security.ssl.http.netty.OpenDistroSecuritySSLNettyTransport" : 23
      },
      "http_types" : {
        "filter-jetty" : 23
      }
    }
  }
}

Also full crash log:

hail.utils.java.FatalError: EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[https://search-seqr-gris-prod-65wdlm6cncfxo5d326vkd4z6be.us-east-1.es.amazonaws.com:443]]

Java stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 221 in stage 13.0 failed 3 times, most recent failure: Lost task 221.2 in stage 13.0 (TID 18601, ip-172-23-79-169.ec2.internal, executor 30): is.hail.relocated.org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[https://search-seqr-gris-prod-65wdlm6cncfxo5d326vkd4z6be.us-east-1.es.amazonaws.com:443]]
	at is.hail.relocated.org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:149)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:380)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:364)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:216)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.bulk.BulkProcessor.tryFlush(BulkProcessor.java:185)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.bulk.BulkProcessor.flush(BulkProcessor.java:460)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.bulk.BulkProcessor.add(BulkProcessor.java:106)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.RestRepository.doWriteToIndex(RestRepository.java:187)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.RestRepository.writeToIndex(RestRepository.java:168)
	at is.hail.relocated.org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:67)
	at is.hail.relocated.org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:101)
	at is.hail.relocated.org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:101)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2043)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2031)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2030)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2030)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:967)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:967)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:967)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2264)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2213)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2202)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:778)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
	at is.hail.relocated.org.elasticsearch.spark.sql.EsSparkSQL$.saveToEs(EsSparkSQL.scala:101)
	at is.hail.relocated.org.elasticsearch.spark.sql.EsSparkSQL$.saveToEs(EsSparkSQL.scala:83)
	at is.hail.relocated.org.elasticsearch.spark.sql.package$SparkDataFrameFunctions.saveToEs(package.scala:49)
	at is.hail.io.ElasticsearchConnector$.export(ElasticsearchConnector.scala:44)
	at is.hail.io.ElasticsearchConnector$.export(ElasticsearchConnector.scala:20)
	at is.hail.io.ElasticsearchConnector.export(ElasticsearchConnector.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

is.hail.relocated.org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[https://search-seqr-gris-prod-65wdlm6cncfxo5d326vkd4z6be.us-east-1.es.amazonaws.com:443]]
	at is.hail.relocated.org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:149)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:380)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:364)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:216)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.bulk.BulkProcessor.tryFlush(BulkProcessor.java:185)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.bulk.BulkProcessor.flush(BulkProcessor.java:460)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.bulk.BulkProcessor.add(BulkProcessor.java:106)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.RestRepository.doWriteToIndex(RestRepository.java:187)
	at is.hail.relocated.org.elasticsearch.hadoop.rest.RestRepository.writeToIndex(RestRepository.java:168)
	at is.hail.relocated.org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:67)
	at is.hail.relocated.org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:101)
	at is.hail.relocated.org.elasticsearch.spark.sql.EsSparkSQL$$anonfun$saveToEs$1.apply(EsSparkSQL.scala:101)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

These are, as far as I can tell, specific to AWS Elasticsearch service and could very well be affecting the behaviour you are seeing. When I asked for logs I meant Elasticsearch server logs, not client logs.

Apart from this the statistics look fine as far as I can tell.
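For reference, this is roughly how I read those numbers, as a minimal sketch. `summarize_cluster_stats` is a hypothetical helper, the field names are the ones in the cluster stats output above, and the sample dict copies only the handful of values it needs from that output.

```python
def summarize_cluster_stats(stats):
    """Pull a few health indicators out of a cluster stats response."""
    nodes = stats["nodes"]
    heap = nodes["jvm"]["mem"]
    query_cache = stats["indices"]["query_cache"]
    transports = nodes["network_types"]["transport_types"]
    total_lookups = query_cache["total_count"]
    return {
        "status": stats["status"],
        # Fraction of the total JVM heap currently in use across all nodes.
        "heap_used_ratio": heap["heap_used_in_bytes"] / heap["heap_max_in_bytes"],
        # Fraction of query cache lookups that were hits.
        "query_cache_hit_ratio": (
            query_cache["hit_count"] / total_lookups if total_lookups else 0.0
        ),
        # Transport implementations that come from the Open Distro plugins.
        "opendistro_transports": sorted(
            name for name in transports if "opendistro" in name.lower()
        ),
    }


# Values copied from the cluster stats output posted above.
sample = {
    "status": "green",
    "indices": {"query_cache": {"total_count": 123179758, "hit_count": 303656}},
    "nodes": {
        "jvm": {"mem": {"heap_used_in_bytes": 255663998232,
                        "heap_max_in_bytes": 639540002816}},
        "network_types": {"transport_types": {
            "com.amazon.opendistroforelasticsearch.security.ssl.http.netty."
            "OpenDistroSecuritySSLNettyTransport": 23
        }},
    },
}
summary = summarize_cluster_stats(sample)
```

On these numbers the heap is well under half used and the cluster is green, which is why nothing stands out apart from the non-default transport.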

Yes, "AWS Elasticsearch" isn't really Elasticsearch; it has a number of private AWS-specific patches that change its behaviour and performance in ways we cannot see. It's pretty confusing, but at least it's being renamed to AWS Opensearch, which should make the distinction a bit clearer.


As I mentioned in the previous thread, the easiest, and possibly only, way to determine the impact of the AWS changes to their fork and plugins would be to set up a temporary cluster based on the default distribution and test the workload against this. This would give full access to logs and allow us to help you analyze the situation and tune the cluster, as well as let you determine the potential impact of coordinating-only nodes. If you do not want to roll it yourself, you could also set up a cluster on Elastic Cloud and test against that. If your workload is update heavy, you may also benefit from upgrading to the latest version of Elasticsearch, as I believe a number of improvements have been made since the version you are using.

It is a lot of work, but barring this I suspect your best bet is AWS Support, as we do not have enough information or knowledge of the code to help. The OpenDistro community might also be able to help and may be aware of potential issues or limitations with the version you are using.
