NEST client not retrying with other nodes when the master node abruptly crashes

We're running a three-node Elasticsearch 6.5.4 cluster, which normally works fine.
All nodes are master-eligible.
The clients use NEST for communication.

But when the node that was master at the time died because of a power outage, all clients that were in the middle of a query got this exception.

Elasticsearch.Net.ElasticsearchClientException: Failed to ping the specified node.. Call: Status code unknown from: HEAD / ---> Elasticsearch.Net.PipelineException: Failed to ping the specified node. ---> Elasticsearch.Net.PipelineException: An error occurred trying to read the response from the specified node. ---> System.Net.WebException: The request was aborted: The request was canceled.
at System.Net.HttpWebRequest.EndGetResponse(IAsyncResult asyncResult)

Normally the client retries the query on the other live nodes, but not in this case.
We have set up a test environment in Azure and can reproduce the problem by stopping the VM without a shutdown (hard stop). If we shut the node down gracefully it works: the client automatically connects to another node without an exception.

Is this expected behavior, meaning we need our own retry logic, or have we configured something wrong?
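For reference, this is roughly how the clients are wired up; a minimal sketch with placeholder URIs (the nodes' addresses with the default HTTP port 9200 assumed) and illustrative timeout/retry values, not our exact configuration:

using System;
using Elasticsearch.Net;
using Nest;

// Minimal sketch of a NEST 6.x client that is seeded with all three nodes.
// The URIs assume the default HTTP port 9200; the timeout and retry values
// below are illustrative placeholders, not our exact settings.
var nodes = new[]
{
    new Uri("http://10.1.0.4:9200"),
    new Uri("http://10.1.0.5:9200"),
    new Uri("http://10.1.0.6:9200")
};

var settings = new ConnectionSettings(new StaticConnectionPool(nodes))
    .PingTimeout(TimeSpan.FromSeconds(2))       // the ping is the HEAD / call seen in the exception above
    .RequestTimeout(TimeSpan.FromSeconds(30))   // per-request timeout
    .MaximumRetries(2)                          // maximum number of retries on other nodes for a failed call
    .MaxRetryTimeout(TimeSpan.FromSeconds(60)); // overall budget for the request plus its retries

var client = new ElasticClient(settings);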

Is the client configured with the IP addresses of all nodes in the cluster?

Have you got discovery.zen.minimum_master_nodes set to 2 in the cluster?

If you look at the logs from all the nodes in the cluster, how long does it take for a new master to get elected?

I would also recommend upgrading as version 6.5.4 is quite old.

discovery.zen.minimum_master_nodes is set to 2, and the clients know all of the nodes in the cluster.
We have tried both a static and a sticky connection pool, with no difference.
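For clarity, the only difference between the two attempts is the pool type; a sketch with the same placeholder URIs as in the first post:

using System;
using Elasticsearch.Net;
using Nest;

// Same three seed nodes, different pool implementation:
// StaticConnectionPool round-robins over the nodes it considers alive,
// StickyConnectionPool keeps using one node until it is marked dead.
var nodes = new[]
{
    new Uri("http://10.1.0.4:9200"),
    new Uri("http://10.1.0.5:9200"),
    new Uri("http://10.1.0.6:9200")
};

var client = new ElasticClient(new ConnectionSettings(new StickyConnectionPool(nodes)));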

This is a production environment, and I don't think the NEST client for 6.5 is compatible with newer server versions.

I've checked the logs, and even though a new master is elected quickly, the cluster state is not published right away.
It timed out waiting for node 2, which should be online.

I must have missed lowering a timeout; I will change that and try again.

[2021-05-11T12:49:00,090][WARN ][o.e.d.z.ZenDiscovery ] [xxx-node-3] master left (reason = transport disconnected), current nodes: nodes:
{xxx-node-1}{1wlMMUtnR_e42BDbGl-aqw}{S66VxvgSTbee4w-EG50pkA}{10.1.0.4}{10.1.0.4:9300}{xpack.installed=true}, master
{xxx-node-3}{INMKjHdhTJSZroYzjWo1Sw}{Fio4CJ4DQcWLPd5fDaTQ2g}{10.1.0.6}{10.1.0.6:9300}{xpack.installed=true}, local
{xxx-node-2}{LyQTUunOSbOFNURM3RcmCw}{85NzxYx0RBKZTLagRGdSTg}{10.1.0.5}{10.1.0.5:9300}{xpack.installed=true}
[2021-05-11T12:49:03,267][INFO ][o.e.c.s.MasterService ] [xxx-node-3] zen-disco-elected-as-master ([1] nodes joined)[{xxx-node-2}{LyQTUunOSbOFNURM3RcmCw}{85NzxYx0RBKZTLagRGdSTg}{10.1.0.5}{10.1.0.5:9300}{xpack.installed=true}], reason: new_master {xxx-node-3}{INMKjHdhTJSZroYzjWo1Sw}{Fio4CJ4DQcWLPd5fDaTQ2g}{10.1.0.6}{10.1.0.6:9300}{xpack.installed=true}
[2021-05-11T12:49:21,119][WARN ][o.e.c.NodeConnectionsService] [xxx-node-3] failed to connect to node {xxx-node-1}{1wlMMUtnR_e42BDbGl-aqw}{S66VxvgSTbee4w-EG50pkA}{10.1.0.4}{10.1.0.4:9300}{xpack.installed=true} (tried [1] times)
[2021-05-11T12:49:33,287][WARN ][o.e.d.z.PublishClusterStateAction] [xxx-node-3] timed out waiting for all nodes to process published state [98] (timeout [30s], pending nodes: [{xxx-node-2}{LyQTUunOSbOFNURM3RcmCw}{85NzxYx0RBKZTLagRGdSTg}{10.1.0.5}{10.1.0.5:9300}{xpack.installed=true}])
[2021-05-11T12:49:33,287][WARN ][o.e.d.z.PublishClusterStateAction] [xxx-node-3] publishing cluster state with version [98] failed for the following nodes: [[{xxx-node-1}{1wlMMUtnR_e42BDbGl-aqw}{S66VxvgSTbee4w-EG50pkA}{10.1.0.4}{10.1.0.4:9300}{xpack.installed=true}]]

What is the full output of the cluster stats API? Publishing the cluster state across a 3-node cluster should not time out, so that looks concerning.

I will check. In the meantime, here are the logs from node 2, which I think is delaying the cluster state.
It looks like some replica work is holding node 2 up?

What settings are recommended for an Elasticsearch cluster on a LAN to get a faster master election?
I noticed I had missed lowering "transport.tcp.connect_timeout", but I will try that now and see if it helps.
I will get back with the cluster stats as well.

[2021-05-11T12:49:00,234][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [xxx-node-2] failed to execute on node [1wlMMUtnR_e42BDbGl-aqw]
org.elasticsearch.transport.NodeDisconnectedException: [xxx-node-1][10.1.0.4:9300][cluster:monitor/nodes/stats[n]] disconnected
[2021-05-11T12:49:00,234][WARN ][o.e.d.z.ZenDiscovery ] [xxx-node-2] master left (reason = transport disconnected), current nodes: nodes:
{xxx-node-3}{INMKjHdhTJSZroYzjWo1Sw}{Fio4CJ4DQcWLPd5fDaTQ2g}{10.1.0.6}{10.1.0.6:9300}{xpack.installed=true}
{xxx-node-1}{1wlMMUtnR_e42BDbGl-aqw}{S66VxvgSTbee4w-EG50pkA}{10.1.0.4}{10.1.0.4:9300}{xpack.installed=true}, master
{xxx-node-2}{LyQTUunOSbOFNURM3RcmCw}{85NzxYx0RBKZTLagRGdSTg}{10.1.0.5}{10.1.0.5:9300}{xpack.installed=true}, local

[2021-05-11T12:49:18,945][WARN ][o.e.a.b.TransportShardBulkAction] [xxx-node-2] [[xxx][0]] failed to perform indices:data/write/bulk[s] on replica [xxx][0], node[1wlMMUtnR_e42BDbGl-aqw], [R], s[STARTED], a[id=JbeKZ1qRR0-ycfjXuwUx2Q]
org.elasticsearch.transport.NodeNotConnectedException: [xxx-node-1][10.1.0.4:9300] Node not connected

[2021-05-11T12:49:21,259][WARN ][o.e.c.NodeConnectionsService] [xxx-node-2] failed to connect to node {xxx-node-1}{1wlMMUtnR_e42BDbGl-aqw}{S66VxvgSTbee4w-EG50pkA}{10.1.0.4}{10.1.0.4:9300}{xpack.installed=true} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [xxx-node-1][10.1.0.4:9300] connect_exception

[2021-05-11T12:49:21,259][INFO ][o.e.c.s.ClusterApplierService] [xxx-node-2] detected_master {xxx-node-3}{INMKjHdhTJSZroYzjWo1Sw}{Fio4CJ4DQcWLPd5fDaTQ2g}{10.1.0.6}{10.1.0.6:9300}{xpack.installed=true}, reason: apply cluster state (from master [master {xxx-node-3}{INMKjHdhTJSZroYzjWo1Sw}{Fio4CJ4DQcWLPd5fDaTQ2g}{10.1.0.6}{10.1.0.6:9300}{xpack.installed=true} committed version [98]])

[2021-05-11T12:50:03,297][WARN ][o.e.c.s.ClusterApplierService] [xxx-node-2] cluster state applier task [apply cluster state (from master [master {xxx-node-3}{INMKjHdhTJSZroYzjWo1Sw}{Fio4CJ4DQcWLPd5fDaTQ2g}{10.1.0.6}{10.1.0.6:9300}{xpack.installed=true} committed version [98]])] took [42s] above the warn threshold of 30s

[2021-05-11T12:50:03,297][INFO ][o.e.c.s.ClusterApplierService] [xxx-node-2] removed {{xxx-node-1}{1wlMMUtnR_e42BDbGl-aqw}{S66VxvgSTbee4w-EG50pkA}{10.1.0.4}{10.1.0.4:9300}{xpack.installed=true},}, reason: apply cluster state (from master [master {xxx-node-3}{INMKjHdhTJSZroYzjWo1Sw}{Fio4CJ4DQcWLPd5fDaTQ2g}{10.1.0.6}{10.1.0.6:9300}{xpack.installed=true} committed version [99]])

[2021-05-11T12:50:24,422][WARN ][o.e.t.TransportService ] [xxx-node-2] Received response for a request that has timed out, sent [80011ms] ago, timed out [50111ms] ago, action [internal:discovery/zen/fd/master_ping], node [{xxx-node-3}{INMKjHdhTJSZroYzjWo1Sw}{Fio4CJ4DQcWLPd5fDaTQ2g}{10.1.0.6}{10.1.0.6:9300}{xpack.installed=true}], id [346384]

Cluster stats.

{
"_nodes" : {
  "total" : 3,
  "successful" : 3,
  "failed" : 0
},
"cluster_name" : "xxx-test",
"cluster_uuid" : "SAAkqz4aRByAaleUtNg6sQ",
"timestamp" : 1621851614216,
"status" : "green",
"indices" : {
  "count" : 6,
  "shards" : {
    "total" : 12,
    "primaries" : 6,
    "replication" : 1.0,
    "index" : {
      "shards" : {
        "min" : 2,
        "max" : 2,
        "avg" : 2.0
      },
      "primaries" : {
        "min" : 1,
        "max" : 1,
        "avg" : 1.0
      },
      "replication" : {
        "min" : 1.0,
        "max" : 1.0,
        "avg" : 1.0
      }
    }
  },
  "docs" : {
    "count" : 69296,
    "deleted" : 10
  },
  "store" : {
    "size" : "24.9mb",
    "size_in_bytes" : 26132454
  },
  "fielddata" : {
    "memory_size" : "0b",
    "memory_size_in_bytes" : 0,
    "evictions" : 0
  },
  "query_cache" : {
    "memory_size" : "0b",
    "memory_size_in_bytes" : 0,
    "total_count" : 0,
    "hit_count" : 0,
    "miss_count" : 0,
    "cache_size" : 0,
    "cache_count" : 0,
    "evictions" : 0
  },
  "completion" : {
    "size" : "0b",
    "size_in_bytes" : 0
  },
  "segments" : {
    "count" : 31,
    "memory" : "233.7kb",
    "memory_in_bytes" : 239358,
    "terms_memory" : "145.2kb",
    "terms_memory_in_bytes" : 148762,
    "stored_fields_memory" : "17.1kb",
    "stored_fields_memory_in_bytes" : 17584,
    "term_vectors_memory" : "0b",
    "term_vectors_memory_in_bytes" : 0,
    "norms_memory" : "1.3kb",
    "norms_memory_in_bytes" : 1408,
    "points_memory" : "3.9kb",
    "points_memory_in_bytes" : 4032,
    "doc_values_memory" : "65.9kb",
    "doc_values_memory_in_bytes" : 67572,
    "index_writer_memory" : "0b",
    "index_writer_memory_in_bytes" : 0,
    "version_map_memory" : "0b",
    "version_map_memory_in_bytes" : 0,
    "fixed_bit_set" : "17.3kb",
    "fixed_bit_set_memory_in_bytes" : 17776,
    "max_unsafe_auto_id_timestamp" : -1,
    "file_sizes" : { }
  }
},
"nodes" : {
  "count" : {
    "total" : 3,
    "data" : 2,
    "coordinating_only" : 0,
    "master" : 3,
    "ingest" : 2
  },
  "versions" : [
    "6.5.4"
  ],
  "os" : {
    "available_processors" : 6,
    "allocated_processors" : 6,
    "names" : [
      {
        "name" : "Windows Server 2016",
        "count" : 3
      }
    ],
    "mem" : {
      "total" : "11.9gb",
      "total_in_bytes" : 12883488768,
      "free" : "4.6gb",
      "free_in_bytes" : 4997185536,
      "used" : "7.3gb",
      "used_in_bytes" : 7886303232,
      "free_percent" : 39,
      "used_percent" : 61
    }
  },
  "process" : {
    "cpu" : {
      "percent" : 0
    },
    "open_file_descriptors" : {
      "min" : -1,
      "max" : -1,
      "avg" : 0
    }
  },
  "jvm" : {
    "max_uptime" : "17.7m",
    "max_uptime_in_millis" : 1065429,
    "versions" : [
      {
        "version" : "11.0.2",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "11.0.2+9",
        "vm_vendor" : "Oracle Corporation",
        "count" : 3
      }
    ],
    "mem" : {
      "heap_used" : "553mb",
      "heap_used_in_bytes" : 579921040,
      "heap_max" : "2.2gb",
      "heap_max_in_bytes" : 2363621376
    },
    "threads" : 108
  },
  "fs" : {
    "total" : "379.5gb",
    "total_in_bytes" : 407519612928,
    "free" : "335.9gb",
    "free_in_bytes" : 360718336000,
    "available" : "335.9gb",
    "available_in_bytes" : 360718336000
  },
  "plugins" : [
    {
      "name" : "search-guard-6",
      "version" : "6.5.4-24.0",
      "elasticsearch_version" : "6.5.4",
      "java_version" : "1.8",
      "description" : "Provide access control related features for Elasticsearch 6",
      "classname" : "com.floragunn.searchguard.SearchGuardPlugin",
      "extended_plugins" : [ ],
      "has_native_controller" : false
    },
    {
      "name" : "analysis-icu",
      "version" : "6.5.4",
      "elasticsearch_version" : "6.5.4",
      "java_version" : "1.8",
      "description" : "The ICU Analysis plugin integrates Lucene ICU module into elasticsearch, adding ICU relates analysis components.",
      "classname" : "org.elasticsearch.plugin.analysis.icu.AnalysisICUPlugin",
      "extended_plugins" : [ ],
      "has_native_controller" : false
    }
  ],
  "network_types" : {
    "transport_types" : {
      "com.floragunn.searchguard.ssl.http.netty.SearchGuardSSLNettyTransport" : 3
    },
    "http_types" : {
      "com.floragunn.searchguard.http.SearchGuardHttpServerTransport" : 3
    }
  }
}

When I changed to "transport.tcp.connect_timeout: 1s", the master election completed and the cluster state was published within 6 seconds of the master node being terminated.

The client still gets:
Elasticsearch.Net.ElasticsearchClientException: Failed to ping the specified node.. Call: Status code unknown from: HEAD / ---> Elasticsearch.Net.PipelineException: Failed to ping the specified node. ---> Elasticsearch.Net.PipelineException: An error occurred trying to read the response from the specified node. ---> System.Net.WebException: The request was aborted: The request was canceled.

I think there might be something missing in NEST's error handling.
If we shut one node down gracefully, we get no error in the client.
When we then shut down a second node, we get an error as expected, because there is no longer a master.
But in that case the error is:

Elasticsearch.Net.ElasticsearchClientException: Maximum number of retries reached, failed over to all the known alive nodes before failing. Call: Status code unknown from: POST /_bulk ---> System.Net.WebException: The request was aborted: The request was canceled.

To me it looks like "Elasticsearch.Net.PipelineException" does not trigger a retry the way I believe it should.
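To see whether a failover is even attempted, I'm now dumping the audit trail that Elasticsearch.Net keeps for each call. A minimal sketch; the seed node and the search request are placeholders, and the catch branch assumes ThrowExceptions() is enabled on the connection settings:

using System;
using Elasticsearch.Net;
using Nest;

// Sketch: inspect the audit trail for a call to see whether the client only
// pinged the dead node or actually failed over to the remaining live nodes.
var client = new ElasticClient(new ConnectionSettings(
    new StaticConnectionPool(new[] { new Uri("http://10.1.0.4:9200") }))); // placeholder seed node

try
{
    var response = client.Search<object>(s => s.AllIndices().Query(q => q.MatchAll()));

    foreach (var audit in response.ApiCall.AuditTrail)
        Console.WriteLine($"{audit.Event} on {audit.Node?.Uri}: {audit.Exception?.Message}");
}
catch (ElasticsearchClientException ex)
{
    // Only reached when ThrowExceptions() is set; DebugInformation contains
    // the same audit trail plus the original inner exception.
    Console.WriteLine(ex.DebugInformation);
}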

You have very small heaps configured, which could be causing problems; I would recommend increasing them a bit. What type of hardware is the cluster deployed on? What type of storage are you using?

I have also never used Search Guard, so I have no idea what impact it has on resource usage.

This is our small test environment, which runs on small Azure VMs:
Standard B2s (2 vCPUs, 4 GiB memory).

The production environment runs three nodes on larger local VMs with 8 GB of heap reserved, and that environment has the same problem.
We can't do test runs there right now.

The disks are Azure Premium SSDs.

We don't have that much data in Elasticsearch, so I don't think the heap should run out, and I would expect it to throw if it did?

I've just built an Elasticsearch.Net.dll with extra logging around the connection pool and request handling, to see whether the error is treated as recoverable and retried, or whether the call just fails directly.
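For comparison, roughly the same visibility should be possible without a custom build, via DisableDirectStreaming and OnRequestCompleted on the connection settings; a sketch with a placeholder seed node:

using System;
using System.Linq;
using Elasticsearch.Net;
using Nest;

// Sketch: per-call diagnostics without rebuilding Elasticsearch.Net.
// DisableDirectStreaming keeps the request/response bodies so DebugInformation
// is complete; OnRequestCompleted fires for every call, successful or not,
// and exposes the audit trail (pings, retries, dead nodes).
var pool = new StaticConnectionPool(new[] { new Uri("http://10.1.0.4:9200") }); // placeholder seed node

var settings = new ConnectionSettings(pool)
    .DisableDirectStreaming()
    .OnRequestCompleted(callDetails =>
    {
        Console.WriteLine(callDetails.DebugInformation);
        foreach (var audit in callDetails.AuditTrail ?? Enumerable.Empty<Audit>())
            Console.WriteLine($"  {audit.Event} on {audit.Node?.Uri}");
    });

var client = new ElasticClient(settings);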

I can also try without Search Guard and see if the problem still exists.
