Cluster Becomes Unresponsive for 90 Sec After Data Node Leaves

Michael_Sander · February 3, 2017, 5:16am

Hi, I have an ES 5.2 cluster with 3 master nodes and a number of data nodes. I have Kibana/X-Pack monitoring installed, but right now there are no indices or real data on this cluster (I'm just testing). I have been experimenting with adding and removing data nodes to see how the cluster reacts.

The problem is that when I remove a data node, the entire cluster often becomes unresponsive for about 90 seconds; I've seen it take as long as a minute and 40 seconds. After that, things go back to normal. Is this normal behavior? If not, any insight into how this could occur (perhaps some timeout settings)? The problem is intermittent: sometimes the cluster recovers immediately.

More details: I am running the ES instances in Docker containers within a Kubernetes cluster. Each node runs only a single Kubernetes Pod, and each Kubernetes Pod has only a single Docker container. This is all running on Google Cloud Platform. My elasticsearch.yml is below.

cluster:
  name: ${CLUSTER_NAME}
node:
  name: ${HOSTNAME}
  # Set to true/false depending on Dockerfile
  master: ${NODE_MASTER}
  data: ${NODE_DATA}
network.host:
  - _local_
  - _site_
path:
  data: /data/data
  logs: /data/log
http:
  enabled: ${HTTP_ENABLE}
  compression: true
  cors:
    enabled: false
cloud:
  kubernetes:
    service: elasticsearch-discovery
    namespace: es-da-cluster
discovery:
  type: kubernetes
  zen:
    minimum_master_nodes: ${NUMBER_MIN_MASTERS}
xpack:
  monitoring:
    enabled: true
  security:
    enabled: false
  graph:
    enabled: false
  watcher:
    enabled: false

My cluster logs from when this occurs are on pastebin here. I ran tail -f on these logs while it was happening. While the cluster was unresponsive, there was no log output. New log lines only appeared after the cluster came back to life roughly 90 seconds later.

And to further clarify, by "unresponsive" I mean that any call I make just hangs. Even when I have SSHed into a master node, a plain curl localhost:9200 hangs for 90 seconds.

Michael_Sander · February 3, 2017, 7:30am

I figured it out myself. The Kubernetes discovery mechanism uses the zen unicast system under the hood. With the default fault detection settings, it takes 90 seconds to declare a node gone (a 30-second ping timeout, retried 3 times). The fix is to change the settings under discovery.zen.fd. I'm not sure I understand the reasoning behind the defaults, though.
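
For anyone landing here later, this is roughly what the change looks like in elasticsearch.yml. Treat it as a sketch: the three setting names are the zen fault detection settings in ES 5.x, but the 5s timeout is only an illustrative value I am assuming for the example, not a recommendation. With these numbers a departed node should be detected after roughly 5s × 3 = 15 seconds instead of 30s × 3 = 90 seconds.

```yaml
# Zen fault detection settings (ES 5.x).
# Values are illustrative assumptions; tune them for your own network.
discovery:
  zen:
    fd:
      ping_interval: 1s   # how often each node is pinged (default 1s)
      ping_timeout: 5s    # how long to wait for a ping response (default 30s)
      ping_retries: 3     # failed pings before the node is considered gone (default 3)
```

Setting the timeout too low may cause nodes to be dropped during brief network hiccups or long GC pauses, so it is worth testing under load before settling on a value.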

