Cluster Becomes Unresponsive for 90 Sec After Data Node Leaves

Hi, I have an ES 5.2 cluster with 3 master nodes and a number of data nodes. I have Kibana/X-Pack monitoring installed, but right now there are no indices or real data on this cluster (I'm just testing). I have been experimenting with adding and removing data nodes to see how the cluster reacts.

The problem is that when I remove a data node, the entire cluster often becomes unresponsive for about 90 seconds; I've also seen it take a minute and 40 seconds. After that, things go back to normal. Is this normal behavior? If not, any insight into how this could occur (perhaps some timeout settings)? It's intermittent; sometimes the cluster recovers immediately.

More details: I am running the ES instances in Docker containers inside a Kubernetes cluster. Each node hosts only a single Kubernetes Pod, and each Pod runs only a single Docker container. This is all running on Google Cloud Platform. My elasticsearch.yml is below.

cluster:
  name: ${CLUSTER_NAME}
node:
  name: ${HOSTNAME}
  # Set to true/false depending on Dockerfile
  master: ${NODE_MASTER}
  data: ${NODE_DATA}
network.host:
  - _local_
  - _site_
path:
  data: /data/data
  logs: /data/log
http:
  enabled: ${HTTP_ENABLE}
  compression: true
  cors:
    enabled: false
cloud:
  kubernetes:
    service: elasticsearch-discovery
    namespace: es-da-cluster
discovery:
  type: kubernetes
  zen:
    minimum_master_nodes: ${NUMBER_MIN_MASTERS}
xpack:
  monitoring:
    enabled: true
  security:
    enabled: false
  graph:
    enabled: false
  watcher:
    enabled: false

My cluster logs from when this occurs are on Pastebin here. I did a tail -f on these logs while it was happening: while the cluster was unresponsive there was no log output, and the logs were only appended to after the cluster came back to life roughly 90 seconds later.

And to further clarify, by "unresponsive" I mean that any call just hangs. Even after SSH-ing into a master node, a plain curl localhost:9200 will hang for 90 seconds.

I figured it out myself. The Kubernetes discovery mechanism uses the zen unicast system under the hood, and the default fault detection takes 90 seconds to decide a node has left (a 30-second ping timeout, retried 3 times). I just needed to change the settings under discovery.zen.fd. I'm not sure I fully understand why the defaults are set that high, though.
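For anyone else who runs into this, these are the zen fault-detection settings involved. The values below are only an illustration of shrinking the detection window, not a recommendation; overly aggressive values can cause nodes to be falsely marked as failed on a slow or congested network:

discovery:
  zen:
    fd:
      ping_interval: 1s    # how often nodes ping each other (default 1s)
      ping_timeout: 10s    # how long to wait for a ping response (default 30s)
      ping_retries: 3      # failed pings before declaring a node dead (default 3)

With a 10-second timeout and 3 retries, a departed node would be detected in roughly 30 seconds instead of the default 90.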
