Hi, I have an ES 5.2 cluster with 3 master nodes and a number of data nodes. I have Kibana and X-Pack monitoring installed, but right now there are no indices or real data on this cluster (I'm just testing). I have been experimenting with adding and removing data nodes to see how the cluster reacts.
The problem is that when I remove a data node, the entire cluster often becomes unresponsive for about 90 seconds; I've seen it take as long as a minute and 40 seconds. After that, things go back to normal. Is this normal behavior? If not, can anyone offer insight into how this could happen (perhaps some timeout settings)? The behavior is intermittent; sometimes the cluster recovers immediately.
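For what it's worth, I haven't set any of the zen fault-detection settings in my elasticsearch.yml (shown further down), so I assume I'm running on the defaults. If I understand the docs correctly, the settings in that area would look roughly like this (the values are what I believe the 5.x defaults to be):

discovery.zen.fd.ping_interval: 1s   # how often nodes ping each other for fault detection
discovery.zen.fd.ping_timeout: 30s   # how long to wait for each ping response
discovery.zen.fd.ping_retries: 3     # failed pings before a node is considered disconnected

I notice that 3 retries at 30 seconds each would add up to roughly the 90 seconds I'm seeing, but I don't know whether that's actually the mechanism here or just a coincidence.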
More details: I am running the ES instances inside Docker containers within a Kubernetes cluster. Each node runs a single Kubernetes Pod, and each Pod runs a single Docker container. This is all running on Google Cloud Platform. My elasticsearch.yml is below.
cluster:
  name: ${CLUSTER_NAME}
node:
  name: ${HOSTNAME}
  # Set to true/false depending on Dockerfile
  master: ${NODE_MASTER}
  data: ${NODE_DATA}
network.host:
  - _local_
  - _site_
path:
  data: /data/data
  logs: /data/log
http:
  enabled: ${HTTP_ENABLE}
  compression: true
  cors:
    enabled: false
cloud:
  kubernetes:
    service: elasticsearch-discovery
    namespace: es-da-cluster
discovery:
  type: kubernetes
  zen:
    minimum_master_nodes: ${NUMBER_MIN_MASTERS}
xpack:
  monitoring:
    enabled: true
  security:
    enabled: false
  graph:
    enabled: false
  watcher:
    enabled: false
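For context on the Kubernetes side, each data node Pod is defined along these lines (a simplified sketch, not my exact manifest; the name, image, and replica count are placeholders, but the env vars are the ones the config above reads, and ${HOSTNAME} comes from the container environment):

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: es-data                    # placeholder name
  namespace: es-da-cluster
spec:
  replicas: 3                      # scaled up/down when adding/removing data nodes
  template:
    metadata:
      labels:
        component: elasticsearch
        role: data
    spec:
      containers:
      - name: es-data
        image: my-registry/es:5.2  # placeholder image
        env:
        - name: CLUSTER_NAME
          value: es-da-cluster
        - name: NODE_MASTER
          value: "false"
        - name: NODE_DATA
          value: "true"
        - name: HTTP_ENABLE
          value: "true"
        - name: NUMBER_MIN_MASTERS
          value: "2"
        ports:
        - containerPort: 9300      # transport
        - containerPort: 9200      # http
        volumeMounts:
        - name: storage
          mountPath: /data
      volumes:
      - name: storage
        emptyDir: {}               # placeholder volume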
My cluster logs from when this occurs are on Pastebin here. I did a tail -f
on these logs while it was happening. While the cluster was unresponsive, there was no log output; the logs were only appended to after the cluster came back to life roughly 90 seconds later.
And to further clarify, by "unresponsive" I mean that any call I make just hangs. Even when I have SSH'd into a master node, a plain curl localhost:9200
will hang for 90 seconds.