Problem restarting cluster on 6.6.1

Elasticsearch 6.6.1

I have a 3-node cluster: 2 master+data nodes and 1 data-only node. The setup has been running fine for a few months, but problems start whenever I have to restart the machines for any reason.

The problem is that I have to restart the machines a seemingly random number of times to get past the "master not discovered" exception. Today the situation was even stranger: cluster health reported 2 nodes, then 3, then 2, then 3, and so on. My configuration has not changed since the beginning. Today I spent an hour restarting the service over and over until it eventually fixed itself, and I now see a recovering cluster with 3 nodes.
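For reference, this is roughly how I am watching the node count, assuming the default HTTP port 9200 (the elastic user and PASSWORD below are placeholders, since x-pack security is enabled):

# poll cluster health every 2 seconds; number_of_nodes keeps flapping between 2 and 3
watch -n 2 'curl -s -u elastic:PASSWORD "http://esdata-0:9200/_cluster/health?pretty"'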

This is my configuration. I'm also open to suggestions for the VM specs of these machines. Each node is a 2-core, 8 GB machine with the JVM heap set to 4 g, with plans to scale them up as needed.
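A quick way to confirm the heap each node actually got (same placeholder credentials as above):

curl -s -u elastic:PASSWORD "http://esdata-0:9200/_cat/nodes?v&h=name,heap.max,ram.max"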

esdata-0
cluster.name: "analytics-cluster"
node.name: "esdata-0"
path.logs: /var/log/elasticsearch
path.data: /datadisks/disk1/elasticsearch/data
discovery.zen.ping.unicast.hosts: ["esdata-0:9300","esdata-1:9300","esdata-2:9300"]
node.master: true
node.data: true
discovery.zen.minimum_master_nodes: 2
network.host: [site, local]
node.max_local_storage_nodes: 1
node.attr.fault_domain: 1
node.attr.update_domain: 1
cluster.routing.allocation.awareness.attributes: fault_domain,update_domain
xpack.license.self_generated.type: trial
xpack.security.enabled: true
bootstrap.memory_lock: true
thread_pool.index.queue_size: 1000
thread_pool.write.queue_size: 1000

esdata-1
cluster.name: "analytics-cluster"
node.name: "esdata-1"
path.logs: /var/log/elasticsearch
path.data: /datadisks/disk1/elasticsearch/data
discovery.zen.ping.unicast.hosts: ["esdata-0:9300","esdata-1:9300","esdata-2:9300"]
node.master: true
node.data: true
discovery.zen.minimum_master_nodes: 2
network.host: [site, local]
node.max_local_storage_nodes: 1
node.attr.fault_domain: 0
node.attr.update_domain: 0
cluster.routing.allocation.awareness.attributes: fault_domain,update_domain
xpack.license.self_generated.type: trial
xpack.security.enabled: true
bootstrap.memory_lock: true
thread_pool.index.queue_size: 1000
thread_pool.write.queue_size: 1000

esdata-2
cluster.name: "analytics-cluster"
node.name: "esdata-2"
path.logs: /var/log/elasticsearch
path.data: /datadisks/disk1/elasticsearch/data
discovery.zen.ping.unicast.hosts: ["esdata-0:9300","esdata-1:9300","esdata-2:9300"]
node.master: false
node.data: true
discovery.zen.minimum_master_nodes: 2
network.host: [site, local]
node.max_local_storage_nodes: 1
node.attr.fault_domain: 2
node.attr.update_domain: 2
cluster.routing.allocation.awareness.attributes: fault_domain,update_domain
xpack.license.self_generated.type: trial
xpack.security.enabled: true
bootstrap.memory_lock: true
thread_pool.index.queue_size: 1000
thread_pool.write.queue_size: 1000

Can you share the logs from your failed attempts to get these nodes to form a cluster?

Also, why do you have one of your three nodes set to node.master: false? You need at least three master-eligible nodes to run a fault-tolerant cluster: with only two master-eligible nodes and discovery.zen.minimum_master_nodes: 2, neither of them can be elected master while the other is down or restarting, so the cluster cannot recover until both are back.
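If you want to double-check which nodes are currently master-eligible and which one holds the elected master, something along these lines should show it (standard _cat API; adjust the host, port and credentials for your setup):

curl -s -u elastic:PASSWORD "http://esdata-0:9200/_cat/nodes?v&h=name,node.role,master"

A node.role of mdi means master-eligible + data + ingest, di means data-only, and the * in the master column marks the elected master.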

For some reason I had set one of the nodes as non-master-eligible. I changed that immediately after reading your comment, and today one of our nodes crashed (I suspect out of memory, but that's not the issue here). Now I have the same problem with 3 master-eligible nodes: the status keeps switching between 2/2 and 3/3 nodes, a few seconds apart.
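Next time it flaps I plan to ask each node directly which master it sees, to check whether they agree (same placeholder credentials as above):

curl -s -u elastic:PASSWORD "http://esdata-0:9200/_cat/master?v"
curl -s -u elastic:PASSWORD "http://esdata-1:9200/_cat/master?v"
curl -s -u elastic:PASSWORD "http://esdata-2:9200/_cat/master?v"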

Here is a log file: https://pastebin.com/nuHhEPJi (there was a LOT of log noise from services trying to use the ES API, which I tried to strip out; hopefully I did not exclude anything important).

After a number of restart attempts the cluster came up OK and is working again. I would like to avoid this in the future. Shouldn't the cluster come back online after restarting just the failed node?
