Master not discovered exception happening randomly in self-hosted cluster on AWS

Hello all.

Setup:

A configurable number of data nodes (3 at the moment), one dedicated master node, and a configurable number of ingest nodes (scaled based on load), all of which exist in the same private subnet inside a VPC.

The master node exists in its own AutoScalingGroup with Min: 1, Max: 1, Desired 1.

Kibana exists on a single node in a public subnet and talks to the ES cluster via an internal AWS::ELB.

Configuration:

  • For the data node

network:
  publish_host: "_ec2:privateIpv4_"
  host: "0.0.0.0"

discovery:
  zen:
    ping:
      multicast:
        enabled: false
  type: "ec2"
  host_type: "private_ip"
  ec2:
    tag:
      node: "es-node"

cloud:
  node:
    auto_attributes: true
  aws:
    protocol: "http"
    region: "ap-southeast-2"

  • For the master node

discovery:
  zen:
    ping:
      multicast:
        enabled: false
  type: "ec2"
  host_type: "private_ip"
  ec2:
    tag:
      node: "es-node"

cloud:
  node:
    auto_attributes: true
  aws:
    protocol: "http"
    region: "ap-southeast-2"

network:
  publish_host: "_ec2:privateIpv4_"
  host: "0.0.0.0"

  • For the ingest node

discovery:
  zen:
    ping:
      multicast:
        enabled: false
  type: "ec2"
  host_type: "private_ip"
  ec2:
    tag:
      node: "es-node"

cloud:
  node:
    auto_attributes: true
  aws:
    protocol: "http"
    region: "ap-southeast-2"

network:
  publish_host: "_ec2:privateIpv4_"
  host: "0.0.0.0"

Issue

The master node, randomly and every now and then, goes missing. What I mean by missing is that the ingest nodes at times cannot find the master node and hence cannot add new data at all. The only thing that seems to get them to find the master again is restarting Elasticsearch on the master node, which rebalances the entire cluster and causes me to cry into my coffee.

And yet, after a restart of the master node, all nodes can find the master easily and everything works like clockwork, until it happens again.

And it's not only the ingest nodes that cannot find the master node; at times the Kibana node, via the ELB, also cannot reach the master node at all, causing Kibana to give me all sorts of heart-stopping messages.

Attempts at resolution:

I have searched SO, reddit.com/r/elasticsearch and discuss.elastic.co high and low, but I cannot find anything that resembles a solution, especially since it's intermittent. :frowning:

Any help would be appreciated.

Reward Offered

I will sing "The Hills Are Alive" from The Sound of Music loudly in my office!

Thanks

If you are looking for HA, you should always have 3 master-eligible nodes with minimum_master_nodes set to 2. This allows the cluster to form a majority and elect a new master even if the existing master runs into problems, e.g. long GC pauses or connectivity issues. I would also recommend looking into logs and monitoring data to see if you can identify anything causing the master node to fail.
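
For anyone following along, a rough sketch of what that could look like in elasticsearch.yml on each of the three master-eligible nodes, using the same pre-7.x zen discovery settings shown in the configs above (the node role line is illustrative, not taken from the original configs):

node:
  master: true

discovery:
  zen:
    # quorum of master-eligible nodes: (3 / 2) + 1 = 2, which prevents a split brain
    minimum_master_nodes: 2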

What instance type are you using for master node(s)?

Hey. Thanks for replying.

If you are looking for HA, you should always have 3 master-eligible nodes

When you say three master-eligible nodes, should I have three dedicated master-eligible nodes? Or do you mean also letting a data node serve as a master node?

What instance type are you using for master node(s)?

I am using an m3.large for a master node.

Settings for the master node are as follows:

echo "ES_HEAP_SIZE=4g" >> /etc/default/elasticsearch
echo "ES_JAVA_OPTS=\"-Xms4g -Xmx4g\"" >> /etc/default/elasticsearch
echo "MAX_LOCKED_MEMORY=unlimited" >> /etc/default/elasticsearch
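
(An m3.large has 7.5 GiB of memory, so a 4 GB heap stays within the usual ~50%-of-RAM guideline.) As a quick sanity check, the heap limit the JVM actually picked up can be read back via the _cat nodes API; localhost:9200 assumes HTTP is reachable locally, and the exact column names may vary slightly between versions:

$> curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.max,ram.max'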

That is a good instance type for a master node. For larger clusters we generally recommend having 3 dedicated master nodes, but you are likely to see improved stability even if you make 2 of the data nodes master-eligible, at least as long as you are not overloading them.
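
For illustration, a sketch of the pre-7.x node role settings that distinction maps to (values here are examples only, to be combined with the minimum_master_nodes setting above):

# Dedicated master node: master-eligible, stores no data
node.master: true
node.data: false

# Data node that is also master-eligible
node.master: true
node.data: true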

That is a good instance type for a master node

Woohoo.

we generally recommend having 3 dedicated master nodes

No issues. I am about to set up three dedicated master nodes for the cluster.

Quick question. Do I need to set the config option discovery.zen.minimum_master_nodes on every node of the cluster, or just on the master nodes?

Resolution

OK. Thanks to AWS support, I finally managed to solve this issue, and it was not something I would have thought of. It turns out I needed to know more about ARP caching on Linux. In a subnet, instances coming up and down will eventually end up reusing IPs, and on the master nodes, aggressive ARP caching was in effect. This can be checked via the kernel parameter:

$> sudo sysctl net.ipv4.neigh.default.gc_thresh1
128

By resetting this parameter to 0, the master no longer held on to stale MAC/IP mappings, and hence when a new EC2 instance came up with an old IP, the master simply allowed it to connect.
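
For anyone who wants to see this on their own master, the kernel's neighbour (ARP) cache can be inspected and flushed with plain iproute2 (eth0 is an assumption, substitute the actual interface):

$> ip -4 neigh show dev eth0        # list cached IP-to-MAC entries and their state (REACHABLE, STALE, ...)
$> sudo ip neigh flush dev eth0     # one-off workaround: drop cached entries without changing gc_thresh1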

A proper explanation of this can be found here if anyone is interested.

Steps to disable caching

  • Immediately
$> sudo sysctl -w net.ipv4.neigh.default.gc_thresh1=0
  • To persist across reboots
$> echo 'net.ipv4.neigh.default.gc_thresh1 = 0' | sudo tee /etc/sysctl.d/55-arp-gc_thresh1.conf
  • In case you bake AMIs like I do
$> echo 'net.ipv4.neigh.default.gc_thresh1 = 0' >> /etc/sysctl.conf

Verification:

  • Before the change

To confirm whether traffic was impacted by this behavior, I ended up using tcpdump. On the master, run "sudo tcpdump -nn -e port XXXX" and try to connect from a data/ingest node (a concrete command sketch follows after this list). What we saw was the SYN coming in and the SYN-ACK reply going out with a MAC address in the Ethernet header that did not match the MAC address of the client instance, thereby confirming that the traffic was impacted by ARP caching.

  • Post change

The MAC addresses matched, confirming that the problematic caching was no longer in effect.
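
For completeness, a concrete version of those checks, assuming the default transport port 9300 and interface eth0 (both assumptions, substitute your own values):

# On the master: print Ethernet (MAC) headers for transport traffic, with name resolution disabled
$> sudo tcpdump -nn -e -i eth0 port 9300

# On the client data/ingest node: note its real MAC address for comparison
$> ip link show eth0 | grep ether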
