Master not discovered exception happening randomly in self hosted AWS

Resolution

OK. Thanks to AWS support, I finally managed to solve this issue, and the cause was not something I had considered. It turns out I needed to know more about ARP caching on Linux. In a subnet where instances come up and go down, IPs eventually get reused, and aggressive ARP caching was enabled on the master nodes. This can be checked with the parameter:

$> sudo sysctl net.ipv4.neigh.default.gc_thresh1
128

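By setting this to 0, the master stopped holding on to stale MAC/IP mappings, so when a new EC2 instance came up with an old IP, the master simply allowed it to connect. (Strictly speaking, gc_thresh1 is the minimum number of entries the kernel keeps in the ARP cache; the garbage collector does not run while the cache holds fewer entries than that, so a value of 0 lets stale entries be cleaned up rather than switching caching off entirely.)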

A proper explanation of this can be found here if anyone is interested.
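
To see what the master currently has cached for a given address, the neighbour table can be inspected with `ip neigh show`. The snippet below is only a rough sketch and was not part of the original fix: the function name `cached_mac_for_ip` and the addresses in the comments are made up for illustration. It reads `ip neigh show` output on stdin and prints the MAC recorded for one IP.

```shell
#!/bin/sh
# Hypothetical helper: print the MAC address the neighbour (ARP) table
# holds for a given IP. Reads `ip neigh show` output on stdin, where a
# resolved entry looks like:
#   10.0.1.25 dev eth0 lladdr 0a:1b:2c:3d:4e:5f STALE
cached_mac_for_ip() {
    awk -v ip="$1" '
        $1 == ip {
            # The MAC is the field that follows the "lladdr" keyword.
            for (i = 2; i < NF; i++)
                if ($i == "lladdr") { print $(i + 1); exit }
        }'
}

# Usage on the master (the address is a placeholder):
#   ip neigh show | cached_mac_for_ip 10.0.1.25
```

If the MAC printed here differs from the one the instance currently holding that IP reports, the master is working from a stale entry.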

Steps to disable caching

  • Immediately
$> sudo sysctl -w net.ipv4.neigh.default.gc_thresh1=0
  • To persist across reboots
$> echo 'net.ipv4.neigh.default.gc_thresh1 = 0' | sudo tee /etc/sysctl.d/55-arp-gc_thresh1.conf
  • In case you bake AMIs like I do
$> echo 'net.ipv4.neigh.default.gc_thresh1 = 0' | sudo tee -a /etc/sysctl.conf
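
If the bake script runs more than once, a plain append will keep adding the same line. A minimal sketch of an idempotent variant follows, assuming a POSIX shell and write access to the target file; `persist_gc_thresh1` is a hypothetical name, not an existing tool.

```shell
#!/bin/sh
# Sketch: add the setting to a sysctl config file only if it is not
# already there, so repeated bake runs do not pile up duplicate lines.
# persist_gc_thresh1 is a hypothetical name; pass the file to update.
persist_gc_thresh1() {
    conf_file="$1"
    setting='net.ipv4.neigh.default.gc_thresh1 = 0'
    # -q: quiet, -x: match the whole line, -F: fixed string (no regex)
    if ! grep -qxF "$setting" "$conf_file" 2>/dev/null; then
        printf '%s\n' "$setting" >> "$conf_file"
    fi
}

# Example (run as root, then reload with `sysctl --system`):
#   persist_gc_thresh1 /etc/sysctl.d/55-arp-gc_thresh1.conf
```

This only checks for the exact line; if the file already sets gc_thresh1 to a different value, the new line is appended after it.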

Verification:

  • Before the change

To confirm whether traffic is affected by this behavior, I ended up using tcpdump. On the master, run "sudo tcpdump -nn -e port XXXX" and try to connect from the data/ingest node. You will see the SYN coming in and the SYN-ACK reply going out, each with MAC addresses in the Ethernet header. The destination MAC address of the SYN-ACK reply did not match the MAC address of the client instance, confirming that the traffic was affected by ARP caching.

  • Post change

The MAC addresses matched, confirming that the stale ARP entry was no longer being used.
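
The MAC comparison above can be scripted instead of done by eye. This is a rough sketch, assuming the usual `tcpdump -e` line layout (timestamp, source MAC, `>`, destination MAC followed by a comma); the function names are hypothetical, and the interface name in the usage comment is an assumption that may differ on your instances.

```shell
#!/bin/sh
# Sketch: pull the destination MAC out of each SYN-ACK line that
# `tcpdump -nn -e` prints. Assumed line layout:
#   <time> <src MAC> > <dst MAC>, ethertype IPv4 ... Flags [S.], ...
synack_dst_macs() {
    awk '/Flags \[S\.\]/ { mac = $4; sub(/,$/, "", mac); print mac }'
}

# Exit status 0 only if at least one SYN-ACK was seen and every one
# was addressed to the expected MAC (compared case-insensitively).
all_synacks_match() {
    expected="$1"
    synack_dst_macs | awk -v want="$expected" '
        { n++ }
        tolower($0) != tolower(want) { bad = 1 }
        END { exit ((n > 0 && !bad) ? 0 : 1) }'
}

# Usage (port left as XXXX, as above; eth0 is an assumed interface name):
#   on the client:  cat /sys/class/net/eth0/address
#   on the master:  sudo tcpdump -nn -e -c 10 port XXXX | all_synacks_match <client MAC>
```

A non-zero exit status before the change and a zero exit status after it would correspond to the two observations above.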