Master not discovered exception happening randomly in self hosted AWS

mujtabahussain · September 12, 2017, 10:57pm

Resolution

OK. Thanks to AWS support, I finally managed to solve this issue, and the cause was not something I had considered. It turns out I needed to know more about ARP caching on Linux. In a subnet where instances come up and go down, IPs eventually get reused, and aggressive ARP caching was enabled on the master nodes. The relevant setting can be checked with:

$> sudo sysctl net.ipv4.neigh.default.gc_thresh1
128

By resetting this value to 0, the master stopped holding on to stale MAC/IP mappings, so when a new EC2 instance came up with a previously used IP, the master simply allowed it to connect.
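To see which MAC address the master currently has cached for a given IP, the neighbour table can be queried with `ip neigh`. The snippet below is a minimal sketch, assuming iproute2 is installed; `neigh_mac` is a hypothetical helper name and `10.0.0.5` is a placeholder IP, neither is from the original setup.

```shell
# Hypothetical helper: print the MAC (the field after "lladdr") from
# `ip neigh` output supplied on stdin.
neigh_mac() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "lladdr") print $(i + 1) }'
}

# Usage on the master (10.0.0.5 is a placeholder for the node's IP):
#   ip neigh show to 10.0.0.5 | neigh_mac
```

If the printed MAC belongs to an instance that no longer exists, the master is still using a stale entry for that IP.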

A proper explanation of this can be found here if anyone is interested.

Steps to disable caching

Immediately

$> sudo sysctl -w net.ipv4.neigh.default.gc_thresh1=0

To persist across reboots

$> echo 'net.ipv4.neigh.default.gc_thresh1 = 0' | sudo tee /etc/sysctl.d/55-arp-gc_thresh1.conf

In case you bake AMIs like I do

$> echo 'net.ipv4.neigh.default.gc_thresh1 = 0' | sudo tee -a /etc/sysctl.conf
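After writing the config file, it can be worth checking that the persisted value and the live value agree. This is a rough sketch, not part of the original fix: `sysctl_conf_value` is a hypothetical helper that reads a key from a sysctl-style file, and the usage lines assume a procps `sysctl` that supports `--system`.

```shell
# Hypothetical helper: print the last value assigned to a key in a
# sysctl-style config file. $1 = file, $2 = key.
sysctl_conf_value() {
  awk -F= -v key="$2" '
    { k = $1; gsub(/[ \t]/, "", k) }                       # strip spaces around the key
    k == key { v = $2; gsub(/^[ \t]+|[ \t]+$/, "", v); last = v }
    END { if (last != "") print last }
  ' "$1"
}

# Usage:
#   sudo sysctl --system      # reload all sysctl config files without rebooting
#   sysctl_conf_value /etc/sysctl.d/55-arp-gc_thresh1.conf net.ipv4.neigh.default.gc_thresh1
#   sysctl -n net.ipv4.neigh.default.gc_thresh1             # live value, should match
```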

Verification:

Before the change

To confirm whether traffic was affected by this behavior, I used tcpdump. On the master, run "sudo tcpdump -nn -e port XXXX" and try to connect from the data/ingest node. You should see the SYN come in and the SYN-ACK reply go out, each with MAC addresses in the Ethernet header. The destination MAC address of the SYN-ACK reply did not match the MAC address of the client instance, confirming that the traffic was affected by ARP caching.

Post change

The MAC addresses matched, proving the caching was disabled.
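The comparison itself can be scripted once both MAC addresses are in hand. The following is a small sketch, assuming the client's interface is named `eth0` (yours may differ); `same_mac` is a hypothetical helper, and the MAC taken from tcpdump still has to be copied out of the capture by hand.

```shell
# Hypothetical helper: succeed if two MAC addresses are equal,
# ignoring upper/lower case in the hex digits.
same_mac() {
  a=$(printf '%s' "$1" | tr 'A-F' 'a-f')
  b=$(printf '%s' "$2" | tr 'A-F' 'a-f')
  [ "$a" = "$b" ]
}

# Usage:
#   On the client:  cat /sys/class/net/eth0/address
#   On the master:  take the destination MAC of the SYN-ACK from the tcpdump output
#   Then:           same_mac "<client MAC>" "<MAC from tcpdump>" && echo match || echo mismatch
```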
