Understanding reasons for cluster going to yellow state

I am investigating why our production clusters go from GREEN to YELLOW state. Specifically, I am confused by two kinds of messages logged by the master node:

  1. [es-m01-rm] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{es-d05-rm}{LyFGgvX1SpKWaWEPi4S_aQ}{2NZ6hZh_Q6qR7ytG9AeL1w}{192.168.0.155}{192.168.0.155:9300}{faultDomain=0, updateDomain=4} failed to ping, tried [3] times, each with maximum [30s] timeout]).
  2. [es-m11-rm] Cluster health status changed from [GREEN] to [YELLOW] (reason: [{es-d14-rm}{Fm1TcGXaQge1Ys3ZKHckaw}{D2i_vY7zQX-C3UKYJcvzrw}{30.0.0.164}{30.0.0.164:9300}{faultDomain=0, updateDomain=0} transport disconnected]).

While I understand what the first error means, I'm not sure how to interpret the second. My guess was that every connectivity error would take the form of the first one, where the data node would ping the master node 3 times and mark it as out of the cluster after receiving no response to any of the 3 pings. How is the second error different from the first?

Neither message proves a connectivity problem, although that is one possible cause. The first means the connection between the two nodes is still open but "ping" messages are not receiving responses. That can happen because of packet loss or a network partition, or because the node is running very slowly (e.g. it is under GC pressure). The second means the connection between the two nodes was actively closed. That can happen because the remote node was stopped, but it can also be the work of something that sits between the two nodes (e.g. a firewall).
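If it helps to see how often each kind occurs, here is a minimal sketch that sorts health-change lines into the two categories described above. The regular expression is inferred only from the two sample lines in the question, so other Elasticsearch versions may format these messages differently; the function name `classify` and the labels `ping_timeout` and `disconnected` are made up for illustration.

```python
import re

# Layout inferred from the two sample lines in the question only;
# this is an assumption, not a documented log format.
HEALTH_CHANGE = re.compile(
    r"\[(?P<master>[^\]]+)\] Cluster health status changed from "
    r"\[(?P<old>\w+)\] to \[(?P<new>\w+)\] "
    r"\(reason: \[(?P<reason>.*)\]\)\.?$"
)


def classify(line):
    """Return a dict describing a health-change line, or None if no match."""
    match = HEALTH_CHANGE.search(line.strip())
    if match is None:
        return None

    reason = match.group("reason")

    # The reason starts with the affected node's name in curly braces.
    node_match = re.match(r"\{([^}]+)\}", reason)
    node = node_match.group(1) if node_match else None

    if "failed to ping" in reason:
        # Connection still open, but pings went unanswered.
        kind = "ping_timeout"
    elif "transport disconnected" in reason:
        # Connection was actively closed.
        kind = "disconnected"
    else:
        kind = "other"

    return {"master": match.group("master"), "node": node, "kind": kind}
```

Running `classify` over each line of the master log and counting the `kind` values per `node` shows whether a given data node tends to time out (slow node, packet loss) or to be disconnected (node stopped, or something in between closing the connection).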
