I have been running into many problems trying to upgrade my cluster from 2.0.0 to 2.3.2 (see my earlier thread, "Problems with mixed version cluster during rolling upgrade?").
I finally went ahead and brought up clean new VMs running 2.3.2 and added them to the cluster. Once shard allocation was complete, I pulled out one 2.0.0 node at a time, waiting for the cluster state to return to green before removing the next.
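(For reference, this is roughly the check I ran between removals; the host and port are placeholders for whichever node I happened to query:)

    # Wait until the cluster reports green before pulling the next old node
    curl -s "http://localhost:9200/_cluster/health?wait_for_status=green&timeout=60s&pretty"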
This process went fine until I took out the 2.0.0 node that had been the master. When I took it out, the entire cluster lost communication (i.e. I could not even get stats with Sense). I brought that node back up, and about 30 seconds later the cluster was back. A 2.3.2 node was elected master, so this 2.0.0 node is now in the cluster, but it is not the master and is not hosting any shards. It does not seem to serve any purpose, yet when I take it down, the entire cluster fails.
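(This is roughly how I'm confirming which node is master and that the 2.0.0 node holds no shards; again, the host is a placeholder:)

    # Show which node is currently the elected master
    curl -s "http://localhost:9200/_cat/master?v"

    # Show shard counts per node; the 2.0.0 node reports zero
    curl -s "http://localhost:9200/_cat/allocation?v"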
I tried digging into the logs, and I am finding that the 2.0.0 node is the only one writing any. If I send an intentionally malformed query to the cluster, the "Failed to execute" log entry appears on the 2.0.0 node, but I see nothing on the 2.3.2 nodes. This leads me to think that logging is at the root of this: if I bring down the 2.0.0 node, the cluster fails because it can't write logs. Why would that be? I am using the same logging setup on the new nodes as on all my other nodes. In elasticsearch.yml I have:
path.logs: F:\logs
I run ES as a Windows service, so in the service manager (bin/service.bat manager) I have set the "Log path" to "F:\logs".
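(In case it's relevant, this is roughly how I've been checking what each node actually picked up for path.logs; host is a placeholder:)

    # Dump each node's effective settings; path.logs should show up where it is set
    curl -s "http://localhost:9200/_nodes/settings?pretty"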
Note that on the new nodes I do see log entries such as the cluster.service added/removed messages when other nodes come up or go down.
Thanks!
~john