I hope someone can help, as I have been banging my head against this for a couple of weeks.
I had a four node 2.0.0 cluster running on Azure VMs. They are all equal nodes, and any can be master. The VMs are part of a load balanced set in Azure, so any one of the VMs can respond to the incoming request, coordinate the inter-node communication, and return the response.
We need to upgrade our cluster to 2.3.2. I attempted to upgrade one of the nodes, but as soon as I brought it online, all communication with Elasticsearch failed (we couldn't connect via Sense, and our back-end API started getting 403s). When I brought the node down and back up as 2.0.0, things returned to normal.
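For reference, this is roughly how I check which versions are actually in the cluster (the host below is a placeholder for one of our nodes):

```shell
# Hypothetical host; substitute any reachable node in the cluster.
ES_HOST="localhost:9200"

# Cat API listing each node's name, Elasticsearch version, and master flag.
NODES_URL="http://${ES_HOST}/_cat/nodes?v&h=name,version,master"
curl -s "$NODES_URL" || true  # tolerate failure when the node is unreachable
```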
As a workaround, I brought up four new VMs running 2.3.2 from scratch. I then brought down the 2.0.0 nodes one at a time, waiting for green status before continuing to the next. When I brought the last 2.0.0 node down, all communication failed again; this one was my fault, as I had forgotten to add the new VMs to the Azure load balancer. I brought one of the 2.0.0 nodes back up and communication was restored. Because that node is now an older version, and its data had already been relocated to the 2.3.2 nodes, no shards were copied back to it. It is currently just processing incoming requests (i.e. it is not master and is not hosting any shards).
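The "wait for green" step between shutdowns was a simple health check along these lines (host is again a placeholder):

```shell
# Hypothetical host; substitute any reachable node in the cluster.
ES_HOST="localhost:9200"

# Block until the cluster reports green status, or give up after 60s.
HEALTH_URL="http://${ES_HOST}/_cluster/health?wait_for_status=green&timeout=60s"
curl -s "$HEALTH_URL" || true  # tolerate failure when no node is reachable
```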
I added one of the 2.3.2 nodes to the load balancer. In Network Monitor I can see it start processing incoming requests on port 9200, but our API immediately starts getting 403s. When I remove the VM from the load balancer, incoming traffic stops (though inter-node traffic on 9300 is still fine) and our API is happy again.
In the log, I do see:

```
[2016-08-01 05:04:03,668][INFO ][rest.suppressed ] /_nodes Params: {}
java.util.ConcurrentModificationException
    at java.util.ArrayList.sort(Unknown Source)
    at java.util.Collections.sort(Unknown Source)
    at org.elasticsearch.action.admin.cluster.node.info.PluginsInfo.getInfos(PluginsInfo.java:55)
    at org.elasticsearch.action.admin.cluster.node.info.PluginsInfo.toXContent(PluginsInfo.java:94)
```
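For what it's worth, the request that appears in that trace is just the plain nodes-info endpoint, which can be hit directly to reproduce (host is a placeholder for the 2.3.2 node's address):

```shell
# Hypothetical host; substitute the 2.3.2 node's address.
ES_HOST="localhost:9200"

# Plain nodes-info request -- the same /_nodes call shown in the log above.
NODES_INFO_URL="http://${ES_HOST}/_nodes"
curl -s "$NODES_INFO_URL" || true  # tolerate failure when the node is unreachable
```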
Thinking it might be a problem with mixing the 2.0.0 and 2.3.2 plugins, I brought the 2.0.0 node down again and tried bringing it up as 2.3.2, but again everything broke.
I am now in a risky situation where, if this one node goes down, the entire cluster becomes inaccessible. Does anyone have any suggestions? Thanks!