Can't route incoming traffic to new nodes

I hope someone can help, as I have been banging my head against this for a couple of weeks.

I had a four-node 2.0.0 cluster running on Azure VMs. All of the nodes are equal, and any of them can be master. The VMs are part of a load-balanced set in Azure, so any one of them can accept an incoming request, coordinate the inter-node communication, and return the response.
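For context, the nodes are all master-eligible data nodes behind the load balancer; the relevant parts of elasticsearch.yml look roughly like this (a sketch with illustrative values, not my exact file):

# every node can be elected master and holds data
node.master: true
node.data: true
# bind to the VM's internal address so the other nodes can reach it
network.host: 10.0.0.8
# the other cluster members (internal Azure IPs)
discovery.zen.ping.unicast.hosts: ["10.0.0.4", "10.0.0.5", "10.0.0.6", "10.0.0.11"]
# majority of the four master-eligible nodes
discovery.zen.minimum_master_nodes: 3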

We need to upgrade our cluster to 2.3.2. I attempted to upgrade one of the nodes, but as soon as I brought it online, all communication with ES failed (we could not connect via Sense, and our back-end API started getting 403s). When I brought the node down and back up as 2.0.0, things went back to normal.

As a workaround, I brought up four new VMs with 2.3.2 from scratch. I then brought the 2.0.0 nodes down one at a time, waiting for green state before continuing to the next. When I brought the last 2.0.0 node down, all communication failed again. That one was my fault: I had forgotten to add the new VMs to the Azure load balancer. I brought one of the 2.0.0 nodes back up, and communication was restored. Since that 2.0.0 node is now an older version than the rest of the cluster, and its data had already been relocated to the 2.3.2 nodes, no shards were allocated back to it. It is currently just processing incoming requests (i.e. it is not master and is not hosting any shards).
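To wait for green between shutdowns I polled cluster health on one of the remaining nodes, along these lines (the host and timeout are illustrative):

curl -s 'http://10.0.0.4:9200/_cluster/health?wait_for_status=green&timeout=10m'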

I added one of the 2.3.2 nodes to the load balancer. In Network Monitor I can see it start processing incoming requests on port 9200, but our API immediately starts getting 403s. When I remove the VM from the load balancer, incoming traffic to it stops (inter-node traffic on 9300 is still fine), and our API is happy again.

In the log, I do see:

[2016-08-01 05:04:03,668][INFO ][rest.suppressed ] /_nodes Params: {}
java.util.ConcurrentModificationException
at java.util.ArrayList.sort(Unknown Source)
at java.util.Collections.sort(Unknown Source)
at org.elasticsearch.action.admin.cluster.node.info.PluginsInfo.getInfos(PluginsInfo.java:55)
at org.elasticsearch.action.admin.cluster.node.info.PluginsInfo.toXContent(PluginsInfo.java:94)

Thinking it might be a problem with mixing 2.0.0 and 2.3.2 plugins, I brought the 2.0.0 node down again and tried bringing it up as 2.3.2, but again everything broke.

I am now in a risky position: if this one node goes down, the entire cluster is inaccessible. Does anyone have any suggestions? Thanks!

Note that I have these plugins installed:

delete-by-query
HQ
Head

I have found this:
Fix ConcurrentModificationException from nodes info and nodes stats #15541

That fix landed in 2.1, so when I brought the old node up as 2.3.2 I would not have expected this to be an issue.

I dug through the log and found the following. I only had the node up as 2.3.2 for a minute or so, so my shutting it down could be what caused these errors (just a guess):

[2016-08-01 05:07:33,520][INFO ][node ] [Poundcakes] version[2.3.2], pid[480], build[b9e4a6a/2016-04-21T16:03:47Z]
[2016-08-01 05:07:33,520][INFO ][node ] [Poundcakes] initializing ...
[2016-08-01 05:07:36,926][INFO ][plugins ] [Poundcakes] modules [reindex, lang-expression, lang-groovy], plugins [head, delete-by-query, hq], sites [head, hq]
[2016-08-01 05:07:37,098][INFO ][env ] [Poundcakes] using [1] data paths, mounts [[Elasticsearch Data (F:)]], net usable_space [1022gb], net total_space [1022.9gb], spins? [unknown], types [NTFS]
[2016-08-01 05:07:37,098][INFO ][env ] [Poundcakes] heap size [13.4gb], compressed ordinary object pointers [true]
[2016-08-01 05:07:46,398][INFO ][node ] [Poundcakes] initialized
[2016-08-01 05:07:46,398][INFO ][node ] [Poundcakes] starting ...
[2016-08-01 05:07:46,695][INFO ][transport ] [Poundcakes] publish_address {10.0.0.8:9300}, bound_addresses {10.0.0.8:9300}
[2016-08-01 05:07:46,695][INFO ][discovery ] [Poundcakes] elasticsearch/ocGAfqwfRiK_AMeRMB9dbw
[2016-08-01 05:07:49,815][INFO ][cluster.service ] [Poundcakes] detected_master {Iridia}{KIkPGWvSQsmxBkKFCmlITw}{10.0.0.11}{10.0.0.11:9300}, added {{Iridia}{KIkPGWvSQsmxBkKFCmlITw}{10.0.0.11}{10.0.0.11:9300},{Futurist}{gDoveVWAStGeQ1lamFn6uw}{10.0.0.4}{10.0.0.4:9300},{Spider-Girl}{qIZOG8vJRxGzf9dfxGRA6w}{10.0.0.5}{10.0.0.5:9300},{Agent X}{vBiHqc3qT9m_vb8Fz5-hlw}{10.0.0.6}{10.0.0.6:9300},}, reason: zen-disco-receive(from master [{Iridia}{KIkPGWvSQsmxBkKFCmlITw}{10.0.0.11}{10.0.0.11:9300}])
[2016-08-01 05:07:49,987][INFO ][http ] [Poundcakes] publish_address {10.0.0.8:9200}, bound_addresses {10.0.0.8:9200}
[2016-08-01 05:07:49,987][INFO ][node ] [Poundcakes] started
[2016-08-01 05:08:51,310][INFO ][bootstrap ] running graceful exit on windows
[2016-08-01 05:08:51,310][INFO ][node ] [Poundcakes] stopping ...
[2016-08-01 05:08:51,371][WARN ][netty.channel.DefaultChannelPipeline] An exception was thrown by an exception handler.
java.util.concurrent.RejectedExecutionException: Worker has already been shutdown
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.registerTask(AbstractNioSelector.java:120)
...
[2016-08-01 05:08:51,371][WARN ][indices.cluster ] [Poundcakes] [[ml_v7][0]] marking and sending shard failed due to [failed recovery]
RecoveryFailedException[[ml_v7][0]: Recovery failed from {Agent X}{vBiHqc3qT9m_vb8Fz5-hlw}{10.0.0.6}{10.0.0.6:9300} into {Poundcakes}{ocGAfqwfRiK_AMeRMB9dbw}{10.0.0.8}{10.0.0.8:9300}]; nested: TransportException[transport stopped, action: internal:index/shard/recovery/start_recovery];
at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:258)
at org.elasticsearch.indices.recovery.RecoveryTarget.access$1100(RecoveryTarget.java:69)
at org.elasticsearch.indices.recovery.RecoveryTarget$RecoveryRunner.doRun(RecoveryTarget.java:508)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: TransportException[transport stopped, action: internal:index/shard/recovery/start_recovery]
at org.elasticsearch.transport.TransportService$2.run(TransportService.java:206)
... 3 more
[2016-08-01 05:08:51,387][WARN ][cluster.action.shard ] [Poundcakes] failed to send failed shard to {Iridia}{KIkPGWvSQsmxBkKFCmlITw}{10.0.0.11}{10.0.0.11:9300}
SendRequestTransportException[[Iridia][10.0.0.11:9300][internal:cluster/shard/failure]]; nested: TransportException[TransportService is closed stopped can't send request];
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:340)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:299)
...
Caused by: TransportException[TransportService is closed stopped can't send request]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:320)
... 17 more

I have brought up another of my deprecated 2.0.0 nodes to handle incoming requests so that I have some resiliency, but I still need to figure out how to get my 2.3.2 nodes to take this over.

ping

Can you telnet to 9200/9300 on all nodes when they are up?
What does _cat/nodes show?
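For example (substitute your node IPs):

telnet 10.0.0.8 9200
telnet 10.0.0.8 9300
curl -s 'http://10.0.0.8:9200/_cat/nodes?v'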

Very strange. When I join a 2.3.2 node to the load-balanced set, I can still query fine (hitting my Azure Cloud Service endpoint as the URL). I can search, and _cat/nodes returns data on all of my nodes. Calling from my API, however, fails. I found the requests in a Network Monitor trace, and I see a "Status: Forbidden":

  5173	3:43:34 PM 8/2/2016	256.2907621	System	xxx.xxx.xxx.x	10.0.0.4	HTTP	HTTP:HTTP Payload, URL: /ocv/_search 	{HTTP:145, TCP:144, IPv4:132}
  5174	3:43:34 PM 8/2/2016	256.2907733	System	10.0.0.4	xxx.xxx.xxx.x	TCP	TCP:Flags=...A...., SrcPort=9200, DstPort=59654, PayloadLen=0, Seq=1111979944, Ack=4008854303, Win=4140 (scale factor 0x8) = 1059840	{TCP:144, IPv4:132}
  5175	3:43:34 PM 8/2/2016	256.2912859	System	10.0.0.4	xxx.xxx.xxx.x	HTTP	HTTP:Response, HTTP/1.1, Status: Forbidden, URL: /ocv/_search 	{HTTP:145, TCP:144, IPv4:132}

I should clarify the above: what I mean is "I can still query fine via curl and Sense."

RESOLVED

It turned out that we had this in our config.yml:

http.cors.enabled : true

I think this was left over from some early experiments. It did not cause a problem with 2.0.0, but I think with #18256 it caused requests coming from my API's origin to be rejected. I pulled this out of the config and, after joining the node, the calls were accepted.
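A note in case anyone else hits this: if you actually need CORS (e.g. for a browser-based client), the alternative to removing the setting is to allow the relevant origin explicitly, something like this (the origin value here is only an example, not ours):

http.cors.enabled: true
http.cors.allow-origin: "https://my-app.example.com"

In our case nothing needed CORS, so simply removing http.cors.enabled was the cleaner fix.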