Node upgraded from 2.0.0 to 2.3.2 can't communicate with other nodes in cluster

We have a 4-node Elasticsearch cluster running 2.0.0. We upgraded our test environment to 2.3.2, but we are running into problems upgrading our production environment. I followed the rolling upgrade process. When the upgraded node joins the cluster, queries on the site begin to fail; I assume this happens when a request is routed to the upgraded node.

It seems that the newly upgraded node is unable to communicate with the other nodes (in particular the master), and we have the cluster configured so that a single node can't become master without a quorum (to avoid the "split brain" issue).
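
For context, that guard is the standard minimum_master_nodes setting; with all four nodes master-eligible, the quorum works out to (4 / 2) + 1 = 3, so each node's config has something like:

discovery.zen.minimum_master_nodes: 3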

My full description is 4x the length allowed to post here, so the full details with log data can be found here:

http://stackoverflow.com/questions/38464357/site-queries-fail-after-bringing-upgraded-elasticsearch-cluster-online

Are there any known issues with bringing a node upgraded to 2.3.2 online in a cluster where the other nodes are still on 2.0.0? When I brought the node down and back up with 2.0.0, everything worked fine. The config.yml and all environment settings (such as ES_HEAP_SIZE, which is set to 7g on my system with 14 GB of RAM) are identical; the only thing that changed is the version of ES.

thanks! ~john

Hi John,

I'd concentrate on the OutOfMemoryErrors. Can you first check that the 2.3.2 Elasticsearch process is indeed using 7 GB of heap space? If you have a JDK on your production server, you can check the exact command line, with all parameters, for every Java process on the machine with jps -v.

It should show something like:

18795 Elasticsearch -Xms256m -Xmx2g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+DisableExplicitGC -XX:+AlwaysPreTouch -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:+HeapDumpOnOutOfMemoryError -Des.path.home=/Users/dm/tests/elasticsearch-5.0.0-alpha4

If jps is not available, you can also use ps.
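
Something like this works on a Unix-like host (on Windows the same -Xms/-Xmx flags are visible in the process's command line, e.g. in the Task Manager details view):

ps aux | grep -i elasticsearch    # then look for the -Xms and -Xmx values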

If you are indeed seeing 7 GB for Xms and Xmx, then you can look into why it is using so much memory. You can produce a heap dump when the OutOfMemoryError occurs by adding -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/your/heapdump_file to ES_JAVA_OPTS.
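
With the standard startup script that would look something like this (the heap dump path is just a placeholder; pick a disk with enough free space for a dump the size of the heap):

export ES_JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/your/heapdump_file"
./bin/elasticsearch

On Windows, the same variable can be set before running elasticsearch.bat; if Elasticsearch runs as a service, the JVM flags go into the service configuration instead.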

Daniel

I think all my logs above were red herrings. I brought the upgraded node up again, and again my site started failing, but there was nothing amiss in that node's logs. I looked at the logs on the master node, and it was filled with exceptions pointing to my upgraded node. They were all like the following (pasted without call stacks). Manta is the master and Unthinnk is the upgraded node.

[2016-07-20 15:45:48,589][WARN ][gateway ] [Manta] [ml_v7][1]: failed to list shard for shard_store on node [Qg3ghrCkT7mjUoc7hmItiA]
FailedNodeException[Failed node [Qg3ghrCkT7mjUoc7hmItiA]]; nested: RemoteTransportException[[Unthinnk][10.0.0.7:9300][internal:cluster/nodes/indices/shard/store[n]]]; nested: ElasticsearchException[Failed to list store metadata for shard [[ml_v7][1]]]; nested: IndexFormatTooNewException[Format version is not supported (resource BufferedChecksumIndexInput(SimpleFSIndexInput(path="F:\data\elasticsearch\nodes\0\indices\ml_v7\1\index\segments_1s1"))): 6 (needs to be between 0 and 5)];
Caused by: RemoteTransportException[[Unthinnk][10.0.0.7:9300][internal:cluster/nodes/indices/shard/store[n]]]; nested: ElasticsearchException[Failed to list store metadata for shard [[ml_v7][1]]]; nested: IndexFormatTooNewException[Format version is not supported (resource BufferedChecksumIndexInput(SimpleFSIndexInput(path="F:\data\elasticsearch\nodes\0\indices\ml_v7\1\index\segments_1s1"))): 6 (needs to be between 0 and 5)];
Caused by: ElasticsearchException[Failed to list store metadata for shard [[ml_v7][1]]]; nested: IndexFormatTooNewException[Format version is not supported (resource BufferedChecksumIndexInput(SimpleFSIndexInput(path="F:\data\elasticsearch\nodes\0\indices\ml_v7\1\index\segments_1s1"))): 6 (needs to be between 0 and 5)];
Caused by: org.apache.lucene.index.IndexFormatTooNewException: Format version is not supported (resource BufferedChecksumIndexInput(SimpleFSIndexInput(path="F:\data\elasticsearch\nodes\0\indices\ml_v7\1\index\segments_1s1"))): 6 (needs to be between 0 and 5)

I see that we have this:

ES 2.0.0 --> Lucene 5.2.1
ES 2.3.2 --> Lucene 5.5.0

Are there any compatibility issues with bringing up a node on Lucene 5.5.0 while the rest of the cluster is on 5.2.1?
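
For what it's worth, the versions are easy to confirm per node: the root endpoint reports both the ES and Lucene version, and _cat/nodes lists the ES version of every node (adjust host/port as needed):

curl 'localhost:9200/'                              # version.number and version.lucene_version
curl 'localhost:9200/_cat/nodes?v&h=name,version'   # ES version of each node in the cluster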

Shards that have been allocated to the newer node are upgraded and cannot subsequently be allocated back to nodes with a lower version, so you should complete the rolling upgrade to ensure that all nodes in the cluster are running the same Elasticsearch version.

@jthoni When doing a rolling upgrade, Elasticsearch should not relocate shards from new nodes to old nodes, so you shouldn't get the format version exceptions that you're seeing. Did you start a 2.3.2 node, then stop it and go back to 2.0.0 on the same box? If so, that would account for the "format version too new" exceptions.

Well, that would explain the exceptions. I took the node down, then brought it up with 2.3.2. All queries to the production cluster started failing, so I immediately took it down and brought it back up with 2.0.0. I guess the index got flagged as 2.3.2, so that accounts for the exception.

I would assume, therefore, that data in the shards now flagged as 2.3.2 will not be replicated to another node (as all the others are 2.0.0), right?
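
I can check which nodes the ml_v7 shard copies are on, and whether any replicas are stuck UNASSIGNED, with something like this (adjusting host/port for our cluster):

curl 'localhost:9200/_cat/shards/ml_v7?v'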

To verify: I should be able to use a rolling upgrade to go from 2.0.0 to 2.3.2 without taking down the entire cluster, right? I am now hesitant to experiment, as it is having adverse effects on production (note that the upgrade worked fine in our test environment).
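
For reference, the per-node sequence I understand I should be following is roughly this (a sketch using curl against one of the nodes; host/port adjusted as needed):

# 1. disable shard allocation so shards don't shuffle while the node is down
curl -XPUT 'localhost:9200/_cluster/settings' -d '{"transient":{"cluster.routing.allocation.enable":"none"}}'
# 2. stop the node, upgrade it to 2.3.2, start it, and wait for it to rejoin the cluster
# 3. re-enable allocation and wait for the cluster to go green before moving to the next node
curl -XPUT 'localhost:9200/_cluster/settings' -d '{"transient":{"cluster.routing.allocation.enable":"all"}}'
curl 'localhost:9200/_cat/health?v'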