Upgrade from 5.1.0 to 5.1.1 on CentOS 6 makes a node unable to join the cluster

I have a simple 3-node ELK 5.1.0 cluster that I am trying to upgrade to 5.1.1 using the provided RPMs.

I can't see anything in the release notes for 5.1.1 that could cause this.

The cluster works as expected on 5.1.0, but when I upgrade one node to 5.1.1 and start it up with exactly the same configuration, I get the following (only pasting what I think is relevant out of the Java stack trace):

[2016-12-12T10:08:53,557][INFO ][o.e.n.Node               ] [infra-elk-es1.lon-dc.mintel.ad] initializing ...
[2016-12-12T10:08:53,610][INFO ][o.e.e.NodeEnvironment    ] [infra-elk-es1.lon-dc.mintel.ad] using [1] data paths, mounts [[/data (/dev/vdb)]], net usable_space [397.8gb], net total_space [499.7gb], spins? [possibly], types [xfs]
[2016-12-12T10:08:53,611][INFO ][o.e.e.NodeEnvironment    ] [infra-elk-es1.lon-dc.mintel.ad] heap size [7.9gb], compressed ordinary object pointers [true]
[2016-12-12T10:08:54,674][INFO ][o.e.n.Node               ] [infra-elk-es1.lon-dc.mintel.ad] node name [infra-elk-es1.lon-dc.mintel.ad], node ID [vN6s_3-XS2i11w1v59o8dg]
[2016-12-12T10:08:54,676][INFO ][o.e.n.Node               ] [infra-elk-es1.lon-dc.mintel.ad] version[5.1.1], pid[18954], build[5395e21/2016-12-06T12:36:15.409Z], OS[Linux/2.6.32-642.6.2.el6.x86_64/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_111/25.111-b15]
[2016-12-12T10:08:55,317][INFO ][o.e.p.PluginsService     ] [infra-elk-es1.lon-dc.mintel.ad] loaded module [aggs-matrix-stats]
[2016-12-12T10:08:55,317][INFO ][o.e.p.PluginsService     ] [infra-elk-es1.lon-dc.mintel.ad] loaded module [ingest-common]
[2016-12-12T10:08:55,317][INFO ][o.e.p.PluginsService     ] [infra-elk-es1.lon-dc.mintel.ad] loaded module [lang-expression]
[2016-12-12T10:08:55,317][INFO ][o.e.p.PluginsService     ] [infra-elk-es1.lon-dc.mintel.ad] loaded module [lang-groovy]
[2016-12-12T10:08:55,317][INFO ][o.e.p.PluginsService     ] [infra-elk-es1.lon-dc.mintel.ad] loaded module [lang-mustache]
[2016-12-12T10:08:55,317][INFO ][o.e.p.PluginsService     ] [infra-elk-es1.lon-dc.mintel.ad] loaded module [lang-painless]
[2016-12-12T10:08:55,318][INFO ][o.e.p.PluginsService     ] [infra-elk-es1.lon-dc.mintel.ad] loaded module [percolator]
[2016-12-12T10:08:55,318][INFO ][o.e.p.PluginsService     ] [infra-elk-es1.lon-dc.mintel.ad] loaded module [reindex]
[2016-12-12T10:08:55,318][INFO ][o.e.p.PluginsService     ] [infra-elk-es1.lon-dc.mintel.ad] loaded module [transport-netty3]
[2016-12-12T10:08:55,318][INFO ][o.e.p.PluginsService     ] [infra-elk-es1.lon-dc.mintel.ad] loaded module [transport-netty4]
[2016-12-12T10:08:55,318][INFO ][o.e.p.PluginsService     ] [infra-elk-es1.lon-dc.mintel.ad] no plugins loaded
[2016-12-12T10:09:01,099][INFO ][o.e.n.Node               ] [infra-elk-es1.lon-dc.mintel.ad] initialized
[2016-12-12T10:09:01,099][INFO ][o.e.n.Node               ] [infra-elk-es1.lon-dc.mintel.ad] starting ...
[2016-12-12T10:09:01,223][INFO ][o.e.t.TransportService   ] [infra-elk-es1.lon-dc.mintel.ad] publish_address {172.31.0.6:9300}, bound_addresses {0.0.0.0:9300}
[2016-12-12T10:09:01,228][INFO ][o.e.b.BootstrapCheck     ] [infra-elk-es1.lon-dc.mintel.ad] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2016-12-12T10:09:04,305][INFO ][o.e.d.z.ZenDiscovery     ] [infra-elk-es1.lon-dc.mintel.ad] failed to send join request to master [{infra-elk-es4.lon-dc.mintel.ad}{5CA_TtyUSDau2ZtJBKWzyQ}{y73HV3ReSi67NGuyk9Shhg}{172.31.1.232}{172.31.1.232:9300}], reason [RemoteTransportException[[Failed to deserialize exception response from stream]]; nested: TransportSerializationException[Failed to deserialize exception response from stream]; nested: IllegalArgumentException[port out of range:2380801]; ]
[2016-12-12T10:09:04,332][WARN ][o.e.t.n.Netty4Transport  ] [infra-elk-es1.lon-dc.mintel.ad] exception caught on transport layer [[id: 0xad57d6a8, L:/172.31.0.6:40484 - R:172.31.1.232/172.31.1.232:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [14], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/future(org.elasticsearch.transport.EmptyTransportResponseHandler@77d1e794)], error [true]; resetting

The line about the port being out of range is particularly interesting to me:

port out of range:2380801 

Sure, that port is out of range if we are talking about TCP ports ...

After this the node tries to join again and again, with the same error and the exact same "port out of range" message.

# java -version
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 1.8.0_111-b15)
OpenJDK 64-Bit Server VM (build 25.111-b15, mixed mode)

Reverting the node to 5.1.0 fixes the problem for now.

Is anyone experiencing the same issue, or does anyone have an idea what I am doing wrong?

Did you really have a 5.1.0 version before? I'm asking because 5.1.0 should not have been released, as it was not ready for production.

From: Elastic Stack 5.1.1 Released | Elastic Blog

Yup, you read that right. Version 5.1.0 doesn’t exist because, for a short period of time, the Elastic Yum and Apt repositories included unreleased binaries labeled 5.1.0. To avoid any confusion, and upgrade issues for the people that have installed these without realizing, we have decided to skip the 5.1.0 version and release 5.1.1 instead.

So, can you confirm what this gives:

GET /
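Or, to see at a glance which build every node that has joined the cluster is running, the cat nodes API works too (assuming each node answers HTTP on localhost:9200, adjust the host if not):

# one line per joined node: name, IP and Elasticsearch version
curl 'localhost:9200/_cat/nodes?v&h=name,ip,version'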

I do mirror the repo, so I guess I was lucky enough to mirror it during that short period of time.

[root@infra-elk-es1 ~]# curl localhost:9200/ 2>/dev/null | jq .
{
  "tagline": "You Know, for Search",
  "version": {
    "lucene_version": "6.3.0",
    "build_snapshot": false,
    "build_date": "2016-11-24T08:20:05.232Z",
    "build_hash": "e5e3f1f",
    "number": "5.1.0"
  },
  "cluster_uuid": "FQXt2XxHSXaxbAiHXEea9g",
  "cluster_name": "infra_elk",
  "name": "infra-elk-es1.lon-dc.mintel.ad"
}

Could there be a flag in 5.1.1 that prevents it from joining a 5.1.0 cluster, since 5.1.0 is kind of marked as broken?

I can totally bring this cluster down, upgrade all nodes to 5.1.1 and bring it back up if that helps ... I just don't want to do it purely as a test, since the downgrade might become messy.

At the moment testing the upgrade and downgrading again is really easy, since the updated node simply does not join.
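If it does come to a full restart, the plan would be roughly the standard full-cluster-restart upgrade. A sketch of the usual steps (not commands I have actually run here; service and package names are as shipped in the RPM):

# on one node, before stopping anything: stop shard allocation and try a synced flush
curl -XPUT 'localhost:9200/_cluster/settings' -d '{"transient":{"cluster.routing.allocation.enable":"none"}}'
curl -XPOST 'localhost:9200/_flush/synced'

# on every node: stop the service, upgrade the package from the (mirrored) repo, start it again
service elasticsearch stop
yum update elasticsearch
service elasticsearch start

# once all nodes are back in the cluster: re-enable allocation and watch recovery
curl -XPUT 'localhost:9200/_cluster/settings' -d '{"transient":{"cluster.routing.allocation.enable":"all"}}'
curl 'localhost:9200/_cat/health?v'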

Could there be a flag in 5.1.1 that prevents it from joining a 5.1.0 cluster, since 5.1.0 is kind of marked as broken?

No, there is not, AFAIK, but for sure something weird is happening.

I'll try to reproduce your case in the next few hours to see if I can find anything.

Thanks for opening that. Maybe it is worth opening an issue on GitHub in the meantime and linking it to this discussion?

Will do, thanks

Let me know if there is any other information you need.

I am running on CentOS 6.8; the Java version is posted in the first comment.

Config:

cluster.name: infra_elk
node.name: infra-elk-es1.lon-dc.mintel.ad
path.data: /data/elasticsearch/data
path.logs: /data/elasticsearch/logs
bootstrap.memory_lock: true
http.host: 127.0.0.1
transport.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: ["infra-elk-es1.lon-dc.mintel.ad", "infra-elk-es4.lon-dc.mintel.ad", "infra-elk-es3.lon-dc.mintel.ad"]
discovery.zen.minimum_master_nodes: 2
http.cors.enabled: true
http.cors.allow-origin: "/.*/"

In jvm.options I am only setting the heap; everything else is as shipped with the RPM:

-Xms8g
-Xmx8g
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+DisableExplicitGC
-XX:+AlwaysPreTouch
-server
-Djava.awt.headless=true
-Dfile.encoding=UTF-8
-Djna.nosys=true
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Dlog4j.skipJansi=true
-XX:+HeapDumpOnOutOfMemoryError
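A quick way to confirm the heap setting is actually being picked up (again assuming the node answers on localhost:9200) is the nodes info API:

# reports heap_init / heap_max as seen by the running JVM on each node
curl 'localhost:9200/_nodes/jvm?pretty&filter_path=nodes.*.jvm.mem'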

GitHub issue: 22113

A full cluster restart with the upgrade seems to have worked fine.
