[Solved] Rolling upgrade 7.6.1 -> 7.9.0 suddenly failed on node: Java error?

Hi, I'm doing a rolling upgrade, something I've never had an issue with before.
The cluster is on 7.6.1 and I'm going to the latest release, 7.9.0.

All nodes are identical Debian stable machines, so far so good.
I upgraded the very first node following the documentation and, as expected, there was no problem.

Upgrading a second node fails: Elasticsearch doesn't start.
Both machines are data nodes, basically clones, so I see absolutely no difference between them beyond naming and the like.

The error in the log is:

[2020-09-02T13:05:25,077][ERROR][o.e.b.Bootstrap ] [elk-data09] node validation exception
[1] bootstrap checks failed
[1]: JVM is using the serial collector but should not be for the best performance; either it's the default for the VM [OpenJDK 64-Bit Server VM] or -XX:+UseSerialGC was explicitly specified
[2020-09-02T13:05:25,093][INFO ][o.e.n.Node ] [elk-data09] stopping ...

I made no changes to jvm.options.
Both machines have the same jvm.options file. I've even compared the running processes with 'ps aux', and they look the same.
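For anyone wanting to repeat that comparison without eyeballing it, here is a minimal sketch that compares only the active (non-comment, non-blank) lines of two jvm.options files. The function names are made up for illustration, and it assumes you have copied both files to one place (on a deb install the file normally lives at /etc/elasticsearch/jvm.options).

```python
def effective_lines(text):
    """Return the non-blank, non-comment lines of a jvm.options file."""
    lines = []
    for raw in text.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            lines.append(line)
    return lines


def compare_options(text_a, text_b):
    """Return (only_in_a, only_in_b) as sorted lists of active option lines."""
    a = set(effective_lines(text_a))
    b = set(effective_lines(text_b))
    return sorted(a - b), sorted(b - a)
```

Read each file with `open(path).read()` and pass the two strings to `compare_options`. Two empty lists mean the active options match, which only tells you the files agree, not that they suit the bundled JDK.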

root@elk-data09:~# /usr/share/elasticsearch/jdk/bin/java -version
openjdk version "14.0.1" 2020-04-14
OpenJDK Runtime Environment AdoptOpenJDK (build 14.0.1+7)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 14.0.1+7, mixed mode, sharing)

No other Java package is present; these are dedicated Elasticsearch machines.

What could I check?

Ok ... so according to documentation at: https://www.elastic.co/guide/en/elasticsearch/reference/current/_use_serial_collector_check.html

'the default JVM configuration that ships with Elasticsearch configures Elasticsearch to use the CMS collector.' So why is it trying to run with the serial garbage collector?

I've seen no trace of -XX:+UseSerialGC on the running processes. Where does it get that configuration from? I've tried reinstalling the package in case something was wrong with the downloaded binaries, but nothing changed.

I also see these options on the process, the same ones present in jvm.options:

 -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly

Now, checking the systemd output, I notice this:

OpenJDK 64-Bit Server VM warning: Ignoring option UseConcMarkSweepGC; support was removed in 14.0
Sep 02 14:12:34 elk-data09 systemd-entrypoint[12015]: OpenJDK 64-Bit Server VM warning: Ignoring option CMSInitiatingOccupancyFraction; support was removed in 14.0
Sep 02 14:12:34 elk-data09 systemd-entrypoint[12015]: OpenJDK 64-Bit Server VM warning: Ignoring option UseCMSInitiatingOccupancyOnly; support was removed in 14.0
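To tie those warnings back to the config, a small sketch along these lines can pull the ignored option names out of the journal text and list the active jvm.options lines that still mention them. The regex is written against the warning wording shown above, and the helper names are hypothetical.

```python
import re

# Matches the JVM warning wording seen in the systemd output above.
IGNORED = re.compile(
    r"warning: Ignoring option (\S+); support was removed in (\S+)"
)


def ignored_options(log_text):
    """Return (option_name, removed_in_version) pairs found in log text."""
    return [(m.group(1), m.group(2)) for m in IGNORED.finditer(log_text)]


def offending_lines(options_text, names):
    """Return (line_number, line) for active lines mentioning any name."""
    hits = []
    for number, raw in enumerate(options_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and commented-out options
        if any(name in line for name in names):
            hits.append((number, line))
    return hits
```

Feed `ignored_options` the text captured from the journal, then pass the option names and the contents of jvm.options to `offending_lines` to see which lines to review.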

Something is wrong here. By default the installation script keeps the existing jvm.options file, so maybe it is outdated.

It turns out the jvm.options file is outdated.

For some reason I still don't understand, one machine works with a config that is rejected on another. The machines started as VM clones and were upgraded one after the other. I would dare to say something is not quite right with the Java bootstrap.
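One possible explanation, not confirmed anywhere in this thread: once the CMS flags are ignored, no collector is selected explicitly, and HotSpot then picks one based on the hardware it detects (G1 on what it considers a server-class machine, roughly two or more processors and about 2 GB of memory, serial otherwise). VMs with different CPU or memory allocations could therefore end up on different collectors. A rough way to check is to ask each JVM which collector flags it enabled with `-XX:+PrintFlagsFinal`. The sketch below assumes the usual `bool Name = value {...}` layout of that output, and the function names are made up.

```python
import re
import subprocess

# Collector selection flags; other flags that merely end in "GC" are ignored.
COLLECTOR_FLAGS = {
    "UseSerialGC", "UseParallelGC", "UseConcMarkSweepGC",
    "UseG1GC", "UseZGC", "UseShenandoahGC", "UseEpsilonGC",
}

FLAG_LINE = re.compile(r"^\s*bool\s+(\w+)\s*:?=\s*(true|false)\b")


def enabled_collectors(flags_output):
    """Return the collector flags reported as true in PrintFlagsFinal output."""
    enabled = []
    for line in flags_output.splitlines():
        m = FLAG_LINE.match(line)
        if m and m.group(1) in COLLECTOR_FLAGS and m.group(2) == "true":
            enabled.append(m.group(1))
    return enabled


def read_flags(java_path, extra_options=()):
    """Run the given java binary and return its PrintFlagsFinal output."""
    cmd = [java_path, *extra_options, "-XX:+PrintFlagsFinal", "-version"]
    return subprocess.run(cmd, capture_output=True, text=True).stdout
```

For example, `enabled_collectors(read_flags("/usr/share/elasticsearch/jdk/bin/java"))` on each node shows what that JVM chooses by default on that machine. Pass the same options the Elasticsearch process uses as `extra_options` to get closer to the real startup.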

One machine bootstraps with the 'old' jvm.options file, like this (it complains about the deprecated options but manages to come up):

## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

## G1GC Configuration
# NOTE: G1GC is only supported on JDK version 10 or later.
# To use G1GC uncomment the lines below.
# 10-:-XX:-UseConcMarkSweepGC
# 10-:-XX:-UseCMSInitiatingOccupancyOnly
# 10-:-XX:+UseG1GC
# 10-:-XX:G1ReservePercent=25
# 10-:-XX:InitiatingHeapOccupancyPercent=30

while on another (identical) machine I had to change it, otherwise it doesn't come up:

## GC configuration   <- now deprecated
#-XX:+UseConcMarkSweepGC
#-XX:CMSInitiatingOccupancyFraction=75
#-XX:+UseCMSInitiatingOccupancyOnly

## G1GC Configuration
# NOTE: G1GC is only supported on JDK version 10 or later.   <-Ok I'm 10 or later
# To use G1GC uncomment the lines below.
10-:-XX:-UseConcMarkSweepGC
10-:-XX:-UseCMSInitiatingOccupancyOnly
10-:-XX:+UseG1GC
10-:-XX:G1ReservePercent=25
10-:-XX:InitiatingHeapOccupancyPercent=30
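The `10-:` prefix on those lines is the jvm.options version syntax: `n:` applies an option to Java major version n only, `n-:` to n and later, and `n-m:` to the range n through m, while lines starting with `-` apply to every version. As a minimal sketch of how that selection works (this is not Elasticsearch's actual parser, and `options_for` is a made-up name), the following picks the options that would apply for a given major version:

```python
import re

# "8:-Xmx2g", "10-:-XX:+UseG1GC" or "8-13:-XX:+UseConcMarkSweepGC"
RANGE = re.compile(r"^(\d+)(?:(-)(\d+)?)?:(-.*)$")


def options_for(text, major):
    """Return the JVM options from jvm.options text that apply to `major`."""
    selected = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith("-"):
            selected.append(line)  # unprefixed option: applies to all versions
            continue
        m = RANGE.match(line)
        if not m:
            continue  # this sketch silently skips lines it cannot parse
        low = int(m.group(1))
        if m.group(2) is None:
            high = low            # "n:" means exactly n
        elif m.group(3) is None:
            high = None           # "n-:" means n or later
        else:
            high = int(m.group(3))  # "n-m:" means n through m
        if major >= low and (high is None or major <= high):
            selected.append(m.group(4))
    return selected
```

With this you can see, for example, that every `10-:` line above is still passed to the bundled JDK 14, including the two lines that switch CMS off, whereas a line limited to an upper version such as a hypothetical `8-13:` prefix would not be.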

Hope this helps others.

Thanks for sharing your solution!

Hi.

As a better way to solve this (getting rid of those warnings in the systemd output completely), I found the jvm.options file in the Elasticsearch sources on GitHub here:

Using the GC settings at the end of that file, it all works with no errors :slight_smile:

I would suggest that the deb package script try to replace the jvm.options file (prompting the user to allow the overwrite, just like filebeat does with filebeat.yml and so on), giving the administrator a clue that the Java configuration in the package contains changes to be aware of.

Thank you very much.
Best regards.