Cluster yellow after upgrade from 7.16.2 to 7.17.7

Hi, I started a rolling upgrade following the directions in the guide. I have 6 data nodes (n1 and n2 are also ingest nodes). After successfully upgrading n4, I upgraded n6, and now the cluster is unable to get from yellow back to green. After working through the "Red or yellow cluster status" page, it looks like there is nowhere to allocate a replica of one of the shards of a new index.
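(For anyone following along: the status and the unassigned shard count come straight from the cluster health API; this is just a sketch assuming an unauthenticated node on localhost:9200.)

  # Overall status plus the count of shards that still have no home
  curl -s "localhost:9200/_cluster/health?pretty&filter_path=status,unassigned_shards"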

I am running with the options

  • cluster.routing.allocation.awareness.force.rack.values: r1, r2, r3
  • cluster.routing.allocation.awareness.attributes: rack

and node.attr.rack options for my nodes are configured as:

  • n1 node.attr.rack [r1]
  • n2 node.attr.rack [r2]
  • n3 node.attr.rack [r1]
  • n4 node.attr.rack [r2]
  • n5 node.attr.rack [r1]
  • n6 node.attr.rack [r2]
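The rack attributes and the awareness settings can also be double-checked from the API; a quick sketch, again assuming a plain localhost:9200 endpoint:

  # Which rack attribute each node advertises
  curl -s "localhost:9200/_cat/nodeattrs?v&h=node,attr,value"

  # The awareness settings the cluster is actually applying
  curl -s "localhost:9200/_cluster/settings?flat_settings=true&include_defaults=true&pretty" | grep awareness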

cluster allocation explanation gives me:

n1 - node_version (can't allocate the replica shard to a node running the older version)
n2 - node_version (can't allocate the replica shard to a node running the older version) and awareness (there is already a copy on r2; per the allocation awareness settings, the second copy needs to be on r1 or r3)
n3 - node_version (can't allocate the replica shard to a node running the older version)
n4 - same_shard (already holds a copy) and awareness (there is already a copy on r2; per the allocation awareness settings, the second copy needs to be on r1 or r3)
n5 - node_version (can't allocate the replica shard to a node running the older version)
n6 - awareness (there is already a copy on r2; per the allocation awareness settings, the second copy needs to be on r1 or r3)
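(The per-node reasons above come from the cluster allocation explain API; for reference, a minimal call for a replica of the new index looks roughly like this sketch, where the index name is a placeholder.)

  # Ask why the replica of shard 0 of the new index cannot be assigned;
  # calling the API with no request body explains the first unassigned shard instead
  curl -s -X GET "localhost:9200/_cluster/allocation/explain?pretty" \
    -H 'Content-Type: application/json' -d'
  {
    "index": "my-new-index",
    "shard": 0,
    "primary": false
  }'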

My mistake is that, without thinking, I started the upgrade with two consecutive r2 nodes, i.e., n4 and n6. If I had done one from r2 and another from r1, e.g., n4 and n5, I would not be in this mess right now.

Also, after the upgrade there is an issue on n6 with JVM heap memory. When I run /usr/share/elasticsearch/bin/elasticsearch --version I get:

Exception in thread "main" java.lang.RuntimeException: starting java failed with [1]
output:
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 34359738368 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /var/log/elasticsearch/hs_err_pid53775.log
error:
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fe0b4000000, 34359738368, 0) failed; error='Not enough space' (errno=12)
        at org.elasticsearch.tools.launchers.JvmOption.flagsFinal(JvmOption.java:119)
        at org.elasticsearch.tools.launchers.JvmOption.findFinalOptions(JvmOption.java:81)
        at org.elasticsearch.tools.launchers.JvmErgonomics.choose(JvmErgonomics.java:38)
        at org.elasticsearch.tools.launchers.JvmOptionsParser.jvmOptions(JvmOptionsParser.java:135)
        at org.elasticsearch.tools.launchers.JvmOptionsParser.main(JvmOptionsParser.java:86)

Despite the error, the node seems to work and has joined the cluster. The node has 64 GB of memory, and jvm.options already has -Xms32g and -Xmx32g.
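The 34359738368 bytes in the error is exactly 32 GiB, i.e. the configured -Xms32g heap. My reading (an assumption on my part, not something I have confirmed): the --version launcher spawns a second JVM with the same jvm.options to resolve the final flags, and with the running node already holding its own 32 GiB heap there is not another 32 GiB left to commit on a 64 GB host. A quick sketch to sanity-check that on n6:

  # 34359738368 bytes is 32 GiB, i.e. the configured -Xms32g heap
  echo $((34359738368 / 1024 / 1024 / 1024))   # prints 32

  # How much memory is actually available while the node is running
  free -g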

So, I have two issues. For the first one - what are my options?

  1. Can I just upgrade n5 (or n1 or n3) to 7.17.7 despite the yellow cluster status?
  2. Do I change the cluster routing allocation settings temporarily?
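For option 2, one approach (a sketch only; I have not verified that an empty value is accepted on 7.17) would be to override the awareness attributes through the cluster settings API, since transient/persistent API settings take precedence over elasticsearch.yml, and then remove the override once enough nodes are upgraded:

  # Temporarily disable allocation awareness (assumes an empty attribute list
  # is accepted); API-level settings override the values in elasticsearch.yml
  curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
  {
    "persistent": {
      "cluster.routing.allocation.awareness.attributes": ""
    }
  }'

  # Later, remove the override so the elasticsearch.yml value applies again
  curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
  {
    "persistent": {
      "cluster.routing.allocation.awareness.attributes": null
    }
  }'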

What about the second issue? I can't just throw more RAM at it, can I?

I hope to get my cluster to green so I can get on with upgrading. My final goal is to upgrade to 8.5.

Yellow means unassigned replicas. If one of the other nodes holds the primary of those shards, you'll want to make sure the replicas can be assigned, so that if something happens to the nodes with the primaries during the upgrade you still have a copy to recover from.
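Something like this sketch (host and port assumed) shows which shards are unassigned and where the primaries currently sit:

  # UNASSIGNED rows are the replicas keeping the cluster yellow; the STARTED
  # primary rows show which nodes you need to keep healthy in the meantime
  curl -s "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,node,unassigned.reason"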

It's totally ok to change allocation filtering temporarily.

I ended up just upgrading n5, which was in a different "rack". After the upgrade there was a longer period of re-indexing/reallocation, but within 30 minutes everything settled at green. I was able to upgrade to 8.5.3 after some small mishaps (I inherited the cluster with little documentation, so I missed some other special settings, although the rack attribute thing was the most time-consuming to solve).

The JVM issue still exists, but as it does not seem to affect cluster health or stability, I will revisit it at a later date. I had been using the binary executable for version detection in my Ansible upgrade playbook task, but I opted to query the API instead.
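For anyone curious, a version check against the API (a sketch; host, port, and any security settings will differ) avoids launching a second JVM on the host at all:

  # The root endpoint reports the node's version without starting a new JVM
  curl -s "localhost:9200/?filter_path=version.number"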
