Rolling upgrade issue - ES 2.1.1 to ES 2.4.1

Hi there,
We are doing a rolling upgrade of our clusters from ES 2.1.1 to ES 2.4.1. Once we move the second data node to ES 2.4.1, not all shards on that node come up; some remain unassigned and the cluster stays yellow. Because our upgrade logic does not proceed to the next node until the cluster is green, we stay stuck in yellow for days.
Any inputs?

Once a primary shard is allocated to a 2.4.1 node, a replica for that shard cannot be allocated to a 2.1.1 node anymore as the primary on the new node might have already written segments that use a new postings format or codec that is not available on the lower-version node. Can you check if this is the scenario you're seeing?
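One rough way to check for this is to compare the output of `GET _cat/shards?h=index,shard,prirep,state,node` with `GET _cat/nodes?h=version,name`. The sketch below only parses text you have already fetched; the helper names, the `logs` index, and the sample output are made up for illustration, and it assumes plain numeric version strings such as `2.4.1`.

```python
# Sketch: flag unassigned replicas whose primary sits on a newer node.
# Helper names and sample data are hypothetical, not part of Elasticsearch.

def parse_cat_shards(text):
    """Parse text from GET _cat/shards?h=index,shard,prirep,state,node."""
    rows = []
    for line in text.strip().splitlines():
        # Node names may contain spaces, so keep the fifth field whole.
        parts = line.split(None, 4)
        if len(parts) < 4:
            continue
        index, shard, prirep, state = parts[:4]
        # Unassigned shards have no node column.
        node = parts[4].strip() if len(parts) > 4 else None
        rows.append({"index": index, "shard": shard, "prirep": prirep,
                     "state": state, "node": node})
    return rows

def parse_cat_nodes(text):
    """Parse text from GET _cat/nodes?h=version,name into {name: version}."""
    versions = {}
    for line in text.strip().splitlines():
        version, name = line.split(None, 1)
        versions[name.strip()] = version
    return versions

def version_tuple(version):
    # Assumes purely numeric versions like "2.4.1".
    return tuple(int(part) for part in version.split("."))

def stuck_replicas(shards, versions):
    """List (index, shard, primary_node, primary_version) for every
    UNASSIGNED replica whose started primary is on a node newer than
    the oldest node in the cluster."""
    oldest = min(version_tuple(v) for v in versions.values())
    primaries = {(s["index"], s["shard"]): s["node"] for s in shards
                 if s["prirep"] == "p" and s["state"] == "STARTED"}
    found = []
    for s in shards:
        if s["prirep"] != "r" or s["state"] != "UNASSIGNED":
            continue
        pnode = primaries.get((s["index"], s["shard"]))
        if pnode in versions and version_tuple(versions[pnode]) > oldest:
            found.append((s["index"], s["shard"], pnode, versions[pnode]))
    return found

# Made-up sample output for a three-node cluster in mid-upgrade.
SAMPLE_SHARDS = """
logs 1 p STARTED D3
logs 1 r STARTED D1
logs 1 r STARTED D2
logs 2 p STARTED D1
logs 2 r STARTED D2
logs 2 r UNASSIGNED
logs 3 p STARTED D3
logs 3 r STARTED D1
logs 3 r STARTED D2
"""
SAMPLE_NODES = """
2.4.1 D1
2.4.1 D2
2.1.1 D3
"""

if __name__ == "__main__":
    shards = parse_cat_shards(SAMPLE_SHARDS)
    versions = parse_cat_nodes(SAMPLE_NODES)
    for hit in stuck_replicas(shards, versions):
        print(hit)
```

If every unassigned replica shows up in this list, the version mismatch described above is the likely cause.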

Yup, we understand the issue now!

Closing the thread with the reason we ran into this issue:

We have 3 data nodes and 2 replicas for each shard, so each data node holds one copy of every shard. To explain our issue with an example:

  • Assume the data nodes hold the following (R means replica, P means primary, and the number after P/R is the shard number)
    • D1 – P1, R2, R3
    • D2 – R1, P2, R3
    • D3 – R1, R2, P3
  • D1 is taken out of the cluster and upgraded
    • P1, R2, R3 are not available
    • D3 is asked to promote R1 to P1
  • D1 comes back
    • R2 and R3 were replicas and get initialized
    • The former P1 is now initialized as R1, since D3 holds P1
    • Both of the above are possible because a primary on the old version with replicas on the new version is fine
  • D2 is taken out of the cluster and upgraded
    • R1, P2, R3 are not available
    • D1 is asked to promote R2 to P2
    • D3’s R2 is no longer available, because D3’s ES version is lower than D1’s
  • D2 comes back
    • R1 and R3 were replicas and get initialized. This is possible because their primaries are on a node whose ES version is lower than D2’s.
    • The former P2 is initialized as R2, since D1 now holds P2 and both nodes run the same ES version
    • D3’s R2 stays unassigned until D3 is upgraded, so the cluster remains yellow
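
The sequence above can be replayed with a toy model. This is only a sketch under one assumption taken from this thread: a replica can be allocated to a node only if that node's version is at least the version of the node holding the primary. The function names are made up, and which replica gets promoted at each step is copied from the example rather than computed.

```python
# Toy replay of the 3-node, 3-copy example. Not Elasticsearch's allocator:
# the only rule modelled is "replica node version >= primary node version".

def version_tuple(version):
    # Assumes purely numeric versions like "2.4.1".
    return tuple(int(part) for part in version.split("."))

def can_host_replica(primary_version, target_version):
    """True if a node on target_version may hold a replica of a primary
    that lives on a node running primary_version (assumed rule)."""
    return version_tuple(target_version) >= version_tuple(primary_version)

def blocked_replicas(versions, primaries):
    """versions: {node: version}; primaries: {shard: node holding primary}.
    Every other node is expected to hold a replica (nodes == copies).
    Returns the (shard, node) pairs whose replica cannot be allocated."""
    blocked = []
    for shard, pnode in sorted(primaries.items()):
        for node in sorted(versions):
            if node == pnode:
                continue
            if not can_host_replica(versions[pnode], versions[node]):
                blocked.append((shard, node))
    return blocked

def replay():
    versions = {"D1": "2.1.1", "D2": "2.1.1", "D3": "2.1.1"}
    primaries = {1: "D1", 2: "D2", 3: "D3"}
    steps = {"start": blocked_replicas(versions, primaries)}

    # D1 leaves, D3 promotes R1 to P1, D1 returns on 2.4.1.
    primaries[1] = "D3"
    versions["D1"] = "2.4.1"
    steps["after D1"] = blocked_replicas(versions, primaries)

    # D2 leaves, D1 promotes R2 to P2, D2 returns on 2.4.1.
    primaries[2] = "D1"
    versions["D2"] = "2.4.1"
    steps["after D2"] = blocked_replicas(versions, primaries)

    # Hypothetical: if D3 were upgraded too, all versions match, so the
    # rule blocks nothing wherever the primaries end up.
    versions["D3"] = "2.4.1"
    steps["after D3"] = blocked_replicas(versions, primaries)
    return steps

if __name__ == "__main__":
    for step, blocked in replay().items():
        print(step, blocked)
```

In this model the only blocked copy after the second upgrade is shard 2 on D3. Upgrading D3 would clear it, but a wait-for-green gate never lets the upgrade reach D3.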

This will not happen when the number of nodes in the cluster is not equal to the number of shard copies.

Thank you for taking the time to write up the explanation.
