Cluster "green" - but shards not in sync !? 😯

Hello all,

We have been running ES in production for quite some time, but today we came across a problem that we have never seen before:

We found two indices in our ES 6.8.14 cluster, each with one shard where primary and replica show different sync_ids and vastly different document counts. I reckon that means they are totally out of sync. I am having a hard time understanding how the cluster health can be "green" under these circumstances, though.

curl -H 'Content-Type: application/json' -XGET "prod-db01:9200/_cat/shards/archive1599113539"
archive1599113539 2 r STARTED 6720 40.4mb 10.0.82.232 prod-db01
archive1599113539 2 p STARTED  656  4.3mb 10.0.82.233 prod-db02

As you can see, the document count is far greater on the replica. Since the problem appears to have gone undetected for at least several weeks (the cluster was "green" the whole time), our backups are probably unusable. Also, the usually recommended fix of setting replicas to 0 and then back to 1 would probably cause significant data loss, because the replica holds more documents than the primary.
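A comparison like the one above can be automated. The following is only a minimal sketch: it assumes the default column order of _cat/shards (index, shard, prirep, state, docs, ...), and the function name find_doc_count_mismatches is made up for illustration. Copies of a shard refresh independently, so small, short-lived differences can be normal on an index that is being written to; a large or persistent gap is the suspicious case.

```python
from collections import defaultdict


def find_doc_count_mismatches(cat_shards_output):
    """Return {(index, shard): [(prirep, docs), ...]} for shards whose
    started copies report different document counts.

    Assumes the default _cat/shards column order:
    index shard prirep state docs store ip node
    """
    copies = defaultdict(list)
    for line in cat_shards_output.splitlines():
        fields = line.split()
        # Unassigned or initializing copies have no usable doc count: skip them.
        if len(fields) < 5 or fields[3] != "STARTED":
            continue
        index, shard, prirep, _state, docs = fields[:5]
        copies[(index, int(shard))].append((prirep, int(docs)))

    return {
        key: found
        for key, found in copies.items()
        if len({docs for _prirep, docs in found}) > 1
    }


sample = """\
archive1599113539 2 r STARTED 6720 40.4mb 10.0.82.232 prod-db01
archive1599113539 2 p STARTED  656  4.3mb 10.0.82.233 prod-db02
"""
mismatches = find_doc_count_mismatches(sample)
```

Fed with the two lines shown above, the sketch reports shard 2 of archive1599113539 with both copies and their counts.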

I am sort of clueless how to deal with the situation and ask you:

  1. How can the cluster be green? Is this not a bug?
  2. Is there any conceivable way to inspect or dump the data in the primary and the replica separately, so that we can attempt a merge and recover data that would otherwise be lost?
  3. How do I prevent this from happening again, and do you have tips for detecting this type of failure?

Any help or tips would be greatly appreciated.

Can you share GET /archive1599113539/_stats?level=shards please?

Sorry but the output you requested exceeds the max size for a posting here.

You can find the data you requested here: shard stats - Pastebin.com

Thanks, yes, that looks broken. How many master nodes are there in your cluster? How is discovery.zen.minimum_master_nodes configured?

There seems to be no such setting at all. It is a three-node cluster. All nodes have the roles 'master', 'data' and 'ingest' set.
These are the contents of the discovery setting:

        "discovery" : {
          "zen" : {
            "ping" : {
              "unicast" : {
                "hosts" : [
                  "10.0.82.232",
                  "10.0.82.233",
                  "10.0.82.234"
                ]
              }
            }
          }
        },

Ok if you have not set discovery.zen.minimum_master_nodes then that would explain it. There should be warnings about it in your logs, looking like this:

value for setting "discovery.zen.minimum_master_nodes" is too low. This can result in data loss!

Thank you, set it to 2 now. It was commented out for reasons we could not reconstruct.
I found one such warning in a log backup from 2020-06 :)
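For reference, the value 2 matches the majority rule documented for this setting in pre-7.x releases: (number of master-eligible nodes / 2) + 1, using integer division. A tiny sketch of that arithmetic (the function name is made up for illustration):

```python
def minimum_master_nodes(master_eligible_nodes):
    """Majority quorum: more than half of the master-eligible nodes."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


# Three master-eligible nodes, as in this cluster.
quorum = minimum_master_nodes(3)
```

With three master-eligible nodes this gives 2, so a single node cut off from the other two cannot elect itself master.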

Any more ideas about what can be done to recover as much data as possible? :)

Nothing very easy or robust, sorry. You could try using search preference to extract the contents of each shard so you can compare them. You'll need to do that for every shard, even the ones with matching doc counts, just to check that they really have the same docs in them and nothing got messed up in their mappings either.
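A rough sketch of that approach follows, under several assumptions: the preference string combines _shards with _only_nodes as described in the 6.x search preference documentation (please verify it against your exact version), the node names are the ones shown by _cat/shards, and every function name here is made up for illustration. It scrolls through one shard copy at a time, collects the documents by _id, and then diffs the two dumps. For large shards you would write to disk instead of holding everything in memory.

```python
import json
import urllib.parse
import urllib.request


def shard_copy_search_url(base_url, index, shard, node, scroll="1m"):
    """Build a scrolling _search URL pinned to one shard copy on one node."""
    preference = f"_shards:{shard}|_only_nodes:{node}"
    query = urllib.parse.urlencode({"preference": preference, "scroll": scroll})
    return f"{base_url}/{index}/_search?{query}"


def _post_json(url, body):
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def dump_shard_copy(base_url, index, shard, node, page_size=1000):
    """Return {_id: _source} for every document visible in one shard copy."""
    docs = {}
    page = _post_json(
        shard_copy_search_url(base_url, index, shard, node),
        {"size": page_size, "sort": ["_doc"], "query": {"match_all": {}}},
    )
    while page["hits"]["hits"]:
        for hit in page["hits"]["hits"]:
            docs[hit["_id"]] = hit.get("_source")
        page = _post_json(
            f"{base_url}/_search/scroll",
            {"scroll": "1m", "scroll_id": page["_scroll_id"]},
        )
    return docs


def compare_copies(primary_docs, replica_docs):
    """Classify document ids by where they exist and whether they differ."""
    shared = primary_docs.keys() & replica_docs.keys()
    return {
        "only_primary": sorted(primary_docs.keys() - replica_docs.keys()),
        "only_replica": sorted(replica_docs.keys() - primary_docs.keys()),
        "differing": sorted(i for i in shared if primary_docs[i] != replica_docs[i]),
    }
```

For the shard in this thread, one would call dump_shard_copy("http://prod-db01:9200", "archive1599113539", 2, "prod-db02") for the primary and the same with "prod-db01" for the replica, then pass both results to compare_copies. Deciding which version of a differing document to keep remains a manual judgment.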

Thanks again! It looks like we are going to be able to recover most of the data.

I'd like to get back to the original question, though: I still fail to comprehend why inconsistent replicas are not sufficient to trigger a health warning.

They sort of are. The cluster health depends only on whether the shards are assigned, and the assignment process includes checks to make sure that all the copies are in sync. Unfortunately, if discovery.zen.minimum_master_nodes is configured wrongly, the information about which copies are in sync can itself end up out of sync, and there is not really a way to address that in general.
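To make the first point concrete, here is a deliberately simplified model of the health colour, not Elasticsearch's actual implementation: it looks only at whether each copy is active, and document counts never enter into it. The function name and the input shape are made up for illustration.

```python
ACTIVE_STATES = {"STARTED", "RELOCATING"}


def health_colour(shard_copies):
    """Simplified model: shard_copies is an iterable of (prirep, state).

    red    = at least one primary is not active
    yellow = all primaries active, at least one replica is not
    green  = every copy is active
    """
    copies = list(shard_copies)
    if any(p == "p" and s not in ACTIVE_STATES for p, s in copies):
        return "red"
    if any(s not in ACTIVE_STATES for _p, s in copies):
        return "yellow"
    return "green"


# The two copies from the _cat/shards output at the top of the thread.
colour = health_colour([("r", "STARTED"), ("p", "STARTED")])
```

Both copies in the original output are STARTED, so this model, like the real cluster, reports green no matter how far their contents have diverged.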

This is fixed in 7.x, in the sense that it is no longer possible to misconfigure Elasticsearch to lose data in this fashion.