We have been using ES productively for quite some time now, but today we ran into a problem we have never seen before:
We found two indices in our ES 6.8.14 cluster, each with one shard whose primary and replica show different sync_ids and vastly different document counts. I reckon that means they are completely out of sync. What I find hard to understand is how the cluster health can be "green" under these circumstances.
curl -H 'Content-Type: application/json' -XGET "prod-db01:9200/_cat/shards/archive1599113539"

archive1599113539 2 r STARTED 6720 40.4mb 10.0.82.232 prod-db01
archive1599113539 2 p STARTED  656  4.3mb 10.0.82.233 prod-db02
As you can see, the document count is far greater on the replica. Since the cluster has been "green" the whole time, the problem has apparently gone undetected for weeks at least, so our backups are probably unusable as well. On top of that, the usually recommended fix of setting the replica count to 0 and then back to 1 would likely cause significant data loss here, because the replica holds far more documents than the primary.
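For reference, this is the replica-reset procedure I mean (and why I am reluctant to run it: it would rebuild the replica from the smaller primary and discard the extra documents):

```shell
# Drop the replica -- the primary becomes the only remaining copy:
curl -H 'Content-Type: application/json' \
  -XPUT "prod-db01:9200/archive1599113539/_settings" \
  -d '{"index": {"number_of_replicas": 0}}'

# ...wait for the cluster to settle, then recreate the replica from the primary:
curl -H 'Content-Type: application/json' \
  -XPUT "prod-db01:9200/archive1599113539/_settings" \
  -d '{"index": {"number_of_replicas": 1}}'
```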
I am at a loss as to how to deal with this situation, so I would like to ask:
- How can the cluster be green? Should this not be a bug?
- Is there ANY conceivable way of reading or dumping the data in the primary and the replica separately, so that a merge attempt could recover documents that would otherwise be lost?
- How do I prevent this from happening again, and do you have tips for detecting this type of failure?
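To make the second question concrete, the kind of thing I was imagining is a scroll dump pinned to one node via the search `preference` parameter, run once per node. This is purely a sketch; I do not know whether `_only_nodes` preference routing actually guarantees which shard copy is read in 6.8:

```shell
# Assumption: preference=_only_nodes restricts the search to shard copies on the
# named node, so running this once per node would dump each copy separately.
curl -H 'Content-Type: application/json' \
  -XGET "prod-db01:9200/archive1599113539/_search?scroll=1m&preference=_only_nodes:prod-db01" \
  -d '{"size": 1000, "query": {"match_all": {}}}'
```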
Any help or tips would be greatly appreciated.
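In case it is useful, this is how we now check for the mismatch ourselves: a small awk filter over `_cat/shards` output that flags shards whose primary and replica document counts differ. The sample input below is the output from our cluster; in practice one would pipe in `curl -s "prod-db01:9200/_cat/shards"`.

```shell
# Flag started shard copies whose primary and replica doc counts differ.
# Assumes the default _cat/shards column order:
#   index shard prirep state docs store ip node
check_shard_counts() {
  awk '$4 == "STARTED" { docs[$1 FS $2 FS $3] = $5 }
  END {
    for (k in docs) {
      split(k, a, FS)
      if (a[3] == "p") {
        rk = a[1] FS a[2] FS "r"
        if (rk in docs && docs[rk] != docs[k])
          print a[1], "shard", a[2], "primary:", docs[k], "replica:", docs[rk]
      }
    }
  }'
}

check_shard_counts <<'EOF'
archive1599113539 2 r STARTED 6720 40.4mb 10.0.82.232 prod-db01
archive1599113539 2 p STARTED 656 4.3mb 10.0.82.233 prod-db02
EOF
# -> archive1599113539 shard 2 primary: 656 replica: 6720
```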