Out Of Sync Replica

Hi,

We've been getting a few errors recently due to 'out of sync replicas', which are affecting certain cluster operations. The issues seem to tie in with an upgrade from 6.1.2 to 6.2.3, although they could be completely unrelated.

We have a curator job that runs in the early morning each day; however, it has recently been failing due to out of sync replicas on the same particular index:

2018-04-03 01:00:39,310 ERROR     Failed to complete action: close.  <class 'curator.exceptions.FailedExecution'>: Exception encountered.  Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: TransportError(409, u'{"_shards":{"total":44,"successful":41,"failed":3},".monitoring-logstash-6-2018.04.02":{"total":2,"successful":2,"failed":0},".monitoring-kibana-6-2018.04.02":{"total":2,"successful":2,"failed":0},"beats-2018.04.02":{"total":4,"successful":4,"failed":0},"metrics20-2018.04.02":{"total":20,"successful":20,"failed":0},"applicationlogs-2018.04.02":{"total":12,"successful":9,"failed":3,"failures":[{"shard":3,"reason":"out of sync replica; num docs on replica [23209297]; num docs on primary [23209299]","routing":{"state":"STARTED","primary":false,"node":"x9vWEhARS5ef63_YWknIlQ","relocating_node":null,"shard":3,"index":"applicationlogs-2018.04.02","allocation_id":{"id":"4FfgsYKxQUa6yH64AX1ybg"}}},{"shard":1,"reason":"out of sync replica; num docs on replica [23212334]; num docs on primary [23212335]","routing":{"state":"STARTED","primary":false,"node":"XuLt8WgrSDeTBlz4-kC7sg","relocating_node":null,"shard":1,"index":"applicationlogs-2018.04.02","allocation_id":{"id":"0koS5ZM8R0aIiDoXhzqziw"}}},{"shard":5,"reason":"out of sync replica; num docs on replica [23213113]; num docs on primary [23213114]","routing":{"state":"STARTED","primary":false,"node":"1JEjyXKtTQuwOThziCwhMw","relocating_node":null,"shard":5,"index":"applicationlogs-2018.04.02","allocation_id":{"id":"3t_zNITkSwi-VAFQwPmuyg"}}}]},".monitoring-es-6-2018.04.02":{"total":2,"successful":2,"failed":0},"asm-2018.04.02":{"total":2,"successful":2,"failed":0}}')

  "applicationlogs-2018.04.02": {
"total": 12,
"successful": 9,
"failed": 3,
"failures": [
  {
    "shard": 3,
    "reason": "out of sync replica; num docs on replica [23209297]; num docs on primary [23209299]",
    "routing": {
      "state": "STARTED",
      "primary": false,
      "node": "x9vWEhARS5ef63_YWknIlQ",
      "relocating_node": null,
      "shard": 3,
      "index": "applicationlogs-2018.04.02",
      "allocation_id": {
        "id": "4FfgsYKxQUa6yH64AX1ybg"
      }
    }
  },
  {
    "shard": 1,
    "reason": "out of sync replica; num docs on replica [23212334]; num docs on primary [23212335]",
    "routing": {
      "state": "STARTED",
      "primary": false,
      "node": "XuLt8WgrSDeTBlz4-kC7sg",
      "relocating_node": null,
      "shard": 1,
      "index": "applicationlogs-2018.04.02",
      "allocation_id": {
        "id": "0koS5ZM8R0aIiDoXhzqziw"
      }
    }
  },
  {
    "shard": 5,
    "reason": "out of sync replica; num docs on replica [23213113]; num docs on primary [23213114]",
    "routing": {
      "state": "STARTED",
      "primary": false,
      "node": "1JEjyXKtTQuwOThziCwhMw",
      "relocating_node": null,
      "shard": 5,
      "index": "applicationlogs-2018.04.02",
      "allocation_id": {
        "id": "3t_zNITkSwi-VAFQwPmuyg"
      }
    }
  }
]

},

The only topics I've come across online mention that it could be due to deleted documents, but I think that can be ruled out as this index is purely additive; nothing is ever removed. The job above closes the indices (before compression is applied, routing to warm nodes, force merge etc.), and it runs an hour after the index has stopped being written to (as it rolls daily).

This has been happening fairly consistently recently and I'm not sure why -- the current fix has been to reduce replicas to 0, perform the operations needed, then increase replicas back to 1. The settings for the index are below (taken from today's index):

{
"applicationlogs-2018.04.03": {
    "settings": {
        "index": {
            "codec": "best_compression",
            "routing": {
                "allocation": {
                    "require": {
                        "box_type": "hot"
                    }
                }
            },
            "refresh_interval": "30s",
            "number_of_shards": "6",
            "translog": {
                "flush_threshold_size": "1gb",
                "sync_interval": "15s",
                "durability": "async"
            },
            "provided_name": "applicationlogs-2018.04.03",
            "creation_date": "1522713600418",
            "store": {
                "type": "mmapfs"
            },
            "unassigned": {
                "node_left": {
                    "delayed_timeout": "20m"
                }
            },
            "number_of_replicas": "1",
            "uuid": "UKteUDw_Rl-pimKXiwTwpg",
            "version": {
                "created": "6020399"
            }
        }
    }
}

I've also verified the cluster was green at the time of the errors, and there were no ongoing rebalance operations (from looking at the monitoring page).
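
(For reference, the same thing can also be checked via the APIs rather than the monitoring page -- the calls below are an illustrative sketch, not the exact commands used:)

GET _cluster/health

GET _cat/recovery?active_only=true&v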

Any help would be much appreciated

Cheers,
Mike


I think the change in behaviour that you've seen is a consequence of #28464, which was merged into 6.2.2. Prior to that change we were lenient in our handling of a document count mismatch when performing a synced flush, so although the replicas may have been out of sync, this didn't raise any errors until much later on.
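
For context, the check fires during the synced flush that (as I understand it) Curator's close action performs before closing the index, i.e. roughly the equivalent of the following, with the index name just an example:

POST applicationlogs-2018.04.02/_flush/synced
POST applicationlogs-2018.04.02/_close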

The workaround (reducing replicas to 0 and then back to 1) rebuilds all the replicas which brings them back into sync. You can increase the number of replicas back to 1 before taking any other actions.
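
In API terms this is just toggling the replica count with the update index settings API, e.g. (index name is an example):

PUT applicationlogs-2018.04.02/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}

PUT applicationlogs-2018.04.02/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}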

However, we would like to dig into why this is happening some more. The replica is, for some reason, falling out of sync with the primary, and this is not supposed to happen. Can you grab the logs from all of your nodes from 2018-04-01, 2018-04-02 and 2018-04-03, zip them up and email them to me (firstname.lastname@elastic.co)?

You say that this index is write-only - do you mean that you only index documents in and never update or delete them? Do you use auto-generated IDs or do you provide your own? Do you use any versioned operations? I take it from the name that these are application logs so I can predict the answers to these questions, but it's good to know for sure.

Roughly how often do you see this occur? We may want to ask you to increase some logging verbosity and wait for it to happen again.
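
If we do ask for that, it would typically be done dynamically via the cluster settings API; a minimal sketch (the exact logger names would depend on what we decide to trace):

PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.index": "DEBUG"
  }
}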

Hi David,

I've sent over some of the logs as requested, although as mentioned in the email we don't have the raw logs as they're shipped straight to Elastic.

you only index documents in and never update or delete them?

Correct

Do you use auto-generated IDs or do you provide your own?

Autogenerated

Do you use any versioned operations?

Assuming this is what you mean: Index API | Elasticsearch Guide [8.11] | Elastic, we don't do anything like that.

Roughly how often do you see this occur?

We've been seeing it on a near daily basis on a couple of different indices. I'd say since upgrading a week or so ago we've seen it every day except one or two.

Thanks for looking into this,
Mike

I don't seem to have received anything by email. Did it bounce, perhaps because it exceeded a size limit?

Just mailed attempt #2 :slight_smile:

Received, and shared with the team, thanks.

Thanks @Evesy for sharing. I'll post the relevant parts here so other people can benefit.

In addition to the document count discrepancy, there is also a worrisome symptom in the stats: a completion marker called the local checkpoint (all ops with a seq number at or below it have been processed by the shard copy) is lagging on the replica:

    "3": [
        {
          "primary": true,
          "seq_no": {
            "max_seq_no": 20495054,
            "local_checkpoint": 20495054,
            "global_checkpoint": 10280512
          }
        },
        {
          "primary": false,
          "seq_no": {
            "max_seq_no": 20495054,
            "local_checkpoint": 10280512,  <--- THIS
            "global_checkpoint": 10280512
          }
        }
      ],
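
(For reference, these numbers come from the shard-level index stats -- something like the call below will show them; the filter_path is optional and just trims the response, and the index name is an example:)

GET applicationlogs-2018.04.02/_stats?level=shards&filter_path=indices.*.shards.*.routing,indices.*.shards.*.seq_no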

Can you please double check that the docs with a seq# just above the lagging local_checkpoint are actually missing? You can do this with a search like this:

GET INDEX_HERE/_search
{
  "query": {
    "range": {
      "_seq_no": {
        "gte": 10280511
      }
    }
  },
  "sort": [
    {
      "_seq_no": {
        "order": "asc"
      }
    }
  ],
  "size": 5
}

Hey Boaz,

I will be able to grab the results of a search from an index with a lagging checkpoint today or tomorrow for you, as the original index I posted about has been 'fixed' by scaling replicas down and back up again.

I do have another topic open about a separate issue with similar symptoms (Replica Shards Much Larger Than Primaries) that might be useful for you.

Cheers,
Mike

Thanks @Evesy. I forgot to mention that you can use the preference parameter to direct the search at the right shard and the right shard copy - https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-preference.html

Hey @bleskes

Apologies for not getting back sooner; we've been (un)fortunate in that we haven't hit the out of sync replicas issue in the last couple of days. However, I'm going to enable best_compression on the main culprit index, as that's what we mainly had the codec set to when facing these issues.

I have, however, had additional instances of local checkpoints on replicas lagging severely behind, so I've included an example below (the replicas weren't 'out of sync' though, as I was able to successfully close the index without any complaints).

Shard stats (https://gist.github.com/Evesy/ace6bc72a838073d9406f71618ccaaa7)

"5": [
  {
    "primary": true,
    "translog": {
      "operations": 9735702,
      "size_in_bytes": 12959541735,
      "uncommitted_operations": 60867,
      "uncommitted_size_in_bytes": 74505635
    },
    "seq_no": {
      "max_seq_no": 19536505,
      "local_checkpoint": 19536505,
      "global_checkpoint": 9824904
    },
    "flush": {
      "total": 24,
      "total_time_in_millis": 11578
    }
  },
  {
    "primary": false,
    "translog": {
      "operations": 10227174,
      "size_in_bytes": 13677234938,
      "uncommitted_operations": 9720880,
      "uncommitted_size_in_bytes": 12938974155
    },
    "seq_no": {
      "max_seq_no": 19536504,
      "local_checkpoint": 9824904,
      "global_checkpoint": 9824904
    },
    "flush": {
      "total": 11628,
      "total_time_in_millis": 17479466
    }
  }
]

The index was still receiving documents at the time but I did a normal search (against shard 5) from sequence number 9824903, and then a search aimed at the replica:

Shard search output: https://gist.github.com/Evesy/0a5aa0cc784a5ca7d3be47ba3a49eb39

https://{{url}}/applicationlogs-2018.04.10/_search?preference=_shards:5

"hits": {
  "total": 9756942,

Replica shard search output: https://gist.github.com/Evesy/6323892e141e109384dbb890c0e6fecf

https://{{url}}/applicationlogs-2018.04.10/_search?preference=_shards:5|_replica

"hits": {
  "total": 9742914,

Not sure this is too helpful, but the replica shard seems to have at least a similar number of docs to the primary; however, when I've had the out of sync issues it's always only been a matter of <10 docs difference.

Hopefully this issue occurs again soon and I can capture relevant information.

Cheers,
Mike

Thanks Mike.

Sadly the operation with seq# 9824905 (this is the sort value) is missing from both searches. I wanted to see if it exists on the primary. Since you didn't use a node/primary preference on the first search, I can't be sure where the results came from and whether 9824905 is missing everywhere or just on the replica. Can you please repeat?

Hopefully this one is better!

Shard stats: https://gist.github.com/Evesy/67db26b42cc02989167e2eaab1779864

  "2": [
    {
      "primary": true,
      "translog": {
        "operations": 22620589,
        "size_in_bytes": 29982120755,
        "uncommitted_operations": 140959,
        "uncommitted_size_in_bytes": 185273495
      },
      "seq_no": {
        "max_seq_no": 22620588,
        "local_checkpoint": 22620588,
        "global_checkpoint": 692983
      },
      "flush": {
        "total": 28,
        "total_time_in_millis": 12042
      }
    },
    {
      "primary": false,
      "translog": {
        "operations": 22620588,
        "size_in_bytes": 29982967130,
        "uncommitted_operations": 21961385,
        "uncommitted_size_in_bytes": 29177640646
      },
      "seq_no": {
        "max_seq_no": 22620588,
        "local_checkpoint": 692983,
        "global_checkpoint": 692983
      },
      "flush": {
        "total": 20121,
        "total_time_in_millis": 49395113
      }
    }
  ]

Performed a search against Shard 2 with the below payload:

{
  "query": {
    "range": {
      "_seq_no": {
        "gte": 692982
      }
    }
  },
  "sort": [
    {
      "_seq_no": {
        "order": "asc"
      }
    }
  ],
  "size": 1
}

Results for the two queries:

/_search?preference=_shards:2|_replica

/_search?preference=_shards:2|_primary

Thanks. Sadly, since you've set the size to 1, we can't tell what's up with the seq# at local checkpoint + 1 = 692984. Can you try again?

@bleskes Sorry about that, it's only just clicked what you're looking for.

I've updated the previous posts & gists accordingly (source stripped out due to sensitive data). It does indeed seem 692984 is present on the primary but missing on the replica.

Thanks again

OK. So this confirms it is not an accounting error and that the doc is really missing. Any chance you can correlate when the document with seq# 692984 was indexed and try to see what happened in the cluster at that time? I'm looking for things like shard recoveries/relocations, network issues, etc.
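
One way to find its rough timestamp is to pull the neighbouring documents from the primary copy and look at their time field -- this is only a sketch, as I'm assuming the documents carry a @timestamp field and reusing the index/shard from above:

GET applicationlogs-2018.04.12/_search?preference=_shards:2|_primary
{
  "query": {
    "range": {
      "_seq_no": {
        "gte": 692983,
        "lte": 692985
      }
    }
  },
  "sort": [
    {
      "_seq_no": {
        "order": "asc"
      }
    }
  ],
  "_source": ["@timestamp"],
  "size": 5
}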

Also, can you share your index and node settings with me? You can send them to my name at elastic.co.

I've sent you index/node settings

With regard to ongoing cluster operations at the time: the documents on either side of the missing document have a timestamp of ~01:05 in the morning, and there was no delay in Logstash processing during that time, so I think it's safe to assume the document was indexed into Elasticsearch within a minute of that time.

We kick off a daily curator run at 1:00, so some things that will have happened around that time:

  • A number of indices deleted (living on warm nodes)
  • Some of the previous days' indices closed, codec changed & reopened (living on hot nodes)
  • Allocation settings updated on a number of indices to move them from hot to warm nodes (a rough API-level sketch of these steps is below)

(The index we see the issue with is on the hot nodes)

So it's safe to say there will have been a lot of relocation and network activity at the time this document was indexed.
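
A rough API-level sketch of those steps (the index names are just examples; note the codec can only be changed while an index is closed):

DELETE applicationlogs-2018.03.29

POST applicationlogs-2018.04.11/_close

PUT applicationlogs-2018.04.11/_settings
{
  "index": {
    "codec": "best_compression"
  }
}

POST applicationlogs-2018.04.11/_open

PUT applicationlogs-2018.04.11/_settings
{
  "index": {
    "routing.allocation.require.box_type": "warm"
  }
}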

We have, however, seen this on an index that initially started with 0 replicas and was increased to 1 in the morning (after curator will have completed); I couldn't say for sure about any cluster recoveries/relocations in that instance though.

Cheers,
Mike

Was this done on the index in question?

Nope, just the previous days' indices.

Regarding the stats I sent yesterday for applicationlogs-2018.04.12, our curator failed this morning as the documents were still missing:

"applicationlogs-2018.04.12":{"total":12,"successful":10,"failed":2,"failures":[{"shard":2,"reason":"out of sync replica; num docs on replica [29982357]; num docs on primary [29982358]","routing":{"state":"STARTED","primary":false,"node":"1JEjyXKtTQuwOThziCwhMw","relocating_node":null,"shard":2,"index":"applicationlogs-2018.04.12","allocation_id":{"id":"KDwNuAj9T2Wref2DD_HgLw"}}},{"shard":1,"reason":"out of sync replica; num docs on replica [30019545]; num docs on primary [30019547]","routing":{"state":"STARTED","primary":false,"node":"XuLt8WgrSDeTBlz4-kC7sg","relocating_node":null,"shard":1,"index":"applicationlogs-2018.04.12","allocation_id":{"id":"47CzuxN6Tq2AcqHFAyCkMg"}}}]},"

Shard 2's local checkpoint is still at the same point as it was in the stats above; the document with sort value 692984 is still missing on the replica.
Shard 1 has a similar problem but with two missing documents -- not sure what the second document is though.
Shard 1has a similar problem but with two missing documents -- Not sure what the second document is though.