Out Of Sync Replica

Ok. Thanks. I don't expect the problem to fix itself once it occurs. The documents have simply failed to arrive at the replica.

I'm still looking at the data you sent. Two more questions in the meantime:

  1. How often does the issue occur - is it daily?
  2. I see your translog durability is set to async. Is it an option for you to shift to sync and see if it helps?

One more ask - can you share the output of GET INDEX/_recovery?detailed&human for the index in question? I'm looking for the recovery information of the replica that's missing an op.
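
For reference, a rough sketch of both asks (INDEX is a placeholder for your index name; the durability change is a dynamic index setting):

curl -XPUT "localhost:9200/INDEX/_settings" -H 'Content-Type: application/json' -d'
{ "index.translog.durability": "request" }'

curl -XGET "localhost:9200/INDEX/_recovery?detailed&human"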

How often does the issue occur - is it daily?

I'd say the issue has been happening on about 50% of the days since we upgraded. On the last couple of days we didn't see the issue; I think the codec was default then, though it's back to best compression now.

I see your translog durability is set to async. Is it an option for you to shift to sync and see if it helps?

Will do, although for our bigger indices we wouldn't be able to run without async full time; async helps a lot with IO.

can you share the output of GET INDEX/_recovery?detailed&human for the index in question?

New index but similar issue, document with sort 15884958 (Shard 2) is missing:

GET /_stats?level=shards
GET /_recovery?detailed&human

Thanks. Your recovery info shows that the replica was recovered while it was empty and was never relocated. Primary relocation also didn't happen. That helps tremendously in narrowing down the search space.

On the last couple of days we didn't see the issue; I think the codec was default then, though it's back to best compression now.

It would be very surprising if the codec was at fault.
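
For anyone following along, the codec being discussed is the index.codec setting (default vs. best_compression). A rough sketch of flipping an existing index back to the default codec; note the setting is static, so the index has to be closed first (INDEX is a placeholder):

curl -XPOST "localhost:9200/INDEX/_close"
curl -XPUT "localhost:9200/INDEX/_settings" -H 'Content-Type: application/json' -d'
{ "index.codec": "default" }'
curl -XPOST "localhost:9200/INDEX/_open"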

Will do, although for our bigger indices we wouldn't be able to run without async full time; async helps a lot with IO.

Thanks!

Keep us posted.

@Evesy how are things going? Is this still an issue with the new settings?

Hey Boaz,

We've definitely observed the issue less when using the default codec rather than best compression, but we do still see it a couple of times a week.

I haven't had a chance yet to disable async, as I want to make sure I'm around to monitor things when we change it -- but I will have it in place for tomorrow, so I'll see how things go over the weekend etc.

Cheers,
Mike

Thanks Mike.

We got the same error when using version 6.2.2; here is the synced flush result:

    {
      "_shards": {
        "total": 10,
        "successful": 5,
        "failed": 5
      },
      "test_binlog_event_2018-04-24": {
        "total": 10,
        "successful": 5,
        "failed": 5,
        "failures": [
          {
            "shard": 3,
            "reason": "out of sync replica; num docs on replica [87]; num docs on primary [97]",
            "routing": {
              "state": "STARTED",
              "primary": false,
              "node": "SXrqPc2XQPCF68uTI9BnYg",
              "relocating_node": null,
              "shard": 3,
              "index": "test_binlog_event_2018-04-24",
              "allocation_id": {
                "id": "a3bV_lHRRoikp5p4k9vmvw"
              }
            }
          },
          {
            "shard": 1,
            "reason": "out of sync replica; num docs on replica [98]; num docs on primary [119]",
            "routing": {
              "state": "STARTED",
              "primary": false,
              "node": "j88PNMEHQKKja18HsHlXqw",
              "relocating_node": null,
              "shard": 1,
              "index": "test_binlog_event_2018-04-24",
              "allocation_id": {
                "id": "RaKMm3cfTiqqwtmPlkIQRg"
              }
            }
          },
          {
            "shard": 0,
            "reason": "out of sync replica; num docs on replica [105]; num docs on primary [115]",
            "routing": {
              "state": "STARTED",
              "primary": false,
              "node": "BHGBPYvyRxCdallV4VzZoQ",
              "relocating_node": null,
              "shard": 0,
              "index": "test_binlog_event_2018-04-24",
              "allocation_id": {
                "id": "aS-UZCNyRw21gnRVWomruw"
              }
            }
          },
          {
            "shard": 4,
            "reason": "out of sync replica; num docs on replica [99]; num docs on primary [106]",
            "routing": {
              "state": "STARTED",
              "primary": false,
              "node": "B3KrLbcCTdmMlrXnw6XNEw",
              "relocating_node": null,
              "shard": 4,
              "index": "test_binlog_event_2018-04-24",
              "allocation_id": {
                "id": "T41VwaIISmuLTMvpDFcPjA"
              }
            }
          },
          {
            "shard": 2,
            "reason": "out of sync replica; num docs on replica [114]; num docs on primary [122]",
            "routing": {
              "state": "STARTED",
              "primary": false,
              "node": "of9qRxVKToGlra8WJaN-0Q",
              "relocating_node": null,
              "shard": 2,
              "index": "test_binlog_event_2018-04-24",
              "allocation_id": {
                "id": "PKFLj7fAR4SU4i1SEu7yYQ"
              }
            }
          }
        ]
      }
    }

Any ideas on how to fix this?

The total doc count on the primary shards is 559; on the replicas there are 503 docs.

GET test_binlog_event_2018-04-24/_search?preference=_primary
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 559,
    "max_score": 1,
    "hits": [
      {
      ...
       }
      ]
  }
}
GET test_binlog_event_2018-04-24/_search?preference=_replica
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 503,
    "max_score": 1,
    "hits": [
      {
      ...
      }
      ]
  }
}
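
For reference, something like the cat shards API gives the same comparison per shard copy:

curl -XGET "localhost:9200/_cat/shards/test_binlog_event_2018-04-24?v&h=index,shard,prirep,docs,node"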

Also, the seq_no local checkpoint on replica shards 2 and 4 is -1 (maybe on more shards):
shard 2 stats https://gist.github.com/Lunatictwo/ddbe9be06290ba9b3b6872f38bbe7b16
shard 4 stats https://gist.github.com/Lunatictwo/3060a81480fdab8b58b4516f83a84074
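
(The gists above come from the shard-level stats API, something along the lines of the call below; the seq_no section of each shard copy is where the local checkpoint shows up.)

curl -XGET "localhost:9200/test_binlog_event_2018-04-24/_stats?level=shards"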

Setting the replica count to 0 and then back to 1 fixes this error (roughly as shown below), but I want to know whether it's a bug in version 6.2.2. If you need more info, I'd be glad to provide it. @bleskes
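
The workaround, roughly:

curl -XPUT "localhost:9200/test_binlog_event_2018-04-24/_settings" -H 'Content-Type: application/json' -d'
{ "index.number_of_replicas": 0 }'

curl -XPUT "localhost:9200/test_binlog_event_2018-04-24/_settings" -H 'Content-Type: application/json' -d'
{ "index.number_of_replicas": 1 }'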

Thanks!

Steven

Thanks @lunatic. I want to double check - when you made the synced flush call, was indexing stopped? If not, it's expected to fail. Regarding the searches - was indexing stopped there too, and did you run a refresh?
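
In other words, with indexing fully stopped, a sequence along these lines should succeed on every shard copy:

curl -XPOST "localhost:9200/test_binlog_event_2018-04-24/_refresh"
curl -XPOST "localhost:9200/test_binlog_event_2018-04-24/_flush/synced"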

I presume that the above is OK. Can you reproduce this reliably? Since your index is so small, we can enable trace logging, which will help immensely in finding the source of the problem. If so, please run the following command on the cluster and try to reproduce. When you do, please send the logs to first name at elastic.co, and please also send the shard stats so I will know which operations to look at.

This is the command to enable the logs:

curl -XPUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "logger.org.elasticsearch.action.bulk": "TRACE",
    "logger.org.elasticsearch.action.support.replication": "TRACE", 
    "logger.org.elasticsearch.cluster.service": "TRACE",
    "logger.org.elasticsearch.indices.recovery": "TRACE", 
    "logger.org.elasticsearch.index.shard": "TRACE"
  }
}'
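
Once you've captured a reproduction, the levels can be reset by nulling the same settings, e.g.:

curl -XPUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "logger.org.elasticsearch.action.bulk": null,
    "logger.org.elasticsearch.action.support.replication": null,
    "logger.org.elasticsearch.cluster.service": null,
    "logger.org.elasticsearch.indices.recovery": null,
    "logger.org.elasticsearch.index.shard": null
  }
}'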

Thanks @bleskes.
Here are replies of your questions:

When you made the synced flush call, was indexing stopped?
A: Yes, indexing was stopped when we flushed.

Was indexing stopped there too, and did you run a refresh?
A: I ran a refresh but still got the errors.

I will try to reproduce the error and I'll send you the log later.

@bleskes hi Boaz. I've sent you the logs.

@Evesy the logs @lunatic supplied give me enough of a thread to pull on, and I think I've found the source of the issue. There's no need for you to lose many hours on experiments, but you can help confirm this: can you check your logs for failures around mapping updates? Grep for put-mapping or put mappings.
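
Something like this against the server logs should be enough (the log path is just an example; adjust for your install):

grep -iE "put-mapping|put mappings" /var/log/elasticsearch/*.log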

@bleskes I get 4 hits for put-mapping over the last 14 days, all with messages pretty much equivalent to the one below:

[2018-04-14T00:51:49,666][WARN ][o.e.c.s.MasterService ] [elasticsearch-master-0] cluster state update task [put-mapping[application-logs]] took [30.1s] above the warn threshold of 30s

@Evesy @lunatic thank you for the research. I opened https://github.com/elastic/elasticsearch/pull/30244 with a proposed fix. It needs to be reviewed but you can follow progress on github.

I see it, thanks @bleskes

@bleskes Cheers for looking into this -- do you think this & Replica Shards Much Larger Than Primaries are down to the same root cause (as we do seem to notice they happen together), or does it look like a separate issue to you?

Thanks

It might. A lagging local checkpoint will prevent translog cleanup as of version 6.2.0: https://github.com/elastic/elasticsearch/pull/27606
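
If that's what is happening here, the translog stats of the affected copies should keep growing; roughly, something like this will show it (INDEX is a placeholder for the affected index):

curl -XGET "localhost:9200/INDEX/_stats/translog?level=shards&human"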

Looks like this problem is occurring on 6.2.4. Usually it's not a problem, but when I have to bounce a node, the shards that have this 'corruption' have to rebuild completely rather than just copying the translog. I'm also ingesting about 1TB per hour. If I can help with any details, I'd be glad to assist.

@djetcha if you see errors around put-mapping in your logs, you can be pretty sure it's https://github.com/elastic/elasticsearch/pull/30244