Out Of Sync Replica

Ok. Thanks. I don't expect the problem to fix itself once it occurs. The documents have simply failed to arrive at the replica.

I'm still looking at the data you sent. Two more questions in the meantime:

  1. How often does the issue occur - is it daily?
  2. I see your translog durability is set to async. Is it an option for you to shift to sync and see if it helps?

One more ask - can you share the output of GET INDEX/_recovery?detailed&human for the index in question? I'm looking for the recovery information of the replica that's missing an op.
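
For reference, a rough sketch of both asks (INDEX is a placeholder for your index name; the durability change is a dynamic index setting):

curl -XPUT "localhost:9200/INDEX/_settings" -H 'Content-Type: application/json' -d'
{ "index.translog.durability": "request" }'

curl -XGET "localhost:9200/INDEX/_recovery?detailed&human"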

How often does the issue occur - is it daily?

I'd say the issue has been happening on about 50% of the days since we upgraded. On the last couple of days we didn't see the issue; I think the codec was default then, though it's back to best compression now.

I see your translog durability is set to async. Is it an option for you to shift to sync and see if it helps?

Will do, although for our bigger indices we wouldn't be able to run without async full time; async helps a lot with IO.

can you share the output of GET INDEX/_recovery?detailed&human for the index in question?

New index but similar issue, document with sort 15884958 (Shard 2) is missing:

GET /_stats?level=shards
GET /_recovery?detailed&human

Thanks. Your recovery info shows that the replica was recovered while it was empty and was never relocated. Primary relocation also didn't happen. That helps tremendously in narrowing down the search space.

On the last couple of days we didn't see the issue; I think the codec was default then, though it's back to best compression now.

It would be very surprising if the codec was at fault.
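
For anyone following along, the codec being discussed is the index.codec setting (default vs. best_compression). A rough sketch of flipping an existing index back to the default codec; note the setting is static, so the index has to be closed first (INDEX is a placeholder):

curl -XPOST "localhost:9200/INDEX/_close"
curl -XPUT "localhost:9200/INDEX/_settings" -H 'Content-Type: application/json' -d'
{ "index.codec": "default" }'
curl -XPOST "localhost:9200/INDEX/_open"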

Will do, although for our bigger indices we wouldn't be able to run without async full time; async helps a lot with IO.

Thanks!

Keep us posted.

@Evesy how are things going? Is this still an issue with the new settings?

Hey Boaz,

We've definitely observed the issue less when using the default codec rather than best compression, but we do still see it a couple of times a week.

I haven't had a chance yet to disable async, as I want to make sure I'm around to monitor things when we change it -- but I will have it in place for tomorrow, so I'll see how things go over the weekend etc.

Cheers,
Mike

Thanks Mike.

We got the same error when using version 6.2.2; here is the synced flush result:

    {
      "_shards": {
        "total": 10,
        "successful": 5,
        "failed": 5
      },
      "test_binlog_event_2018-04-24": {
        "total": 10,
        "successful": 5,
        "failed": 5,
        "failures": [
          {
            "shard": 3,
            "reason": "out of sync replica; num docs on replica [87]; num docs on primary [97]",
            "routing": {
              "state": "STARTED",
              "primary": false,
              "node": "SXrqPc2XQPCF68uTI9BnYg",
              "relocating_node": null,
              "shard": 3,
              "index": "test_binlog_event_2018-04-24",
              "allocation_id": {
                "id": "a3bV_lHRRoikp5p4k9vmvw"
              }
            }
          },
          {
            "shard": 1,
            "reason": "out of sync replica; num docs on replica [98]; num docs on primary [119]",
            "routing": {
              "state": "STARTED",
              "primary": false,
              "node": "j88PNMEHQKKja18HsHlXqw",
              "relocating_node": null,
              "shard": 1,
              "index": "test_binlog_event_2018-04-24",
              "allocation_id": {
                "id": "RaKMm3cfTiqqwtmPlkIQRg"
              }
            }
          },
          {
            "shard": 0,
            "reason": "out of sync replica; num docs on replica [105]; num docs on primary [115]",
            "routing": {
              "state": "STARTED",
              "primary": false,
              "node": "BHGBPYvyRxCdallV4VzZoQ",
              "relocating_node": null,
              "shard": 0,
              "index": "test_binlog_event_2018-04-24",
              "allocation_id": {
                "id": "aS-UZCNyRw21gnRVWomruw"
              }
            }
          },
          {
            "shard": 4,
            "reason": "out of sync replica; num docs on replica [99]; num docs on primary [106]",
            "routing": {
              "state": "STARTED",
              "primary": false,
              "node": "B3KrLbcCTdmMlrXnw6XNEw",
              "relocating_node": null,
              "shard": 4,
              "index": "test_binlog_event_2018-04-24",
              "allocation_id": {
                "id": "T41VwaIISmuLTMvpDFcPjA"
              }
            }
          },
          {
            "shard": 2,
            "reason": "out of sync replica; num docs on replica [114]; num docs on primary [122]",
            "routing": {
              "state": "STARTED",
              "primary": false,
              "node": "of9qRxVKToGlra8WJaN-0Q",
              "relocating_node": null,
              "shard": 2,
              "index": "test_binlog_event_2018-04-24",
              "allocation_id": {
                "id": "PKFLj7fAR4SU4i1SEu7yYQ"
              }
            }
          }
        ]
      }
    }

Any ideas on how to fix this?

The total doc count on the primary shards is 559; on the replicas there are 503 docs.

GET test_binlog_event_2018-04-24/_search?preference=_primary
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 559,
    "max_score": 1,
    "hits": [
      {
      ...
       }
      ]
  }
}
GET test_binlog_event_2018-04-24/_search?preference=_replica
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 503,
    "max_score": 1,
    "hits": [
      {
      ...
      }
      ]
  }
}
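
For reference, something like the cat shards API gives the same comparison per shard copy:

curl -XGET "localhost:9200/_cat/shards/test_binlog_event_2018-04-24?v&h=index,shard,prirep,docs,node"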

Also, the seq_no local checkpoint on replica shards 2 and 4 is -1 (maybe on more shards):
shard 2 stats https://gist.github.com/Lunatictwo/ddbe9be06290ba9b3b6872f38bbe7b16
shard 4 stats https://gist.github.com/Lunatictwo/3060a81480fdab8b58b4516f83a84074
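
(The gists above come from the shard-level stats API, something along the lines of the call below; the seq_no section of each shard copy is where the local checkpoint shows up.)

curl -XGET "localhost:9200/test_binlog_event_2018-04-24/_stats?level=shards"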

Setting the replica count to 0 and then back to 1 fixes this error (roughly as shown below), but I want to know whether it's a bug in version 6.2.2. If you need more info, I'd be glad to provide it. @bleskes
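
The workaround, roughly:

curl -XPUT "localhost:9200/test_binlog_event_2018-04-24/_settings" -H 'Content-Type: application/json' -d'
{ "index.number_of_replicas": 0 }'

curl -XPUT "localhost:9200/test_binlog_event_2018-04-24/_settings" -H 'Content-Type: application/json' -d'
{ "index.number_of_replicas": 1 }'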

Thanks!

Steven

Thanks @lunatic. I want to double check - when you made the synced flush call, was indexing stopped? If not, it's expected to fail. Regarding the searches - was indexing stopped there too, and did you run a refresh?
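
In other words, with indexing fully stopped, a sequence along these lines should succeed on every shard copy:

curl -XPOST "localhost:9200/test_binlog_event_2018-04-24/_refresh"
curl -XPOST "localhost:9200/test_binlog_event_2018-04-24/_flush/synced"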

I presume that the above is OK. Can you reproduce this reliably? Since your index is so small, we can enable trace logging, which will help immensely in finding the source of the problem. If so, please run the following command on the cluster and try to reproduce. When you do, please send the logs to first name at elastic.co, and please also send the shard stats so I will know which operations to look at.

This is the command to enable the logs:

curl -XPUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "logger.org.elasticsearch.action.bulk": "TRACE",
    "logger.org.elasticsearch.action.support.replication": "TRACE", 
    "logger.org.elasticsearch.cluster.service": "TRACE",
    "logger.org.elasticsearch.indices.recovery": "TRACE", 
    "logger.org.elasticsearch.index.shard": "TRACE"
  }
}'
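
Once you've captured a reproduction, the levels can be reset by nulling the same settings, e.g.:

curl -XPUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "logger.org.elasticsearch.action.bulk": null,
    "logger.org.elasticsearch.action.support.replication": null,
    "logger.org.elasticsearch.cluster.service": null,
    "logger.org.elasticsearch.indices.recovery": null,
    "logger.org.elasticsearch.index.shard": null
  }
}'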

Thanks @bleskes.
Here are replies of your questions:

When you made the synced flush call, was indexing stopped?
A: Yes, indexing was stopped when we flushed.

Was indexing stopped there too, and did you run a refresh?
A: I ran a refresh but still got the errors.

I will try to reproduce the error and I'll send you the log later.

@bleskes hi Boaz. I've sent you the logs.

@Evesy the logs @lunatic supplied give me enough of a thread to pull on, and I think I've found the source of the issue. There's no need for you to lose many hours on experiments, but you can help confirm this: can you check your logs for failures around mapping updates? Grep for put-mapping or put mappings.
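
Something like this against the server logs should be enough (the log path is just an example; adjust for your install):

grep -iE "put-mapping|put mappings" /var/log/elasticsearch/*.log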

@bleskes I get 4 hits for put-mapping over the last 14 days, all with messages pretty much equivalent to the one below:

[2018-04-14T00:51:49,666][WARN ][o.e.c.s.MasterService ] [elasticsearch-master-0] cluster state update task [put-mapping[application-logs]] took [30.1s] above the warn threshold of 30s

@Evesy @lunatic thank you for the research. I opened https://github.com/elastic/elasticsearch/pull/30244 with a proposed fix. It needs to be reviewed but you can follow progress on github.

I see it, thanks @bleskes

@bleskes Cheers for looking into this -- do you think this & Replica Shards Much Larger Than Primaries are down to the same root cause (as we do seem to notice they happen together), or does it look like a separate issue to you?

Thanks

It might. A lagging local checkpoint will prevent translog cleanup as of version 6.2.0: https://github.com/elastic/elasticsearch/pull/27606
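
If that's what is happening here, the translog stats of the affected copies should keep growing; roughly, something like this will show it (INDEX is a placeholder for the affected index):

curl -XGET "localhost:9200/INDEX/_stats/translog?level=shards&human"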

Looks like this problem is occurring on 6.2.4. Usually it's not a problem, but when I have to bounce a node, the shards that have this 'corruption' have to rebuild completely rather than just copying the translog. I'm also ingesting about 1TB per hour. If I can help with any details, I'd be glad to assist.

@djetcha if you see errors around put-mapping in your logs, you can be pretty sure it's https://github.com/elastic/elasticsearch/pull/30244