Hi,
We've recently been getting a few errors due to 'out of sync replicas', which is affecting certain cluster operations. The issues seem to coincide with an upgrade from 6.1.2 to 6.2.3, although they could be completely unrelated.
We have a Curator job that runs in the early morning each day, but it has recently been failing due to out-of-sync replicas on the same index each time:
2018-04-03 01:00:39,310 ERROR Failed to complete action: close. <class 'curator.exceptions.FailedExecution'>: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: TransportError(409, u'{"_shards":{"total":44,"successful":41,"failed":3},".monitoring-logstash-6-2018.04.02":{"total":2,"successful":2,"failed":0},".monitoring-kibana-6-2018.04.02":{"total":2,"successful":2,"failed":0},"beats-2018.04.02":{"total":4,"successful":4,"failed":0},"metrics20-2018.04.02":{"total":20,"successful":20,"failed":0},"applicationlogs-2018.04.02":{"total":12,"successful":9,"failed":3,"failures":[{"shard":3,"reason":"out of sync replica; num docs on replica [23209297]; num docs on primary [23209299]","routing":{"state":"STARTED","primary":false,"node":"x9vWEhARS5ef63_YWknIlQ","relocating_node":null,"shard":3,"index":"applicationlogs-2018.04.02","allocation_id":{"id":"4FfgsYKxQUa6yH64AX1ybg"}}},{"shard":1,"reason":"out of sync replica; num docs on replica [23212334]; num docs on primary [23212335]","routing":{"state":"STARTED","primary":false,"node":"XuLt8WgrSDeTBlz4-kC7sg","relocating_node":null,"shard":1,"index":"applicationlogs-2018.04.02","allocation_id":{"id":"0koS5ZM8R0aIiDoXhzqziw"}}},{"shard":5,"reason":"out of sync replica; num docs on replica [23213113]; num docs on primary [23213114]","routing":{"state":"STARTED","primary":false,"node":"1JEjyXKtTQuwOThziCwhMw","relocating_node":null,"shard":5,"index":"applicationlogs-2018.04.02","allocation_id":{"id":"3t_zNITkSwi-VAFQwPmuyg"}}}]},".monitoring-es-6-2018.04.02":{"total":2,"successful":2,"failed":0},"asm-2018.04.02":{"total":2,"successful":2,"failed":0}}')
"applicationlogs-2018.04.02": {
"total": 12,
"successful": 9,
"failed": 3,
"failures": [
{
"shard": 3,
"reason": "out of sync replica; num docs on replica [23209297]; num docs on primary [23209299]",
"routing": {
"state": "STARTED",
"primary": false,
"node": "x9vWEhARS5ef63_YWknIlQ",
"relocating_node": null,
"shard": 3,
"index": "applicationlogs-2018.04.02",
"allocation_id": {
"id": "4FfgsYKxQUa6yH64AX1ybg"
}
}
},
{
"shard": 1,
"reason": "out of sync replica; num docs on replica [23212334]; num docs on primary [23212335]",
"routing": {
"state": "STARTED",
"primary": false,
"node": "XuLt8WgrSDeTBlz4-kC7sg",
"relocating_node": null,
"shard": 1,
"index": "applicationlogs-2018.04.02",
"allocation_id": {
"id": "0koS5ZM8R0aIiDoXhzqziw"
}
}
},
{
"shard": 5,
"reason": "out of sync replica; num docs on replica [23213113]; num docs on primary [23213114]",
"routing": {
"state": "STARTED",
"primary": false,
"node": "1JEjyXKtTQuwOThziCwhMw",
"relocating_node": null,
"shard": 5,
"index": "applicationlogs-2018.04.02",
"allocation_id": {
"id": "3t_zNITkSwi-VAFQwPmuyg"
}
}
}
]
},
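I believe the 409 above comes from the synced flush that Curator performs as part of the close action (rather than from the close API itself), though I'm not certain of that. For reference, this is roughly what I've been running by hand from Kibana Dev Tools to look at the affected shards (index name taken from the error):

    # per-shard doc counts for the affected index (primaries vs replicas)
    GET _cat/shards/applicationlogs-2018.04.02?v&h=index,shard,prirep,docs,state,node

    # re-run the synced flush manually to see whether the same shards still report "out of sync replica"
    POST applicationlogs-2018.04.02/_flush/synced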
The only topics I've come across online suggest it could be due to deleted documents, but I think that can be ruled out, as this index is purely additive; nothing is ever removed. The job above closes the indices (before compression is applied, routing to warm nodes, force merge, etc.), and it runs an hour after the index has stopped being written to (as it rolls daily).
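In case the job definition matters, the close step is a fairly standard Curator action-file entry. This is a simplified sketch rather than our exact file, and the filter values here are illustrative:

    actions:
      1:
        action: close
        description: "Close yesterday's applicationlogs index before force merge / warm routing"
        options:
          delete_aliases: False
          ignore_empty_list: True
        filters:
        - filtertype: pattern
          kind: prefix
          value: applicationlogs-
        - filtertype: age
          source: name
          direction: older
          timestring: '%Y.%m.%d'
          unit: days
          unit_count: 1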
This has been happening fairly consistently recently and I'm not sure why. The current fix has been to reduce replicas to 0, perform the operations needed, then increase replicas back to 1 (a rough sketch of those steps is included after the settings below). The settings for the index are below (taken from today's index):
{
  "applicationlogs-2018.04.03": {
    "settings": {
      "index": {
        "codec": "best_compression",
        "routing": {
          "allocation": {
            "require": {
              "box_type": "hot"
            }
          }
        },
        "refresh_interval": "30s",
        "number_of_shards": "6",
        "translog": {
          "flush_threshold_size": "1gb",
          "sync_interval": "15s",
          "durability": "async"
        },
        "provided_name": "applicationlogs-2018.04.03",
        "creation_date": "1522713600418",
        "store": {
          "type": "mmapfs"
        },
        "unassigned": {
          "node_left": {
            "delayed_timeout": "20m"
          }
        },
        "number_of_replicas": "1",
        "uuid": "UKteUDw_Rl-pimKXiwTwpg",
        "version": {
          "created": "6020399"
        }
      }
    }
  }
}
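The workaround mentioned above is just a pair of settings updates around the maintenance steps, roughly like this (today's index used as an example):

    # drop the replicas so the close / force merge can go ahead
    PUT applicationlogs-2018.04.03/_settings
    {
      "index": {
        "number_of_replicas": 0
      }
    }

    # ... run the close, force merge, warm routing steps ...

    # then restore the replica
    PUT applicationlogs-2018.04.03/_settings
    {
      "index": {
        "number_of_replicas": 1
      }
    }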
I've also verified that the cluster was green at the time of the errors and that there were no ongoing rebalance operations (from looking at the monitoring page).
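The monitoring page is what I checked, but for reference these are the API calls I'd use to double-check cluster state and any in-flight recoveries (nothing exotic, just the standard endpoints):

    GET _cluster/health?filter_path=status,relocating_shards,initializing_shards,unassigned_shards
    GET _cat/recovery?v&active_only=true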
Any help would be much appreciated
Cheers,
Mike