We are in the process of doing some rolling restarts on our Elasticsearch data nodes and running into issues relating to shard recovery.
When we upgraded from 2.3.x to 2.4.2 we followed the steps below and did not have any such problems. I will begin with some questions, and then outline what we are doing.
Questions
- How can we reliably force previous primaries and current replicas to recover from the restarted node's local data, as opposed to copying from remote nodes?
- How can we get logs explaining why the cluster decided to replicate from remote primaries as opposed to using the local copies? (A partial attempt at this is sketched just below.)
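On the logging question: we have already raised some loggers to DEBUG, as shown in the settings dump at the bottom. Presumably the allocation deciders themselves log under cluster.routing.allocation.decider; that logger name is our guess and unverified, but if it is right, the next step would be something like:
curl -XPUT 'localhost:9200/_cluster/settings?pretty' -d '
{
  "transient": {
    "logger.cluster.routing.allocation.decider": "TRACE"
  }
}'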
2.3.x to 2.4.2 Upgrade (Smooth sailing)
We performed the following for an effortless 2.3 - 2.4 upgrade last year.
1. Stop Indexing
2. Disable Shard Allocation
on es-master:
curl -XPUT 'localhost:9200/_cluster/settings?pretty' -d '
{
"transient": {
"cluster.routing.allocation.enable": "none"
}
}'
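It is worth confirming the setting actually took before proceeding:
curl 'localhost:9200/_cluster/settings?pretty'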
3. Perform Synced Flush (and ensure the results show success)
on es-master:
curl -XPOST 'localhost:9200/_all/_flush/synced?pretty'
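Rather than reading the whole response, a quick scan of the per-index failed counts catches problems (the grep is just a convenience; any JSON inspection works):
curl -s -XPOST 'localhost:9200/_all/_flush/synced?pretty' | grep '"failed"'
Any non-zero failed count means that index's synced flush should be retried before restarting the node.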
4. Restart the node
on es-data-node (assuming Elasticsearch is installed as a system service; adjust to your init system):
sudo service elasticsearch stop
<do system maintenance operations>
sudo service elasticsearch start
Wait for node to return to cluster
on es-master:
curl localhost:9200/_cat/nodes?v
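Rather than re-running _cat/nodes by hand, a small wait loop does the trick (29 is our node count: 24 data + 2 client + 3 master; adjust to taste):
# block until the cluster reports all 29 nodes again
until [ "$(curl -s 'localhost:9200/_cat/health?h=node.total' | tr -d '[:space:]')" = "29" ]; do
  sleep 5
done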
Once we are sure the node is back, re-enable shard allocation.
5. Re-enable Shard Allocation
on es-master:
curl -XPUT 'localhost:9200/_cluster/settings?pretty' -d'
{
"transient": {
"cluster.routing.allocation.enable": "all"
}
}'
Observe that shards on that node recover from their local data
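To verify recoveries really are local, _cat/recovery shows the recovery type per shard; on 2.x, as far as we recall, a local recovery reports type store while copying from another node reports type replica:
curl 'localhost:9200/_cat/recovery?v' | grep -v done
(The grep simply hides recoveries that are already complete.)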
And everything worked great. We had a cluster with 24 data nodes, 2 client nodes, and 3 master nodes. Within that cluster were 6 - 8 indexes with shard counts between 5 and 24, each with a replication factor of 1.
2.4.2 Rolling Restart (Troubled waters)
Now our shards are on the higher end of the recommended 30GB size, so when we have attempted another rolling restart of our data nodes (no version upgrade this time), we have not seen the same behavior. Steps 1 - 4 are done the same, but when step 5 (re-enable shard allocation) is executed, shards are being assigned to new nodes, or, with forced assignment, copying their data from remote nodes rather than from the local dataset.
Here is a list of things we have tried to do to rectify the situation:
Set node_left.delayed_timeout to 10 minutes
We were hoping this option alone would do it for us, but we are not seeing it work either.
curl -XPUT "localhost:9200/_all/_settings" -d '
{
"settings": {
"index.unassigned.node_left.delayed_timeout": "10m"
}
}'
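If the delay is registering, it should be visible in cluster health while the node is down as a non-zero delayed_unassigned_shards count (that field exists on 2.x as far as we know):
curl 'localhost:9200/_cluster/health?pretty'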
Force all allocation off / on
curl -XPUT 'localhost:9200/_cluster/settings?pretty' -d '
{
"transient": {
"cluster.routing.rebalance.enable": "NONE",
"cluster.routing.allocation.enable": "none",
"cluster.routing.allocation.disable_allocation": "true"
}
}'
<perform restart>
curl -XPUT 'localhost:9200/_cluster/settings?pretty' -d '
{
"transient": {
"cluster.routing.rebalance.enable": "ALL",
"cluster.routing.allocation.enable": "all",
"cluster.routing.allocation.disable_allocation": "false"
}
}'
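A quick way to list the shards still unassigned after the off/on cycle:
curl 'localhost:9200/_cat/shards?v' | grep UNASSIGNED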
Force Allocate Shards to the same node
curl -XPOST 'localhost:9200/_cluster/reroute' -d '
{
"commands": [
{"allocate": {"index": "pix_v1", "shard": 9, "node": "foo-bar-es-data-i-713814"}},
{"allocate": {"index": "pix_v1", "shard": 4, "node": "foo-bar-es-data-i-713814"}},
{"allocate": {"index": "pix_v2", "shard": 22, "node": "foo-bar-es-data-i-713814"}},
{"allocate": {"index": "pix_v2", "shard": 15, "node": "foo-bar-es-data-i-713814"}},
{"allocate": {"index": "people_v0", "shard": 4, "node": "foo-bar-es-data-i-713814"}}
]
}'
Here we just see the foo-bar-es-data-i-713814 node hit with high IO as it copies data from primaries on other member nodes, ignoring the local copy.
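One avenue for the "why" question: the reroute API also takes dry_run and explain flags (present on 2.x as far as we know), which should return each allocation decider's verdict for a command without actually applying it:
curl -XPOST 'localhost:9200/_cluster/reroute?explain=true&dry_run=true&pretty' -d '
{
  "commands": [
    {"allocate": {"index": "pix_v1", "shard": 9, "node": "foo-bar-es-data-i-713814"}}
  ]
}'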
Attempt Different Combinations of the Disable Options
We have attempted to keep cluster rebalancing off and just allocation on, and vice versa.
Current Settings:
{
  "transient": {
    "logger": {
      "indices": {
        "recovery": "DEBUG"
      },
      "cluster": {
        "service": "DEBUG",
        "routing": "DEBUG"
      }
    },
    "indices": {
      "recovery": {
        "max_bytes_per_sec": "40mb",
        "concurrent_streams": "3",
        "concurrent_small_file_streams": "2"
      }
    },
    "cluster": {
      "routing": {
        "allocation": {
          "enable": "all",
          "node_concurrent_recoveries": "2",
          "cluster_concurrent_rebalance": "2",
          "disable_allocation": "false"
        },
        "rebalance": {
          "enable": "ALL"
        }
      }
    }
  },
  "persistent": {
    "indices": {
      "recovery": {
        "max_bytes_per_sec": "200mb",
        "concurrent_streams": "5"
      }
    },
    "cluster": {
      "routing": {
        "allocation": {
          "node_concurrent_recoveries": "5"
        }
      }
    }
  }
}
Any guidance would be helpful. Thank you.