Rolling Restart -- Local Replicas do not reuse local data (2.4.2)

We are in the process of doing some rolling restarts on our Elasticsearch data nodes and running into issues relating to shard recovery.

When we upgraded from 2.3.x to 2.4.2 we followed the steps outlined below and did not have any such problems.

I will begin with some questions, and then outline what we are doing below:

Questions

  • How can we reliably force previous primaries and current replicas to recover from the restarted node's local data as opposed to from remote nodes?
  • How can we get logging that explains why the decision was made to replicate from remote primaries as opposed to reusing local replicas?

2.3.x to 2.4.2 Upgrade (Smooth sailing)

We performed the following for an effortless 2.3 - 2.4 upgrade last year.

Stop Indexing
Disable Shard Allocation

on es-master:

curl -XPUT 'localhost:9200/_cluster/settings?pretty' -d '
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}'
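
As a sanity check, the same settings endpoint can be read back to confirm allocation is really off:

curl 'localhost:9200/_cluster/settings?pretty'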

Perform Synced Flush (and ensure the results report success)

on es-master:

curl -XPOST 'localhost:9200/_all/_flush/synced?pretty'

Restart the node

on es-data-node:

sudo service elasticsearch stop    # or systemctl, depending on the platform
<do system maint operations>
sudo service elasticsearch start

Wait for node to return to cluster

on es-master:

curl localhost:9200/_cat/nodes?v

Once we are sure the node is back, re-enable shard allocation.

Re-enable Shard Allocation

on es-master:

curl -XPUT 'localhost:9200/_cluster/settings?pretty' -d'
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}'

Observe that shards allocate via local data on that node
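
A rough way to confirm the local data is actually being reused (the exact column set varies a bit between versions, so treat this as a sketch) is to watch the recoveries while they run:

curl 'localhost:9200/_cat/recovery?v'

Recoveries that reuse a node's existing files finish almost immediately and show very little in the bytes/files columns, while recoveries that copy from a remote node show the full shard size being transferred.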

And everything worked great. The cluster had 24 data nodes, 2 client nodes, and 3 master nodes, holding 6 - 8 indexes with shard counts between 5 and 24 and a replication factor of 1.

2.4.2 Rolling Restart (troubled waters)

Now our shards are on the higher end of the recommended 30GB size, and when we attempted another rolling restart of our data nodes (no version update this time), we did not see the same behavior. Steps 1 - 4 are all done the same, but when step 5 (re-enable shard allocation) is executed, the shards are either assigned to new nodes or, with forced assignment, copy their data from remote nodes rather than from the local dataset.

Here is a list of things we have tried to do to rectify the situation:

Set node_left.delayed_timeout to 10mins

We were hoping that this option alone would do it for us, but we aren't seeing it work either.

curl -XPUT "localhost:9200/_all/_settings" -d '
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "10m"
  }
}'
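
If the delay were being honoured, cluster health would report the shards as delayed rather than reallocating them straight away, which is easy to watch while the node is down:

curl 'localhost:9200/_cluster/health?pretty'

The delayed_unassigned_shards field should stay non-zero for the duration of the timeout, and the shards should remain unassigned until the restarted node rejoins.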

Force all allocation off / on

curl -XPUT 'localhost:9200/_cluster/settings?pretty' -d '
{
  "transient": {
    "cluster.routing.rebalance.enable": "NONE",
    "cluster.routing.allocation.enable": "none",
    "cluster.routing.allocation.disable_allocation": "true"
  }
}'

<perform restart>

curl -XPUT 'localhost:9200/_cluster/settings?pretty' -d '
{
  "transient": {
    "cluster.routing.rebalance.enable": "ALL",
    "cluster.routing.allocation.enable": "all",
    "cluster.routing.allocation.disable_allocation": "false"
  }
}'

Force Allocate Shards to the same node

curl -XPOST 'localhost:9200/_cluster/reroute' -d '
{
  "commands": [
    {"allocate": {"index": "pix_v1", "shard": 9, "node": "foo-bar-es-data-i-713814"}},
    {"allocate": {"index": "pix_v1", "shard": 4, "node": "foo-bar-es-data-i-713814"}},
    {"allocate": {"index": "pix_v2", "shard": 22, "node": "foo-bar-es-data-i-713814"}},
    {"allocate": {"index": "pix_v2", "shard": 15, "node": "foo-bar-es-data-i-713814"}},
    {"allocate": {"index": "people_v0", "shard": 4, "node": "foo-bar-es-data-i-713814"}}
  ]
}'

Here we just see the foo-bar-es-data-i-713814 node being hit with high IO as it copies the data from primaries on other member nodes. It is ignoring the local copy.
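
The per-index recovery API shows where the bytes are coming from and whether any local files are being reused; a sketch using pix_v1 from the reroute above:

curl 'localhost:9200/pix_v1/_recovery?pretty&active_only=true'

In the response, each shard's index.files.reused and index.size.reused_in_bytes should be close to the totals when the existing copy is reused, and near zero when everything is copied over the network.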

Attempt Different Combinations of the Disable Options
We have attempted to keep cluster rebalancing off and just allocation on, and vice versa.

Current Settings:

{
  "transient": {
    "logger": {
      "indices": {
        "recovery": "DEBUG"
      },
      "cluster": {
        "service": "DEBUG",
        "routing": "DEBUG"
      }
    },
    "indices": {
      "recovery": {
        "max_bytes_per_sec": "40mb",
        "concurrent_streams": "3",
        "concurrent_small_file_streams": "2"
      }
    },
    "cluster": {
      "routing": {
        "allocation": {
          "enable": "all",
          "node_concurrent_recoveries": "2",
          "cluster_concurrent_rebalance": "2",
          "disable_allocation": "false"
        },
        "rebalance": {
          "enable": "ALL"
        }
      }
    }
  },
  "persistent": {
    "indices": {
      "recovery": {
        "max_bytes_per_sec": "200mb",
        "concurrent_streams": "5"
      }
    },
    "cluster": {
      "routing": {
        "allocation": {
          "node_concurrent_recoveries": "5"
        }
      }
    }
  }
}

Any guidance would be helpful. Thank you.

Two questions that quickly come to mind:

  1. Did you stop indexing? Indexing invalidates the synced flush markers.
  2. Did you check that the output of the synced flush indicated it was successful?

@bleskes,

Yes, we stopped our ingress pipeline and observed our Elasticsearch metrics to ensure no index operations were being propagated to the cluster. As for the synced flush: yes, we saw something similar to:

{
  "_shards" : {
    "total" : 135,
    "successful" : 135,
    "failed" : 0
  },
  "blomp_v0" : {
    "total" : 10,
    "successful" : 10,
    "failed" : 0
  },
  "baz_v1" : {
    "total" : 48,
    "successful" : 48,
    "failed" : 0
  },
  ".kibana" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "baz_v2" : {
    "total" : 48,
    "successful" : 48,
    "failed" : 0
  },
  "boop" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "foo_v0" : {
    "total" : 10,
    "successful" : 10,
    "failed" : 0
  },
  "bar_v0" : {
    "total" : 10,
    "successful" : 10,
    "failed" : 0
  }
}

I was originally thinking that some rogue index operations could have been causing the primaries to become out of sync, but I am positive we have stopped all incoming index jobs.

Something interesting compared to last time is that we now have two types of data nodes in the cluster, a and b. Most of the data lives on the a nodes, but we run some CPU-intensive queries on the b nodes, so we have defined rules to allocate certain indices to certain nodes (roughly like the sketch below). Could that be confusing the balancer in some way?
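
The rules themselves are plain shard allocation filtering, roughly like the following (the attribute name here is made up; ours differs):

curl -XPUT 'localhost:9200/pix_v1/_settings' -d '
{
  "index.routing.allocation.require.box_type": "a"
}'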

If the cluster was stable before, then those allocation rules should not confuse the balancer. I also understand that you tried force-allocating to foo-bar-es-data-i-713814, a node you know has the data, and the recovery still takes a long time. This means that the balancer made the right allocation decision but that the recovery didn't reuse files.

Here is what I suggest you do - choose a single shard on a node you want to monitor. When your index is green, stop indexing and issue a synced flush. Use the procedure described here to verify that the synced flush markers are put on both shard copies (and are the same). Restart the node hosting the shard and see how it goes. You can set the indices.recovery logger to TRACE before the restart and it will dump a lot of information about the recovery and why it does what it does.
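
For reference, something along these lines should cover both checks (a sketch; substitute the index that holds the shard you pick, and note the logger setting is the same one already sitting at DEBUG in your transient settings):

curl 'localhost:9200/pix_v1/_stats?level=shards&pretty'

curl -XPUT 'localhost:9200/_cluster/settings?pretty' -d '
{
  "transient": {
    "logger.indices.recovery": "TRACE"
  }
}'

The first call exposes the commit section of every shard copy, where the sync_id marker should be present and identical across copies; the second turns recovery logging up before the restart.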

It's hard to say that was the only change, because we also upgraded from 2.3 to 2.4.2 at that time, so it could be something introduced in the version upgrade as well.

Thank you for that suggestion @bleskes. I will give that a try and report back.
