How to restart data nodes with outdated data?

Hi,

we have an Elasticsearch 7 cluster in production with 3 physical machines.

Each machine hosts 3 nodes:

  • 1 node: master-eligible + data
  • 2 nodes: data only

The cluster is configured so that replica shards are placed on nodes located on different physical machines.
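For readers who want to verify this kind of placement, here is a minimal sketch in Python. It works on the decoded output of `GET /_cat/shards?format=json` and assumes that all nodes on one physical machine report the same `ip` value, which may not hold in every setup. The function name is hypothetical.

```python
from collections import defaultdict


def copies_sharing_a_machine(shard_rows):
    """Return (index, shard) pairs whose started copies share an IP.

    shard_rows: decoded JSON from GET /_cat/shards?format=json.
    Assumption: nodes on the same physical machine report the same "ip".
    """
    machines = defaultdict(list)
    for row in shard_rows:
        # Unassigned copies have no node or IP, so only look at started ones.
        if row.get("state") == "STARTED":
            machines[(row["index"], row["shard"])].append(row["ip"])
    # A shard is a problem when two of its copies sit on the same IP.
    return sorted(
        key for key, ips in machines.items() if len(ips) != len(set(ips))
    )
```

An empty result suggests that no primary shares a machine with one of its replicas, under the assumption above.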

One of the physical machines has just crashed and it is going to be repaired in a few days.

The cluster status is currently "red" because one index had a replication factor of 0, and its only shard copy was located on a node on the crashed machine.
This index is not important, as we can rebuild it from scratch.

I wonder what the procedure is for restarting the data nodes on the crashed machine, given that data on the other machines keeps being updated while the crashed machine is down:

  1. Should we remove the red index before restarting the data nodes on the crashed machine, so that the cluster status becomes green again?
  2. Is it safe to restart these data nodes with outdated data, or should we clear all data on the crashed machine before restarting the nodes (i.e. so that when we restart the crashed machine, its nodes join the cluster as new nodes)?
  3. What is the right procedure to remove the data of a node? Is it enough to delete the folder defined by the "path.data" setting in config/elasticsearch.yml?

Thanks!

If the red index has changed since you lost the node, then yes, delete it. Otherwise it should recover.

For the rest, it depends a bit on which version you are running.

Now it's OK.

We removed the red index and the cluster became "green" again.
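For anyone in the same situation, a minimal sketch of how the red index could be located and deleted from Python. It assumes the cluster is reachable at `http://localhost:9200` without authentication; `ES_URL` and the function names are hypothetical, while `/_cat/indices?format=json` and `DELETE /<index>` are the standard REST endpoints.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: adjust to your cluster


def red_indices(index_rows):
    """Names of red indices in decoded GET /_cat/indices?format=json output."""
    return sorted(
        row["index"] for row in index_rows if row.get("health") == "red"
    )


def list_indices(base_url=ES_URL):
    """Fetch the index list as JSON rows."""
    url = base_url + "/_cat/indices?format=json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def delete_index(name, base_url=ES_URL):
    """Delete one index. This is irreversible, so check the name first."""
    req = urllib.request.Request(base_url + "/" + name, method="DELETE")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

A cautious use would be to print `red_indices(list_indices())` first, and to call `delete_index` only for an index that can really be rebuilt from scratch.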

We disconnected the physical machine from the network before repairing it, to prevent the old nodes with outdated data from joining the cluster when the machine started up.

After the physical machine was repaired, we logged in to it. The Elasticsearch services had already stopped, as the network was not available.
We purged the data/, logs/ and work/ folders.
We reconnected the physical machine to the network.
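The purge step could be scripted roughly as follows. This is only a sketch: the function name is hypothetical, it must only be run while the Elasticsearch service on that machine is stopped, and the folder paths depend on your installation.

```python
import shutil
from pathlib import Path


def clear_directory(path):
    """Delete everything inside path but keep the directory itself.

    Returns the sorted names of the entries that were removed.
    """
    removed = []
    for entry in Path(path).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # real sub-directory: remove recursively
        else:
            entry.unlink()  # regular file or symlink
        removed.append(entry.name)
    return sorted(removed)
```

Keeping the top-level directory means its ownership and permissions stay as they were, so the service can write to it again on restart.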

Then we restarted the nodes one by one.
For each of them, we monitored the /_cat/health?v and /_cat/nodes?v&s=name endpoints to make sure that the cluster status stayed "green" and that the node successfully joined the cluster.
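The same check can be automated. Below is a minimal sketch that queries the two endpoints with `format=json` instead of `v`; it assumes an unauthenticated cluster at `http://localhost:9200`, and `ES_URL` and the function names are hypothetical.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: adjust to your cluster


def fetch_json(path, base_url=ES_URL):
    """GET a _cat endpoint and decode its JSON body."""
    with urllib.request.urlopen(base_url + path, timeout=10) as resp:
        return json.load(resp)


def cluster_status(health_rows):
    """Status field from GET /_cat/health?format=json output."""
    return health_rows[0]["status"]


def node_joined(node_rows, node_name):
    """True if node_name is listed in GET /_cat/nodes?format=json output."""
    return any(row.get("name") == node_name for row in node_rows)


def check_after_restart(node_name, base_url=ES_URL):
    """True when the cluster is green and the restarted node is a member."""
    health = fetch_json("/_cat/health?format=json", base_url)
    nodes = fetch_json("/_cat/nodes?format=json&s=name", base_url)
    return cluster_status(health) == "green" and node_joined(nodes, node_name)
```

Calling `check_after_restart` with the name of the node that was just restarted, and waiting until it returns `True` before moving on to the next node, mirrors the manual procedure described above.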

We also saw that shard reallocation started after the first data node was restarted.

So you can close this case, Mark. Have a nice day.
