How to recover a failed data node in a production cluster?

What steps must be performed to recover a failed node in a cluster?

You should just be able to restart it and it'll rejoin.

Though that may depend on the failure mode.

What if

unassigned_shards > 0
OR
initializing_shards > 0

OR, in general, the cluster health is red or yellow?

If shards are initialising, just wait for everything to finish.
If they're unassigned, then it depends.
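
As a rough sketch of that triage, the snippet below reads the cluster health response and maps it to the advice above. The base URL (http://localhost:9200) and the helper names fetch_health and next_step are assumptions for illustration; only the _cluster/health endpoint and its field names come from Elasticsearch itself.

```python
import json
from urllib.request import urlopen


def fetch_health(base_url="http://localhost:9200"):
    """Fetch cluster health as a dict. base_url is an assumed address."""
    with urlopen(base_url + "/_cluster/health") as resp:
        return json.loads(resp.read().decode("utf-8"))


def next_step(health):
    """Map a cluster health response to the advice given in this thread."""
    if health.get("initializing_shards", 0) > 0:
        # Shards are still recovering: let them finish, then check again.
        return "wait"
    if health.get("unassigned_shards", 0) > 0:
        # Nothing is recovering but shards are still unassigned: needs a closer look.
        return "investigate"
    return "ok"


if __name__ == "__main__":
    # Offline example; against a live node use next_step(fetch_health()).
    print(next_step({"initializing_shards": 0, "unassigned_shards": 3781}))
```

Checking for initialising shards first is a deliberate choice here: while recovery is still running, the unassigned count can keep changing, so it is only worth investigating once initialisation has settled.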

What version are you on?

We are still on an old version, 1.7.5, though we are planning to upgrade.

This is our current status:

cluster_name: "abc",
status: "red",
timed_out: false,
number_of_nodes: 2,
number_of_data_nodes: 2,
active_primary_shards: 4226,
active_shards: 8452,
relocating_shards: 0,
initializing_shards: 0,
unassigned_shards: 3781,
delayed_unassigned_shards: 0,
number_of_pending_tasks: 0,
number_of_in_flight_fetch: 0

That's far too many shards for two nodes (over 12,000 once you count the unassigned ones), and it's likely to be causing your issues.
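
To put numbers on that, here is a small sketch that works out the shard load from the status posted above. The function name shard_load is made up for illustration, and the index count of 800 is an assumption based on the "800+" figure mentioned in this thread, so the per-index value is only approximate.

```python
def shard_load(health, index_count):
    """Summarise how many shards the cluster is being asked to hold."""
    total = (
        health["active_shards"]
        + health["initializing_shards"]
        + health["unassigned_shards"]
    )
    return {
        "total_shards": total,
        "shards_per_data_node": total / health["number_of_data_nodes"],
        "shards_per_index": total / index_count,
    }


# Values copied from the cluster health output posted in this thread.
status = {
    "number_of_data_nodes": 2,
    "active_shards": 8452,
    "initializing_shards": 0,
    "unassigned_shards": 3781,
}

load = shard_load(status, index_count=800)  # 800 is an assumed index count
print(load["total_shards"])          # 12233
print(load["shards_per_data_node"])  # 6116.5
```

That works out to roughly 15 shards per index, which suggests that reducing the shard count per index when reindexing would bring the total down considerably.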

We have 800+ indices. How can this be resolved?

Reindex. You will need to do that to get to 5.X anyway, so it might make sense to start a 5.X cluster, then run a reindex from remote to pull the data across from the old cluster, assuming you still want it.
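
For reference, a reindex-from-remote request on the new 5.X cluster looks roughly like the body built below. This is a sketch only: the host http://old-cluster:9200, the index name logs-2016 and the helper remote_reindex_body are hypothetical. The body would be sent as POST _reindex to the new cluster, which also needs the old host listed under reindex.remote.whitelist in its elasticsearch.yml.

```python
import json


def remote_reindex_body(old_host, source_index, dest_index):
    """Build the JSON body for POST _reindex on the new cluster."""
    return {
        "source": {
            # The old cluster to pull documents from.
            "remote": {"host": old_host},
            "index": source_index,
        },
        "dest": {"index": dest_index},
    }


# Hypothetical host and index names, for illustration only.
body = remote_reindex_body("http://old-cluster:9200", "logs-2016", "logs-2016")
print(json.dumps(body, indent=2))
```

Creating the destination index up front with fewer shards (or pointing several old indices at one destination) is one way to cut the shard count during the move.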

Thanks @warkolm. I will get back if I have any further questions.
