How to recover a failed data node in a production cluster?

ImdotnetJunkie · February 26, 2017, 5:20pm

What steps must be performed to recover a failed node in a cluster?

warkolm · February 26, 2017, 8:37pm

You should just be able to restart it and it'll rejoin.

Though that may depend on the failure mode.

ImdotnetJunkie · February 27, 2017, 4:07am

What if unassigned_shards > 0 or initializing_shards > 0, or, in general, the cluster health is red or yellow?

warkolm · February 27, 2017, 4:09am

If it's initialising then just wait for everything to finish.
If they're unassigned, then it depends.
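To see which shards are actually unassigned, one option is the `_cat/shards` endpoint. The sketch below is only an illustration: the base URL is an assumption, the sample rows are made up, and the parser assumes the default column order (index, shard, prirep, state, ...).

```python
import urllib.request


def fetch_cat_shards(base_url="http://localhost:9200"):
    """Return the plain-text body of GET /_cat/shards.

    base_url is an assumption; point it at one of your own nodes.
    """
    with urllib.request.urlopen(base_url + "/_cat/shards") as resp:
        return resp.read().decode("utf-8")


def unassigned_shards(cat_text):
    """Return (index, shard, prirep) for every UNASSIGNED row.

    Assumes the default _cat/shards columns, where the first four
    fields are index, shard number, p/r flag, and state.
    """
    found = []
    for line in cat_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[3] == "UNASSIGNED":
            found.append((fields[0], int(fields[1]), fields[2]))
    return found


# Made-up sample rows, for illustration only.
SAMPLE = """\
logs-2017.02.25 0 p STARTED 1200 1.1mb 10.0.0.1 node-1
logs-2017.02.25 0 r UNASSIGNED
logs-2017.02.26 1 p UNASSIGNED
"""

for index, shard, prirep in unassigned_shards(SAMPLE):
    kind = "primary" if prirep == "p" else "replica"
    print(index, shard, kind)
```

An unassigned primary (`p`) is what turns the cluster red; if only replicas (`r`) are unassigned, the cluster is yellow.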

What version are you on?

ImdotnetJunkie · February 27, 2017, 4:10am

We are still on the old version 1.7.5, though we are planning to upgrade.

This is our current status:

cluster_name: "abc",
status: "red",
timed_out: false,
number_of_nodes: 2,
number_of_data_nodes: 2,
active_primary_shards: 4226,
active_shards: 8452,
relocating_shards: 0,
initializing_shards: 0,
unassigned_shards: 3781,
delayed_unassigned_shards: 0,
number_of_pending_tasks: 0,
number_of_in_flight_fetch: 0
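For a rough reading of these numbers, here is a small sketch that summarises the health output above. The function name and the summary fields are hypothetical. The total assumes that relocating shards are already counted in active_shards, which is my understanding of the health API.

```python
# Values copied from the cluster health output above.
HEALTH = {
    "status": "red",
    "number_of_data_nodes": 2,
    "active_primary_shards": 4226,
    "active_shards": 8452,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 3781,
}


def summarize_health(health):
    """Derive a few headline figures from a cluster health dict."""
    # Assumption: relocating shards are included in active_shards,
    # so they are not added again here.
    total = (
        health["active_shards"]
        + health["initializing_shards"]
        + health["unassigned_shards"]
    )
    return {
        "total_shards": total,
        "percent_active": 100.0 * health["active_shards"] / total,
        "active_per_data_node": health["active_shards"] / health["number_of_data_nodes"],
        # Red means at least one primary shard is not allocated.
        "primaries_missing": health["status"] == "red",
    }


summary = summarize_health(HEALTH)
print(summary)
```

With the figures above, this works out to 12233 shards in total, roughly 69% of them active, and 4226 active shards on each of the two data nodes.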

warkolm · February 27, 2017, 4:12am

That's faaaaaaar too many shards for two nodes and likely to be causing your issues.

ImdotnetJunkie · February 27, 2017, 4:19am

We have around 800+ indices. How can this be resolved?

warkolm · February 27, 2017, 4:24am

Reindex. You will need to do that to get to 5.X anyway, so it might make sense to start a 5.X cluster, then run a reindex from remote to pull the data from that old cluster - assuming you still want it.
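As a sketch of what that could look like, the snippet below builds the JSON body for a `POST _reindex` request on the new 5.X cluster. The host name and index names are hypothetical, and the helper function is only for illustration. Pulling several small old indices into one destination index is one way to bring the shard count down. As far as I know, the old cluster's address also has to be listed under `reindex.remote.whitelist` in the new cluster's `elasticsearch.yml`.

```python
import json


def remote_reindex_body(remote_host, source_indices, dest_index):
    """Build a request body for POST _reindex with a remote source.

    remote_host: URL of the old cluster (hypothetical here).
    source_indices: list of index names to pull from the old cluster.
    dest_index: index to write into on the new cluster.
    """
    return {
        "source": {
            "remote": {"host": remote_host},
            "index": source_indices,
        },
        "dest": {"index": dest_index},
    }


# Hypothetical names: merge two daily indices into one monthly index.
body = remote_reindex_body(
    "http://old-cluster:9200",
    ["logs-2017.02.25", "logs-2017.02.26"],
    "logs-2017.02",
)
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of `POST _reindex` on the 5.X cluster; it is worth trying it on a single small index before moving all of them.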

ImdotnetJunkie · February 27, 2017, 4:28am

Thanks @warkolm. I will get back if I have any further questions.
