Safely disable/enable replicas to fix unassigned shards

(Steve Walsh) #1

Hi all,

We ran into an issue in our production ELK cluster last Friday. The system has since recovered but is in a yellow state with 778 unassigned shards. I've tried to reroute but the two errors coming back say the maximum amount of retries has occurred and also getting "CorruptIndexException[file mismatch, expected id=emn1xk6ao17n2rtyuigqwgs6t, got=beyzbx59saz7jis7fi2dr2rjm"

I found another topic over the weekend(which I cannot find now) but the fix is to set the replicas for each index to 0 and then set it back to 1 again.

PUT /my_index/_settings
"index" : {
    "number_of_replicas" : 1

What I want to ask is when I set the replicas to 1 again will it be too much for our stack to handle? The issue that occurred Friday was adding some new log data created around 500ish indices at once and crashed the system. If someone can point me in the right direction as to how much is a safe number to do at once that would be great

  "cluster_name": "dx-p-elk",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 8,
  "number_of_data_nodes": 4,
  "active_primary_shards": 17918,
  "active_shards": 35058,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 778,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 97.82899877218439

(Bernt Rostad) #2

If the maximum retries (5 by default) has been reached, Elasticsearch won't try again, I think the idea is to avoid an infinite retry loop and force an admin to look at the problem. If the problem was of a temporary nature and all is well now, you can manually force a retry using the Cluster Reroute API. E.g. something like this:

curl -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true'

However, in your case there seems to be a corrupt shard, and that can be harder to fix.

If your primary shards are all fine, i.e. if the indices are yellow, I would simply drop the replica shards, as you mention, one yellow index at a time (all green indices are fine and should be left alone). Your cluster should be able to handle that, but I would wait for the index to become green again (when replica shards are ready) before doing the same procedure on the next index, because if you overload a node and it drops out of the cluster you could lose data.

If you have one or more indices in a red state you have one or more corrupt primary shards, which means you can't recover the index by recreating the replica shards. There are only a few solutions (that I know of) to this situation:

  1. Delete the corrupt index and load data from your primary data storage into a new index.
  2. Delete the corrupt index and re-install it from the most recent snapshot.
  3. Accept some loss of data and use the Cluster Reroute API to allocate a stale or empty primary for the corrupt one.

Naturally, if Elasticsearch is your primary storage only #2 and 3 are viable solutions and if you have no recent snapshots then you will lose some data, whether you manage to reactivate a stale shard or (like me once) is forced to allocate a new empty shard for the corrupt one to save the rest of the index.

(Steve Walsh) #3

Thanks for the pointers @Bernt_Rostad. We did have some initially in red state but it has since recovered. There's now only a few left in yellow state which I'm working through them.

In my haste to write a script to automatically iterate through the shards that needed dropping of the replica something went wrong and now I've disabled replicas on every index! I am now doing as you say and slowly going through each one to rebuild the replicas.

Is there an equation or anything that would let me know how many replicas my system can handle creating before overloading anything?

(Bernt Rostad) #4

I'm glad to hear that, then you didn't lose any data.

I don't think so. You may be able to restore multiple indices without overloading a single node, I'm just using the one-index-at-a-time replication "rule" as a safety precaution in my own clusters.

The capacity of your cluster to create new replicas mainly depends on the number of nodes and, naturally, the I/O speed of your disks. The more nodes you have, the more places new replicas can be created and the faster the disks the quicker you can create the replicas. If you have N nodes and each one creates X replicas concurrently your cluster can create N * X replicas concurrently.

But to speed things up you may want to look at increasing your Indices Recovery speed (the number of MB written to disk per second) which is fairly low by default (40 MB; I typically set this to 100 MB in my clusters). You should also look at the Shard Allocation Settings, the node_concurrent_recoveries is used to set the maximum number of shards each node can recover concurrently.

(system) #5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.