Issues with backup repository setup & shard allocation

Hi Team,
Below is my production setup (output of GET _cat/nodes):

10.XX.YY.ZZ 25 99 13 1.81 1.52 1.29 md - prodcution_master_data_Server3
10.XX.YY.ZZ 41 99  7 0.25 0.25 0.29 d  - prodcution_data_Server1
10.XX.YY.ZZ 21 99 10 1.61 1.47 1.28 -  - prodcution_client_Server3
10.XX.YY.ZZ  6 99  4 0.25 0.25 0.29 m  - prodcution_master_Server1
10.XX.YY.ZZ 72 99 16 0.83 0.77 0.93 md * prodcution_master_data_Server2

I am now setting up an archival strategy here to take snapshots of all indices, including the cluster state, but I am getting frequent shard allocation errors.

The default number of shards allocated per index is 5, and I have reduced the number of replicas to 1.
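
For reference, reducing the replica count on existing indices can be done with the update settings API; a sketch, where the index pattern is just an example matching the January indices from later in this thread:

PUT /index-name-2019.01.*/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}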

Also, after setting up the backup repository, I did take a few backups for a set of indices, but at one point I got a "concurrent_snapshot_execution_exception", which is blocking me from proceeding further. I tried to query the status, but still couldn't get any information beyond the snapshot itself.
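
A concurrent_snapshot_execution_exception indicates that another snapshot operation is still running; any in-flight snapshots can be checked with the snapshot status API. A sketch, using the repository name from the examples below:

# all currently running snapshots, across all repositories
GET /_snapshot/_status

# currently running snapshots in this repository
GET /_snapshot/index_backups/_current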

Is there anything I could do here that would help me proceed further and resolve all these problems? I am planning to reduce the number of shards to 3 (will I be able to achieve it?), and I will also set up cron jobs and Curator here soon.

It's impossible to help without seeing the actual errors you're getting.

Similarly, it's impossible to help without seeing the full response. There will be a lot more detail than just this message.

OK, here is my problem:
I was able to take a backup (snapshot) of the set of indices which I indexed in the month of January, with the API call below.

PUT /_snapshot/index_backups/snapshot_jan19_index-name?wait_for_completion=true
{
  "indices": "index-name-2019.01.0*",
  "ignore_unavailable": true,
  "include_global_state": true
}
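
For context, the repository itself has to be registered before snapshots can be taken. A minimal sketch of such a registration, assuming a shared-filesystem repository; the type and location below are placeholders, not taken from the thread, and the path must also be whitelisted via path.repo in elasticsearch.yml:

PUT /_snapshot/index_backups
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups/index_backups"
  }
}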

The snapshot activity was successful, and I received the _cat/snapshots response below (10 indices, 50 of 50 shards successful, 0 failed).

snapshot_jan19_index-name SUCCESS 1550560280 23:11:20 1550560301 23:11:41 21.3s 10 50 0 50
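
A more detailed view of a completed snapshot is also available from the snapshot info API; a sketch:

GET /_snapshot/index_backups/snapshot_jan19_index-name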

Since the snapshot was successful, I tried deleting one index manually so I could test the restore operation:

DELETE index-name-2019.01.11

I then tried restoring the index, and it was successful again:

POST /_snapshot/index_backups/snapshot_jan19_index-name/_restore
{
  "indices": "index-name-2019.01.11"
}

The restore also reports success, but I am unable to get the status of the restored index yet; instead I am getting the error message below.

GET /index-name-2019.01.11/_count

{
  "error": {
    "root_cause": [],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": []
  },
  "status": 503
}

Thanks, that is very helpful.

Restores are, by default, asynchronous operations: "success" means the cluster has successfully started to restore the snapshot, not that it has completed. If you want to wait for the restore to complete, set the wait_for_completion parameter:

POST /_snapshot/index_backups/snapshot_jan19_index-name/_restore?wait_for_completion=true
{
  "indices": "index-name-2019.01.11"
}

Additionally, you can monitor the progress of the restore operation using the indices recovery API, or wait for all the shards to be allocated using the cluster health API:

GET /_cluster/health?wait_for_status=green
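
For the indices recovery API, a sketch against the index from this thread:

GET /index-name-2019.01.11/_recovery?human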

Hi Turner,
Thanks for the information. I understand that the snapshot can only be restored when the cluster status is green. But I have a problem here: my cluster status has been showing red for a long time, and these snapshots are the little baby steps I have taken towards removing old and unwanted indices so I can save some space on the allocated servers. I also see unassigned shards coming up more frequently, which I believe is the cause of the cluster being red.

GET /_cluster/health

{
  "cluster_name": "production_cluster",
  "status": "red",
  "timed_out": false,
  "number_of_nodes": 5,
  "number_of_data_nodes": 3,
  "active_primary_shards": 5349,
  "active_shards": 10520,
  "relocating_shards": 0,
  "initializing_shards": 0,
  _**"unassigned_shards": 189,**_
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 98.23512933046969
}

You have over 10,000 active shards in this 3-node cluster. This is far too many and will cause you problems. Here is an article that gives more detail:

I don't think this is the case. It should be possible to restore a snapshot whatever your cluster health (as long as it has a master, of course).

Indeed, a health of red means there are unassigned primary shards. To find the shards that are unassigned you can use GET _cat/shards, and then you can investigate why they are unassigned using the allocation explain API.
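
A sketch of both calls; the optional column list on _cat/shards makes the unassigned shards easier to spot:

GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state

# with an empty body, explains the first unassigned shard it finds
GET /_cluster/allocation/explain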

Hi Turner,
Thanks for pointing out the problem of accumulating indices within the 3-node production cluster; I am trying to look into possible resolution steps for the same.

I am planning to reduce the number of shards to 3 for all indices, and will keep only one replica.

But the current status of my cluster seems to be red because of these many shards. Considering this, what would be the best option to take this forward:

shrinking indices, or reindexing?
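
For reference, the shrink route would look something like the sketch below on one of the January indices: the source index must first be made read-only and fully allocated to a single node (here one of the data nodes from the _cat/nodes output above; the target index name is hypothetical):

# step 1: relocate all shards to one node and block writes
PUT /index-name-2019.01.11/_settings
{
  "index.routing.allocation.require._name": "prodcution_data_Server1",
  "index.blocks.write": true
}

# step 2: shrink 5 primaries down to 1
POST /index-name-2019.01.11/_shrink/index-name-2019.01.11-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}

Once the shrink completes, the allocation requirement and write block should be removed from the new index.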

Also, since I have snapshot backups of all January indices, can I go ahead and delete them to reduce the number of shards? I am asking this because I haven't tested the restore activity yet.

And finally, can I move the contents of the snapshot repository to another backup server completely?
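
On that last point, a repository's contents can be copied to another server and registered there; a sketch, with a hypothetical name and path, registering the copy as read-only so it cannot be corrupted by accidental writes:

PUT /_snapshot/index_backups_copy
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backup_server/index_backups",
    "readonly": true
  }
}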

It depends on why your cluster is currently red, hence...

... can you explain the reasoning behind that conclusion?

Why not 1 shard? How big is each index? Are these daily indices? If so, can you move to weekly or monthly ones instead?

Probably wise to test that restores work first. You can also just close some unneeded indices for now while you get your cluster under control.
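
Closing indices is a one-liner; a sketch using the January pattern from earlier in the thread:

POST /index-name-2019.01.*/_close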

Hi David,
After a long struggle, I have only been able to reach a state where my cluster status has turned yellow, and there are still too many open shards. I will consider your valuable suggestion of changing all the daily indices into weekly ones. As I start doing that, how can I change their default shard allocation to 1 primary shard, along with 1 replica?

It's really hard to help here because you're not sharing much information and are possibly jumping to conclusions. A yellow health means there are unassigned shards, but this isn't normally because there are too many open shards. As I said above:

The number of shards comes from the index settings, which probably come from an index template. To adjust the default number of shards, change the index template.
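
A minimal sketch of such a template; the template name and index pattern are hypothetical, and this uses the legacy _template syntax matching the 6.x-era cluster implied by the default of 5 shards. Note that a template only affects newly created indices, not existing ones:

PUT /_template/daily_indices
{
  "index_patterns": ["index-name-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}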
