Curator - shard has exceeded the maximum number of retries [1]

Good afternoon.

Current configuration:
ES version 6.8.8, running in Docker ("docker.elastic.co/elasticsearch/elasticsearch:6.8.8")
heap_size: 31g
"number_of_nodes" : 3,
"number_of_data_nodes" : 3,
"active_primary_shards" : 1253,
"active_shards" : 2232

Average size per index: ~28 GB.

Curator version 5.8.4

We use Curator to shrink old indices: instead of "3 primaries, 1 replica", the shrunken index ends up as "1 primary, 1 replica".
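
For reference, the relevant part of our Curator action file looks roughly like this (the filter values and wait settings below are illustrative, not our exact ones):

actions:
  1:
    action: shrink
    description: Shrink old indices to a single primary shard
    options:
      shrink_node: DETERMINISTIC
      number_of_shards: 1
      number_of_replicas: 1
      shrink_suffix: '-shrink'
      delete_after: True
      wait_for_completion: True
      wait_interval: 9
      max_wait: -1
    filters:
      - filtertype: pattern
        kind: prefix
        value: example-index-
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y-%m-%d'
        unit: days
        unit_count: 30
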
The shrink itself starts fine: Curator creates a copy of the old index with the suffix "-shrink", then creates the primary shard and allocates it successfully. But when it tries to allocate the replica shard, we get this error (output of GET _cluster/allocation/explain):

{
  "index" : "example-index-2021-09-29-shrink",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2021-11-23T12:26:19.515Z",
    "failed_allocation_attempts" : 1,
    "details" : "failed shard on node [8r_zhRD4RDm2peWnDun_3w]: failed recovery, failure RecoveryFailedException[[example-index-2021-09-29-shrink][0]: Recovery failed from {node15}{nWOPSov3TFKUunoiooVxMQ}{PSAfiXvZQx-NLyKpnXGs1A}{192.168.0.164}{192.168.0.164:9300}{ml.machine_memory=135291469824, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} into {node13}{8r_zhRD4RDm2peWnDun_3w}{KU0HhEPMQ_ilSV3RCe4XNw}{192.168.0.162}{192.168.0.162:9300}{ml.machine_memory=135291469824, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}]; nested: RemoteTransportException[[node15][172.17.0.3:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [85] files with total size of [24.8gb]]; nested: ReceiveTimeoutTransportException[[node13][192.168.0.162:9300][internal:index/shard/recovery/file_chunk] request_id [1586168734] timed out after [899897ms]]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "8r_zhRD4RDm2peWnDun_3w",
      "node_name" : "node13",
      "transport_address" : "192.168.0.162:9300",
      "node_attributes" : {
        "ml.machine_memory" : "135291469824",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20",
        "ml.enabled" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [1] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2021-11-23T12:26:19.515Z], failed_attempts[1], delayed=false, details[failed shard on node [8r_zhRD4RDm2peWnDun_3w]: failed recovery, failure RecoveryFailedException[[example-index-2021-09-29-shrink][0]: Recovery failed from {node15}{nWOPSov3TFKUunoiooVxMQ}{PSAfiXvZQx-NLyKpnXGs1A}{192.168.0.164}{192.168.0.164:9300}{ml.machine_memory=135291469824, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} into {node13}{8r_zhRD4RDm2peWnDun_3w}{KU0HhEPMQ_ilSV3RCe4XNw}{192.168.0.162}{192.168.0.162:9300}{ml.machine_memory=135291469824, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}]; nested: RemoteTransportException[[node15][172.17.0.3:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [85] files with total size of [24.8gb]]; nested: ReceiveTimeoutTransportException[[node13][192.168.0.162:9300][internal:index/shard/recovery/file_chunk] request_id [1586168734] timed out after [899897ms]]; ], allocation_status[no_attempt]]]"
        }
      ]
    }
  ]
}

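As a workaround we can retry the failed allocation manually, exactly as the error message suggests:

POST /_cluster/reroute?retry_failed=true

But we'd rather not do that by hand for every shrunken index, so we want the automatic retries to keep going longer.
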
To raise the retry limit, I've tried to create an index template like this:

 "shrink" : {
    "order" : 0,
    "index_patterns" : [
      "*-shrink"
    ],
    "settings" : {
      "index" : {
        "allocation" : {
          "max_retries" : "5"
        }
      }
    }
  }

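For clarity, the template was created with a request along these lines (the JSON above is the GET /_template/shrink output):

PUT /_template/shrink
{
  "order": 0,
  "index_patterns": ["*-shrink"],
  "settings": {
    "index.allocation.max_retries": "5"
  }
}
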
But it doesn't work: after a successful shrink, the new index still has max_retries set to 1. Here are the index settings:

GET /example-index-shrink/_settings

{
  "example-index-shrink" : {
    "settings" : {
      "index" : {
        "allocation" : {
          "max_retries" : "1"
        },
        "shrink" : {
          "source" : {
            "name" : "example-index",
            "uuid" : "mecKKzDDTzu77ViMv5N3EA"
          }
        },
        "blocks" : {
          "write" : null
        },
        "provided_name" : "example-index-shrink",
        "creation_date" : "1637751350836",
        "number_of_replicas" : "1",
        "uuid" : "MI_wbW35R8ubkYZOySfp1g",
        "version" : {
          "created" : "6080899",
          "upgraded" : "6080899"
        },
        "codec" : "best_compression",
        "routing" : {
          "allocation" : {
            "initial_recovery" : {
              "_id" : "nWOPSov3TFKUunoiooVxMQ"
            },
            "require" : {
              "_name" : null
            }
          }
        },
        "number_of_shards" : "1",
        "routing_partition_size" : "1",
        "resize" : {
          "source" : {
            "name" : "example-index",
            "uuid" : "mecKKzDDTzu77ViMv5N3EA"
          }
        }
      }
    }
  }
}

How can I change the index.allocation.max_retries value for indices created by the shrink action? I can't see that setting anywhere among the Curator action file options.
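
The closest thing I can find is the shrink action's extra_settings option, which applies extra settings to the target index. Would something like this in the options block work? (Just a guess on my part; I haven't verified that max_retries can be passed this way.)

      extra_settings:
        settings:
          index.allocation.max_retries: 5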

Thanks in advance
