Elastic Replica and Unassigned Shards

Hi

First time posting here. I've started managing a recently created on-prem cluster; so far I'm having "fun" with all the issues popping up, but I'm far from mastering all the concepts.

This cluster has a hot-warm-cold architecture and we have 5 data nodes: 2 hot, 2 warm and 1 cold.

One of our indices is turning the cluster yellow, and I've tried many approaches after reading the forums here and various blogs.

ILM is set to 0 replicas in the cold phase (we only have one cold node), so I wonder if this is still the issue, since the index is not moving at all?

ILM for this index is set up much like the default: hot -> 3 days -> warm -> 30 days -> cold -> 90 days -> delete.

If I check the cat shards API, I can see which shards are having issues and making our cluster yellow:

GET /_cat/shards?h=index,shard,prirep,state,unassigned.reason

.ds-logs-test-2024.01.18-000053                                        0 p STARTED    
.ds-logs-test-2024.01.18-000053                                        0 r UNASSIGNED PRIMARY_FAILED

.ds-logs-test-2024.01.10-000052                                        0 p STARTED    
.ds-logs-test-2024.01.10-000052                                        0 r UNASSIGNED PRIMARY_FAILED

Like I said, we have 0 replicas configured in ILM for the cold phase, since we only have 1 node for cold. Checking the ILM explanation:

GET .ds-logs-test-2024.01.10-000052/_ilm/explain?human

{
  "indices" : {
    ".ds-logs-test-2024.01.10-000052" : {
      "index" : ".ds-logs-test-2024.01.10-000052",
      "managed" : true,
      "policy" : "logs-test",
      "lifecycle_date" : "2024-01-18T12:41:56.854Z",
      "lifecycle_date_millis" : 1705581716854,
      "age" : "29.17d",
      "phase" : "cold",
      "phase_time" : "2024-01-19T12:51:56.796Z",
      "phase_time_millis" : 1705668716796,
      "action" : "migrate",
      "action_time" : "2024-01-19T12:51:56.996Z",
      "action_time_millis" : 1705668716996,
      "step" : "check-migration",
      "step_time" : "2024-01-19T12:51:57.196Z",
      "step_time_millis" : 1705668717196,
      "step_info" : {
        "message" : "Waiting for all shard copies to be active",
        "shards_left_to_allocate" : -1,
        "all_shards_active" : false,
        "number_of_replicas" : 1
      },
      "phase_execution" : {
        "policy" : "logs-test",
        "phase_definition" : {
          "min_age" : "1d",
          "actions" : {
            "set_priority" : {
              "priority" : 0
            }
          }
        },
        "version" : 5,
        "modified_date" : "2023-08-24T16:31:35.792Z",
        "modified_date_in_millis" : 1692894695792
      }
    }
  }
}

Anything else I can look at? Or how can I solve this?

Hi @ntex, welcome to the community!

What Version?

What are the index settings when you run this?

GET .ds-logs-test-2024.01.10-000052

Can you rerun that with the node column? Is it on the cold node, or are the shards still stuck on the warm nodes?

GET /_cat/shards?h=index,node,shard,prirep,state,unassigned.reason

GET _cluster/allocation/explain
{
  "index": ".ds-logs-test-2024.01.10-000052"
}

Also, do any shards show up on the cold node?

Share...

GET _ilm/policy/yourpolicy

GET _cat/nodes

For this I'm interested in the roles and attributes sections.
GET _nodes/settings

For example, find the cold node:

     "roles": [
        "data_content",
        "data_hot",
        "ingest",
        "master",
        "remote_cluster_client",
        "transform"
      ],
      "attributes": {
        "availability_zone": "us-west1-a",
        "instance_configuration": "gcp.es.datahot.n2.68x16x45",
        "region": "unknown-region",
        "server_name": "instance-00asfdasdfsadfasbd74c467",
        "transform.config_version": "10.0.0",
        "xpack.installed": "true",
        "ml.config_version": "12.0.0",
        "data": "hot",
        "logical_availability_zone": "zone-2"
      },

Hi @stephenb, much appreciated! Thanks!!

Here is the extra information you requested.
We are at version 8.0.0, but we plan to upgrade at least to 8.9; we are not sure yet whether we can/should go to 8.12 due to some APM changes.
(We also plan to update the OS to Ubuntu 22.04.6 LTS afterwards.)

GET .ds-logs-test-2024.01.10-000052

{
  ".ds-logs-test-2024.01.10-000052" : {
    "aliases" : { },
    "mappings" : {
      "dynamic" : "true",
      "_data_stream_timestamp" : {
        "enabled" : true
      },
      "dynamic_date_formats" : [
        "strict_date_optional_time",
        "yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"
      ],
      "dynamic_templates" : [
        {
          "match_ip" : {
            "match" : "ip",
            "match_mapping_type" : "string",
            "mapping" : {
              "type" : "ip"
            }
          }
        },
        {
          "match_message" : {
            "match" : "message",
            "match_mapping_type" : "string",
            "mapping" : {
              "type" : "match_only_text"
            }
          }
        },
        {
          "strings_as_keyword" : {
            "match_mapping_type" : "string",
            "mapping" : {
              "ignore_above" : 1024,
              "type" : "keyword"
            }
          }
        }
      ],
      "date_detection" : false,
      "numeric_detection" : false,
      "properties" : {
        "@timestamp" : {
          "type" : "date"
        },
        "@version" : {
          "type" : "keyword",
          "ignore_above" : 1024
        },
        "data_stream" : {
          "properties" : {
            "dataset" : {
              "type" : "constant_keyword",
              "value" : "generic"
            },
            "namespace" : {
              "type" : "constant_keyword",
              "value" : "default"
            },
            "type" : {
              "type" : "constant_keyword",
              "value" : "logs"
            }
          }
        },
        "ecs" : {
          "properties" : {
            "version" : {
              "type" : "keyword",
              "ignore_above" : 1024
            }
          }
        },
        "event" : {
          "properties" : {
            "original" : {
              "type" : "keyword",
              "ignore_above" : 1024
            }
          }
        },
        "host" : {
          "type" : "object"
        },
        "message" : {
          "type" : "match_only_text"
        }
      }
    },
    "settings" : {
      "index" : {
        "lifecycle" : {
          "name" : "logs-generic",
          "indexing_complete" : "true"
        },
        "codec" : "best_compression",
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_cold,data_warm,data_hot"
            }
          }
        },
        "hidden" : "true",
        "number_of_shards" : "1",
        "provided_name" : ".ds-logs-test-2024.01.10-000052",
        "query" : {
          "default_field" : [
            "message"
          ]
        },
        "creation_date" : "1704910916910",
        "priority" : "0",
        "number_of_replicas" : "1",
        "uuid" : "GxucrpOARk-AiCOJOYgTtA",
        "version" : {
          "created" : "8000099"
        }
      }
    },
    "data_stream" : "logs-test-default"
  }
}

I forgot to include this in the initial post. This result is one of the reasons that led me to wonder whether the replicas = 0 setting in the cold phase is being ignored:

GET /_cat/shards?h=index,node,shard,prirep,state,unassigned.reason

.ds-logs-test-2024.01.10-000052                                        coldnode01 0 p STARTED    
.ds-logs-test-2024.01.10-000052                                                0 r UNASSIGNED PRIMARY_FAILED
...
.ds-logs-test-2024.01.18-000053                                        coldnode01 0 p STARTED    
.ds-logs-test-2024.01.18-000053                                                0 r UNASSIGNED PRIMARY_FAILED

This is the policy, pretty much what I described in the initial post:

GET _ilm/policy/logs-generic

{
  "logs-generic" : {
    "version" : 6,
    "modified_date" : "2024-02-01T11:49:48.384Z",
    "policy" : {
      "phases" : {
        "warm" : {
          "min_age" : "3d",
          "actions" : {
            "forcemerge" : {
              "max_num_segments" : 1
            },
            "set_priority" : {
              "priority" : 50
            }
          }
        },
        "cold" : {
          "min_age" : "30d",
          "actions" : {
            "allocate" : {
              "number_of_replicas" : 0,
              "include" : { },
              "exclude" : { },
              "require" : { }
            },
            "set_priority" : {
              "priority" : 0
            }
          }
        },
        "hot" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_primary_shard_size" : "50gb",
              "max_age" : "30d"
            },
            "set_priority" : {
              "priority" : 100
            }
          }
        },
        "delete" : {
          "min_age" : "90d",
          "actions" : {
            "delete" : {
              "delete_searchable_snapshot" : true
            }
          }
        }
      }
    },
    "in_use_by" : {
      "indices" : [
        ".ds-logs-test-2024.02.07-000056",
        ".ds-logs-test-2024.01.31-000055",
        ".ds-logs-test-2024.01.18-000053",
        ".ds-logs-test-2024.01.10-000052",
        ".ds-logs-test-2024.01.24-000054"
      ],
      "data_streams" : [
        "logs-generic-default"
      ],
      "composable_templates" : [
        "logs-generic"
      ]
    }
  }
}

All the nodes in the orion cluster:

GET _cat/nodes

10.11.23.67 38 99 5 0.14 0.15 0.10 irt - ingnode
10.11.23.50 57 98 0 0.00 0.00 0.00 m   - master03
10.11.23.60 41 90 3 0.52 0.45 0.45 hs  - hotnode01
10.11.23.56 52 88 2 0.98 0.50 0.41 hs  - hotnode02
10.11.23.63  7 85 0 0.00 0.00 0.00 w   - warmnode02
10.11.23.55 39 96 0 0.00 0.00 0.00 -   - contrnode
10.11.23.61 22 84 0 0.00 0.02 0.00 w   - warmnode01
10.11.23.52 12 96 0 0.00 0.00 0.00 mv  - master02
10.11.23.62 33 98 2 0.05 0.06 0.05 c   - coldnode1
10.11.23.51 41 98 0 0.00 0.00 0.00 m   * master01

These are the node settings, and this made me realize that the cold node is on version 8.0.1, which it shouldn't be (I only kept the data nodes' settings to keep this shorter; if you need the other nodes' settings, give me a heads-up):

GET _nodes/settings

{
  "_nodes" : {
    "total" : 10,
    "successful" : 10,
    "failed" : 0
  },
...
    "z6uFC0iZQ_ifjX8j3OVe8w" : {
      "name" : "hotnode02",
      "transport_address" : "10.11.23.56:9300",
      "host" : "10.11.23.56",
      "ip" : "10.11.23.56",
      "version" : "8.0.0",
      "build_flavor" : "default",
      "build_type" : "deb",
      "build_hash" : "1b6a7ece17463df5ff54a3e1302d825889aa1161",
      "roles" : [
        "data_content",
        "data_hot"
      ],
      "attributes" : {
        "xpack.installed" : "true"
      },
      "settings" : {
        "cluster" : {
          "name" : "orion-cluster",
          "election" : {
            "strategy" : "supports_voting_only"
          }
        },
        "node" : {
          "name" : "hotnode02",
          "pidfile" : "/var/run/elasticsearch/elasticsearch.pid",
          "attr" : {
            "xpack" : {
              "installed" : "true"
            }
          },
          "roles" : [
            "data_hot",
            "data_content"
          ]
        },
        "path" : {
          "data" : "/elk/lib/elasticsearch",
          "logs" : "/elk/log/elasticsearch",
          "home" : "/usr/share/elasticsearch"
        },
        "discovery" : {
          "seed_hosts" : [
            "master02",
            "master03",
            "master01"
          ]
        },
        "client" : {
          "type" : "node"
        },
        "http" : {
          "host" : [
            "_local_",
            "_site_"
          ],
          "compression" : "false",
          "type" : "security4",
          "port" : "9200",
          "type.default" : "netty4"
        },
        "transport" : {
          "type" : "security4",
          "type.default" : "netty4"
        },
        "xpack" : {
          "security" : {
            "http" : {
              "ssl" : {
                "enabled" : "true"
              }
            },
            "transport" : {
              "ssl" : {
                "enabled" : "true"
              }
            },
            "enabled" : "true",
            "enrollment" : {
              "enabled" : "true"
            }
          }
        },
        "network" : {
          "host" : "0.0.0.0"
        }
      }
    },
...
    "EdabKikrQY-sQQpaWnWw6Q" : {
      "name" : "warmnode01",
      "transport_address" : "10.11.23.61:9300",
      "host" : "10.11.23.61",
      "ip" : "10.11.23.61",
      "version" : "8.0.0",
      "build_flavor" : "default",
      "build_type" : "deb",
      "build_hash" : "1b6a7ece17463df5ff54a3e1302d825889aa1161",
      "roles" : [
        "data_warm"
      ],
      "attributes" : {
        "xpack.installed" : "true"
      },
      "settings" : {
        "cluster" : {
          "name" : "orion-cluster",
          "election" : {
            "strategy" : "supports_voting_only"
          }
        },
        "node" : {
          "name" : "warmnode01",
          "pidfile" : "/var/run/elasticsearch/elasticsearch.pid",
          "attr" : {
            "xpack" : {
              "installed" : "true"
            }
          },
          "roles" : [
            "data_warm"
          ]
        },
        "path" : {
          "data" : "/elk/lib/elasticsearch",
          "logs" : "/elk/log/elasticsearch",
          "home" : "/usr/share/elasticsearch"
        },
        "discovery" : {
          "seed_hosts" : [
            "master02",
            "master03",
            "master01"
          ]
        },
        "client" : {
          "type" : "node"
        },
        "http" : {
          "host" : [
            "_local_",
            "_site_"
          ],
          "compression" : "false",
          "type" : "security4",
          "port" : "9200",
          "type.default" : "netty4"
        },
        "transport" : {
          "type" : "security4",
          "type.default" : "netty4"
        },
        "xpack" : {
          "security" : {
            "http" : {
              "ssl" : {
                "enabled" : "true"
              }
            },
            "transport" : {
              "ssl" : {
                "enabled" : "true"
              }
            },
            "enabled" : "true",
            "enrollment" : {
              "enabled" : "true"
            }
          }
        },
        "network" : {
          "host" : "0.0.0.0"
        }
      }
    },
    "Z7f4OHCXSyWtalnyiH4DcA" : {
      "name" : "warmnode02",
      "transport_address" : "10.11.23.63:9300",
      "host" : "10.11.23.63",
      "ip" : "10.11.23.63",
      "version" : "8.0.0",
      "build_flavor" : "default",
      "build_type" : "deb",
      "build_hash" : "1b6a7ece17463df5ff54a3e1302d825889aa1161",
      "roles" : [
        "data_warm"
      ],
      "attributes" : {
        "xpack.installed" : "true"
      },
      "settings" : {
        "cluster" : {
          "name" : "orion-cluster",
          "election" : {
            "strategy" : "supports_voting_only"
          }
        },
        "node" : {
          "name" : "warmnode02",
          "pidfile" : "/var/run/elasticsearch/elasticsearch.pid",
          "attr" : {
            "xpack" : {
              "installed" : "true"
            }
          },
          "roles" : [
            "data_warm"
          ]
        },
        "path" : {
          "data" : "/elk/lib/elasticsearch",
          "logs" : "/elk/log/elasticsearch",
          "home" : "/usr/share/elasticsearch"
        },
        "discovery" : {
          "seed_hosts" : [
            "master02",
            "master03",
            "master01"
          ]
        },
        "client" : {
          "type" : "node"
        },
        "http" : {
          "host" : [
            "_local_",
            "_site_"
          ],
          "compression" : "false",
          "type" : "security4",
          "port" : "9200",
          "type.default" : "netty4"
        },
        "transport" : {
          "type" : "security4",
          "type.default" : "netty4"
        },
        "xpack" : {
          "security" : {
            "http" : {
              "ssl" : {
                "enabled" : "true"
              }
            },
            "transport" : {
              "ssl" : {
                "enabled" : "true"
              }
            },
            "enabled" : "true",
            "enrollment" : {
              "enabled" : "true"
            }
          }
        },
        "network" : {
          "host" : "0.0.0.0"
        }
      }
    },
...
    "_Qi6Gp8tQu2nbaQzF8UHbg" : {
      "name" : "hotnode01",
      "transport_address" : "10.11.23.60:9300",
      "host" : "10.11.23.60",
      "ip" : "10.11.23.60",
      "version" : "8.0.0",
      "build_flavor" : "default",
      "build_type" : "deb",
      "build_hash" : "1b6a7ece17463df5ff54a3e1302d825889aa1161",
      "roles" : [
        "data_content",
        "data_hot"
      ],
      "attributes" : {
        "xpack.installed" : "true"
      },
      "settings" : {
        "cluster" : {
          "name" : "orion-cluster",
          "election" : {
            "strategy" : "supports_voting_only"
          }
        },
        "node" : {
          "name" : "hotnode01",
          "pidfile" : "/var/run/elasticsearch/elasticsearch.pid",
          "attr" : {
            "xpack" : {
              "installed" : "true"
            }
          },
          "roles" : [
            "data_hot",
            "data_content"
          ]
        },
        "path" : {
          "data" : "/elk/lib/elasticsearch",
          "logs" : "/elk/log/elasticsearch",
          "home" : "/usr/share/elasticsearch"
        },
        "discovery" : {
          "seed_hosts" : [
            "master02",
            "master03",
            "master01"
          ]
        },
        "client" : {
          "type" : "node"
        },
        "http" : {
          "host" : [
            "_local_",
            "_site_"
          ],
          "compression" : "false",
          "type" : "security4",
          "port" : "9200",
          "type.default" : "netty4"
        },
        "transport" : {
          "type" : "security4",
          "type.default" : "netty4"
        },
        "xpack" : {
          "security" : {
            "http" : {
              "ssl" : {
                "enabled" : "true"
              }
            },
            "transport" : {
              "ssl" : {
                "enabled" : "true"
              }
            },
            "enabled" : "true",
            "enrollment" : {
              "enabled" : "true"
            }
          }
        },
        "network" : {
          "host" : "0.0.0.0"
        }
      }
    },
    "-h6vvlcfQtKkeftVE9z2cw" : {
      "name" : "coldnode01",
      "transport_address" : "10.11.23.62:9300",
      "host" : "10.11.23.62",
      "ip" : "10.11.23.62",
      "version" : "8.0.1",
      "build_flavor" : "default",
      "build_type" : "deb",
      "build_hash" : "801d9ccc7c2ee0f2cb121bbe22ab5af77a902372",
      "roles" : [
        "data_cold"
      ],
      "attributes" : {
        "xpack.installed" : "true"
      },
      "settings" : {
        "cluster" : {
          "name" : "orion-cluster",
          "election" : {
            "strategy" : "supports_voting_only"
          }
        },
        "node" : {
          "name" : "coldnode01",
          "pidfile" : "/var/run/elasticsearch/elasticsearch.pid",
          "attr" : {
            "xpack" : {
              "installed" : "true"
            }
          },
          "roles" : [
            "data_cold"
          ]
        },
        "path" : {
          "data" : "/elk/lib/elasticsearch",
          "logs" : "/elk/log/elasticsearch",
          "home" : "/usr/share/elasticsearch"
        },
        "discovery" : {
          "seed_hosts" : [
            "master02",
            "master03",
            "master01"
          ]
        },
        "client" : {
          "type" : "node"
        },
        "http" : {
          "host" : [
            "_local_",
            "_site_"
          ],
          "compression" : "false",
          "type" : "security4",
          "port" : "9200",
          "type.default" : "netty4"
        },
        "transport" : {
          "type" : "security4",
          "type.default" : "netty4"
        },
        "xpack" : {
          "security" : {
            "http" : {
              "ssl" : {
                "enabled" : "true"
              }
            },
            "transport" : {
              "ssl" : {
                "enabled" : "true"
              }
            },
            "enabled" : "true",
            "enrollment" : {
              "enabled" : "true"
            }
          }
        },
        "network" : {
          "host" : "0.0.0.0"
        }
      }
    }
  }
}

Again, this made me realize that one node is on 8.0.1 instead of all nodes being on 8.0.0.
But since it's the cold node anyway, that shouldn't be a problem, right?
The primary shard was moved there; it's just the replica that is not being deleted, right?

GET /_cluster/allocation/explain
{
  "index": ".ds-logs-test-2024.01.10-000052",
  "shard": 0,
  "primary": true
}

{
  "index" : ".ds-logs-test-2024.01.10-000052",
  "shard" : 0,
  "primary" : true,
  "current_state" : "started",
  "current_node" : {
    "id" : "-h6vvlcfQtKkeftVE9z2cw",
    "name" : "coldnode01",
    "transport_address" : "10.11.23.62:9300",
    "attributes" : {
      "xpack.installed" : "true"
    },
    "weight_ranking" : 1
  },
  "can_remain_on_current_node" : "yes",
  "can_rebalance_cluster" : "no",
  "can_rebalance_cluster_decisions" : [
    {
      "decider" : "rebalance_only_when_active",
      "decision" : "NO",
      "explanation" : "rebalancing is not allowed until all replicas in the cluster are active"
    },
    {
      "decider" : "cluster_rebalance",
      "decision" : "NO",
      "explanation" : "the cluster has unassigned shards and cluster setting [cluster.routing.allocation.allow_rebalance] is set to [indices_all_active]"
    }
  ],
  "can_rebalance_to_other_node" : "no",
  "rebalance_explanation" : "rebalancing is not allowed",
  "node_allocation_decisions" : [
    {
      "node_id" : "EdabKikrQY-sQQpaWnWw6Q",
      "node_name" : "warmnode01",
      "transport_address" : "10.11.23.61:9300",
      "node_attributes" : {
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "weight_ranking" : 2,
      "deciders" : [
        {
          "decider" : "node_version",
          "decision" : "NO",
          "explanation" : "cannot relocate primary shard from a node with version [8.0.1] to a node with older version [8.0.0]"
        },
        {
          "decider" : "data_tier",
          "decision" : "NO",
          "explanation" : "index has a preference for tiers [data_cold,data_warm,data_hot] and node does not meet the required [data_cold] tier"
        }
      ]
    },
    {
      "node_id" : "Z7f4OHCXSyWtalnyiH4DcA",
      "node_name" : "warmnode02",
      "transport_address" : "10.11.23.63:9300",
      "node_attributes" : {
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "weight_ranking" : 3,
      "deciders" : [
        {
          "decider" : "node_version",
          "decision" : "NO",
          "explanation" : "cannot relocate primary shard from a node with version [8.0.1] to a node with older version [8.0.0]"
        },
        {
          "decider" : "data_tier",
          "decision" : "NO",
          "explanation" : "index has a preference for tiers [data_cold,data_warm,data_hot] and node does not meet the required [data_cold] tier"
        }
      ]
    },
    {
      "node_id" : "z6uFC0iZQ_ifjX8j3OVe8w",
      "node_name" : "hotnode02",
      "transport_address" : "10.11.23.56:9300",
      "node_attributes" : {
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "weight_ranking" : 4,
      "deciders" : [
        {
          "decider" : "node_version",
          "decision" : "NO",
          "explanation" : "cannot relocate primary shard from a node with version [8.0.1] to a node with older version [8.0.0]"
        },
        {
          "decider" : "data_tier",
          "decision" : "NO",
          "explanation" : "index has a preference for tiers [data_cold,data_warm,data_hot] and node does not meet the required [data_cold] tier"
        }
      ]
    },
    {
      "node_id" : "_Qi6Gp8tQu2nbaQzF8UHbg",
      "node_name" : "hotnode01",
      "transport_address" : "10.11.23.60:9300",
      "node_attributes" : {
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "weight_ranking" : 5,
      "deciders" : [
        {
          "decider" : "node_version",
          "decision" : "NO",
          "explanation" : "cannot relocate primary shard from a node with version [8.0.1] to a node with older version [8.0.0]"
        },
        {
          "decider" : "data_tier",
          "decision" : "NO",
          "explanation" : "index has a preference for tiers [data_cold,data_warm,data_hot] and node does not meet the required [data_cold] tier"
        }
      ]
    }
  ]
}

Thanks once again for your help!

You ran this only for the primary... I wanted to see the replica.

BUT also, we can see that the index still has 1 replica configured while trying to go to cold.

One thing I see is that the ILM policy is version 6, with:

        "cold" : {
          "min_age" : "30d",
          "actions" : {
            "allocate" : {
              "number_of_replicas" : 0,

But the actual index that is having the issue is on version 5, and that is the version it is currently using in this action / phase... so did version 5 have number_of_replicas 0? If not, that is probably the issue:

"version" : 5,

New ILM policies can be picked up at certain phases / actions, but if the index is already executing an action, it stays on the version it started with.
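
You can confirm which cached phase definition an index is actually executing by filtering the explain output, e.g.:

GET .ds-logs-test-2024.01.10-000052/_ilm/explain?filter_path=indices.*.phase_execution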

So my theory is that you changed the replicas to 0 after it had already started trying to move to cold.

So you can probably fix this by just manually setting the number of replicas to 0 on the "stuck" indices, then keep an eye out and see if subsequent indices work correctly, e.g. with the request below...
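
Something like this (a sketch, substituting the names of your stuck indices):

PUT .ds-logs-test-2024.01.10-000052,.ds-logs-test-2024.01.18-000053/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}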

Give it a try and report back.

Hi @stephenb, indeed. I'm sorry, I didn't notice the primary = true.

Here it is for the replica:

GET /_cluster/allocation/explain
{
  "index": ".ds-logs-test-2024.01.10-000052",
  "shard": 0,
  "primary": false
}

{
  "index" : ".ds-logs-test-2024.01.10-000052",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "PRIMARY_FAILED",
    "at" : "2024-02-14T09:33:09.768Z",
    "details" : "primary failed while replica initializing",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "-h6vvlcfQtKkeftVE9z2cw",
      "node_name" : "coldnode01",
      "transport_address" : "10.11.23.62:9300",
      "node_attributes" : {
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[.ds-logs-test-2024.01.10-000052][0], node[-h6vvlcfQtKkeftVE9z2cw], [P], s[STARTED], a[id=xBmKniNVTZ274tC7rGDJ4A]]"
        }
      ]
    },
    {
      "node_id" : "EdabKikrQY-sQQpaWnWw6Q",
      "node_name" : "warmnode01",
      "transport_address" : "10.11.23.61:9300",
      "node_attributes" : {
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "node_version",
          "decision" : "NO",
          "explanation" : "cannot allocate replica shard to a node with version [8.0.0] since this is older than the primary version [8.0.1]"
        },
        {
          "decider" : "data_tier",
          "decision" : "NO",
          "explanation" : "index has a preference for tiers [data_cold,data_warm,data_hot] and node does not meet the required [data_cold] tier"
        }
      ]
    },
    {
      "node_id" : "Z7f4OHCXSyWtalnyiH4DcA",
      "node_name" : "warmnode02",
      "transport_address" : "10.11.23.63:9300",
      "node_attributes" : {
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "node_version",
          "decision" : "NO",
          "explanation" : "cannot allocate replica shard to a node with version [8.0.0] since this is older than the primary version [8.0.1]"
        },
        {
          "decider" : "data_tier",
          "decision" : "NO",
          "explanation" : "index has a preference for tiers [data_cold,data_warm,data_hot] and node does not meet the required [data_cold] tier"
        }
      ]
    },
    {
      "node_id" : "_Qi6Gp8tQu2nbaQzF8UHbg",
      "node_name" : "hotnode01",
      "transport_address" : "10.11.23.60:9300",
      "node_attributes" : {
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "node_version",
          "decision" : "NO",
          "explanation" : "cannot allocate replica shard to a node with version [8.0.0] since this is older than the primary version [8.0.1]"
        },
        {
          "decider" : "data_tier",
          "decision" : "NO",
          "explanation" : "index has a preference for tiers [data_cold,data_warm,data_hot] and node does not meet the required [data_cold] tier"
        }
      ]
    },
    {
      "node_id" : "z6uFC0iZQ_ifjX8j3OVe8w",
      "node_name" : "hotnode02",
      "transport_address" : "10.11.23.56:9300",
      "node_attributes" : {
        "xpack.installed" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "node_version",
          "decision" : "NO",
          "explanation" : "cannot allocate replica shard to a node with version [8.0.0] since this is older than the primary version [8.0.1]"
        },
        {
          "decider" : "data_tier",
          "decision" : "NO",
          "explanation" : "index has a preference for tiers [data_cold,data_warm,data_hot] and node does not meet the required [data_cold] tier"
        }
      ]
    }
  ]
}

At one point, yes, we did try setting the replicas directly on one of the indices to 0, waited a few hours, and then restored it back to 1. We also removed the ILM policy and then added it back (directly on the problematic indices), roughly as sketched below.
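
Roughly what we ran (from memory, using the standard remove / re-apply steps):

POST .ds-logs-test-2024.01.10-000052/_ilm/remove

PUT .ds-logs-test-2024.01.10-000052/_settings
{
  "index.lifecycle.name": "logs-generic"
}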

I have now set the replicas on those 2 indices to 0.

Do you think I will have the issue again on the next index regardless? I should have another index trying to move to cold in the next 5 days or so: .ds-logs-test-2024.01.24-000054.

Thanks!

I don't know... I would hope not...

You can always create a sample ILM policy with a 1-day rollover going straight to cold, feed in some data, and see if it works...
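
A minimal sketch of such a test policy (hypothetical name, adjust to your setup):

PUT _ilm/policy/logs-cold-test
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_age": "1d"
          }
        }
      },
      "cold": {
        "min_age": "0ms",
        "actions": {
          "allocate": {
            "number_of_replicas": 0
          },
          "set_priority": {
            "priority": 0
          }
        }
      }
    }
  }
}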

Yeah, so that is right, BUT after a bunch of failed allocation attempts the allocator stops trying, and you have to ask it to try again... so set replicas to 0 and then try this, I think:

POST _cluster/reroute?retry_failed=true

Why would you "restore it back to 1"?

OK, I just did that: the indices are green, and the cluster went from 99.7% to 100%, green.

Now, I just realized you mentioned those indices were on version 5, instead of the current version 6.

But I remember that at one point we removed, re-added, and refreshed the policy. Where did I miss that those indices were on version: 5 instead of 6?

Oh yeah, @stephenb, I forgot to add: we did that because the index was initially still on hot, not moving at all.

It is in the ILM explain output, at the bottom... it shows the ILM policy version that is currently being applied to the index...

Ah, I see, so when you wrote version: 5, you were quoting what I had posted earlier.
I thought I had missed that information in the logs I posted, since I had only noticed version: 6.

I will post again in a few days, when the next index has to move from warm to cold.

@stephenb
Over the weekend, all the indices that were supposed to move to cold did so successfully, with no unassigned shards.

So it's safe to assume that there is no longer an issue with these indices.

The only thing I still don't understand is why version 5 was an issue before and not now:
"version" : 5,
"modified_date" : "2023-08-24T16:31:35.792Z",

Most indices that moved were on version 6, but I had a few on version 5 and they had no issues this time around.

Anyway, I'm glad it all worked out in the end, and I appreciate all the help you've provided. :wink:

So this week I will try to upgrade the cluster. I've been reading the documentation on the best path to do it, and it seems to be something like this (per-node sequence sketched after the list)?

Update path:
data nodes > master nodes > coordinator > ingest
cold > warm > hot
replicas > primary
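
From the docs, each node seems to go through roughly this sequence (my sketch, still to be verified against the official rolling upgrade guide):

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}

POST _flush

(stop the node, upgrade the package, restart it, wait for it to rejoin the cluster, then re-enable allocation:)

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}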

Then we can upgrade Ubuntu 20.04 LTS to 22.04 LTS, since we will then be above 8.3.

Here are specific instructions

@stephenb just to give a little feedback in return:
the indices never became an issue again.

We also upgraded our cluster to version 8.9.2 and everything went smoothly.

Thanks for all your help!

