System indices stuck in INITIALIZING state

Hello,
I am using Elasticsearch 8.11 and am trying to set up a cluster with ILM for hot/warm/cold tiers.

However, after creating the instance, roles, and users, the system indices get stuck in the INITIALIZING state, and I cannot move forward from there because my cluster is always yellow.
I am using AWS and Terraform to provision everything. This setup worked for Elasticsearch 7.x but not for this version.
I have tried killing the nodes, but when the shard gets reassigned it just gets stuck again.

.ds-ilm-history-5-2024.01.11-000001                           0 r STARTED                10.5.33.173 elasticsearch-test-518cc74fc5
.ds-ilm-history-5-2024.01.11-000001                           0 p STARTED                10.5.33.78  elasticsearch-test-a15d32e977
.ds-.logs-deprecation.elasticsearch-default-2024.01.11-000001 0 p STARTED                10.5.33.173 elasticsearch-test-518cc74fc5
.ds-.logs-deprecation.elasticsearch-default-2024.01.11-000001 0 r INITIALIZING           10.5.33.33  elasticsearch-omniproductivity-3b90082630
.security-7                                                   0 p STARTED    6 36kb 36kb 10.5.33.185 elasticsearch-test-2afa6c30ff
.security-7                                                   0 r STARTED    6 36kb 36kb 10.5.33.102 elasticsearch-test-0b56dcc882
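
A minimal sketch of how a listing like the one above could be filtered for shards that are not STARTED. It assumes the default `_cat/shards` column order (index, shard, prirep, state, then optional columns ending in the node name); the function name is hypothetical.

```python
def find_stuck_shards(cat_shards_text):
    """Return (index, shard, prirep, state, node) for every shard that is not STARTED."""
    stuck = []
    for line in cat_shards_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or malformed line
        index, shard, prirep, state = fields[:4]
        if index == "index" and state == "state":
            continue  # header row, present when the listing was requested with ?v
        if state != "STARTED":
            # Unassigned shards have no ip/node columns, so node may be missing.
            node = fields[-1] if len(fields) > 4 else None
            stuck.append((index, shard, prirep, state, node))
    return stuck


# Two rows copied from the listing in this post.
SAMPLE = """\
.ds-.logs-deprecation.elasticsearch-default-2024.01.11-000001 0 p STARTED      10.5.33.173 elasticsearch-test-518cc74fc5
.ds-.logs-deprecation.elasticsearch-default-2024.01.11-000001 0 r INITIALIZING 10.5.33.33  elasticsearch-omniproductivity-3b90082630
"""

stuck = find_stuck_shards(SAMPLE)
for row in stuck:
    print(row)
```

Only the replica on 10.5.33.33 is reported, which matches the single INITIALIZING row in the listing.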

When I killed the previous node and ran an allocation explain, the new node accepted this replica:

    {
      "node_id": "qosTMvHnShWwDuwh3sFrOg",
      "node_name": "elasticsearch-test-3b90082630",
      "transport_address": "10.5.33.33:9300",
      "node_attributes": {
        "xpack.installed": "true",
        "ml.max_jvm_size": "1073741824",
        "cold": "true",
        "warm": "false",
        "ml.allocated_processors": "2",
        "ml.machine_memory": "4064817152",
        "ml.config_version": "11.0.0",
        "aws_availability_zone": "us-east-1a",
        "hot": "false",
        "ml.allocated_processors_double": "2.0",
        "transform.config_version": "10.0.0"
      },
      "roles": [
        "data",
        "data_cold",
        "data_content",
        "data_frozen",
        "data_hot",
        "data_warm",
        "ingest",
        "ml",
        "remote_cluster_client",
        "transform"
      ],
      "node_decision": "yes"
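
For context, output like the above comes from the cluster allocation explain API. A minimal sketch of building that request for the stuck replica with the Python standard library follows. The base URL is a placeholder, the function name is hypothetical, and authentication and TLS settings, which a secured 8.x cluster would need, are left out.

```python
import json
import urllib.request


def build_explain_request(base_url, index, shard, primary):
    """Build (but do not send) a cluster allocation explain request."""
    body = {"index": index, "shard": shard, "primary": primary}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/allocation/explain",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL (assumption); index and shard are taken from the listing in this post.
request = build_explain_request(
    "https://localhost:9200",
    ".ds-.logs-deprecation.elasticsearch-default-2024.01.11-000001",
    shard=0,
    primary=False,  # ask about the replica, not the primary
)
# urllib.request.urlopen(request) would send it once credentials are added.
```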

The nodes log failure-to-connect messages, although the cluster reports the total number of nodes I provisioned.

Log example from the node that failed to initialize the shard:

{"type": "server", "timestamp": "2024-01-12T11:36:50,304Z", "level": "ERROR", "component": "o.e.x.s.a.e.NativeUsersStore", "cluster.name": "infra-qa-us-east-1-elasticsearch-test", "node.name": "elasticsearch-test-518cc74fc5", "message": "failed to retrieve user [exporter]", "cluster.uuid": "6tlIvIzORe21KZTLT0RsDQ", "node.id": "QTI5JWpgR4qbVkWipmRtRQ" ,

And on another node there is a failure to connect:

  "message": "recovery of [.ds-.logs-deprecation.elasticsearch-default-2024.01.11-000001][0] from [{elasticsearch-test-518cc74fc5}{QTI5JWpgR4qbVkWipmRtRQ}{X9jMKU3jRB-zeeETzp1sAw}{elasticsearch-test-518cc74fc5}{10.5.33.173}{10.5.33.173:9300}{cdfhilrstw}{8.11.1}{7000099-8500003}{ml.allocated_processors=2, warm=true, cold=false, xpack.installed=true, ml.max_jvm_size=1073741824, transform.config_version=10.0.0, ml.allocated_processors_double=2.0, hot=false, aws_availability_zone=us-east-1c, ml.machine_memory=4064804864, ml.config_version=11.0.0}] interrupted by network disconnect, will retry in [5s]; cause: [[elasticsearch-test-518cc74fc5][10.5.33.173:9300] Node not connected]",
  "cluster.uuid": "6tlIvIzORe21KZTLT0RsDQ",
  "node.id": "qosTMvHnShWwDuwh3sFrOg"
}
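
Since the recovery is reported as interrupted by `Node not connected`, one basic check is whether the transport port (9300 in the log above) is reachable from the node holding the stuck replica. A minimal sketch, assuming Python is available on the node; the function name is hypothetical, and a successful TCP connect does not prove that the Elasticsearch transport handshake itself succeeds.

```python
import socket


def transport_reachable(host, port=9300, timeout=3.0):
    """Return True if a plain TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run on 10.5.33.33, `transport_reachable("10.5.33.173")` would show whether the replica's node can open a TCP connection to the node holding the primary. Running the same check in the opposite direction helps rule out a one-way firewall or security group rule.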

How can I debug this situation?
