Hello,
I am using Elasticsearch 8.11 and am trying to set up a cluster with ILM for hot/warm/cold tiers.
However, after creating the instance, roles, and users, the system indices get stuck in the INITIALIZING state, and I cannot move forward because the cluster stays yellow.
I am using AWS and Terraform to provision everything. This setup worked for Elasticsearch 7.x but does not work for this version.
I have tried killing the nodes, but when the shard gets reassigned it just gets stuck again.
.ds-ilm-history-5-2024.01.11-000001 0 r STARTED 10.5.33.173 elasticsearch-test-518cc74fc5
.ds-ilm-history-5-2024.01.11-000001 0 p STARTED 10.5.33.78 elasticsearch-test-a15d32e977
.ds-.logs-deprecation.elasticsearch-default-2024.01.11-000001 0 p STARTED 10.5.33.173 elasticsearch-test-518cc74fc5
.ds-.logs-deprecation.elasticsearch-default-2024.01.11-000001 0 r INITIALIZING 10.5.33.33 elasticsearch-omniproductivity-3b90082630
.security-7 0 p STARTED 6 36kb 36kb 10.5.33.185 elasticsearch-test-2afa6c30ff
.security-7 0 r STARTED 6 36kb 36kb 10.5.33.102 elasticsearch-test-0b56dcc882
When I killed the previous node and ran an allocation explain, the new node accepted this replica:
{
"node_id": "qosTMvHnShWwDuwh3sFrOg",
"node_name": "elasticsearch-test-3b90082630",
"transport_address": "10.5.33.33:9300",
"node_attributes": {
"xpack.installed": "true",
"ml.max_jvm_size": "1073741824",
"cold": "true",
"warm": "false",
"ml.allocated_processors": "2",
"ml.machine_memory": "4064817152",
"ml.config_version": "11.0.0",
"aws_availability_zone": "us-east-1a",
"hot": "false",
"ml.allocated_processors_double": "2.0",
"transform.config_version": "10.0.0"
},
"roles": [
"data",
"data_cold",
"data_content",
"data_frozen",
"data_hot",
"data_warm",
"ingest",
"ml",
"remote_cluster_client",
"transform"
],
"node_decision": "yes"
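For reference, the explain output above comes from the cluster allocation explain API (`_cluster/allocation/explain`). A minimal sketch of such a request in Python follows. The helper names `build_explain_body` and `explain_allocation` are hypothetical, the base URL is a placeholder, and authentication and TLS settings are left out, so this is an illustration under those assumptions and not my exact setup.

```python
import json
import urllib.request


def build_explain_body(index: str, shard: int, primary: bool) -> dict:
    """Request body for _cluster/allocation/explain for one shard copy."""
    return {"index": index, "shard": shard, "primary": primary}


def explain_allocation(base_url: str, body: dict) -> dict:
    """POST the body to the allocation explain endpoint and return the parsed JSON.

    base_url is a placeholder (for example "https://localhost:9200").
    Credentials and TLS configuration are omitted here and would have to be
    added for a cluster with security enabled.
    """
    request = urllib.request.Request(
        f"{base_url}/_cluster/allocation/explain",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# The replica that stays in INITIALIZING in the shard listing above.
body = build_explain_body(
    ".ds-.logs-deprecation.elasticsearch-default-2024.01.11-000001", 0, False
)
print(json.dumps(body))
```

Calling `explain_allocation("<cluster URL>", body)` against a reachable node would return the full explanation, including the per-node decisions shown in the excerpt.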
On the nodes I see failure-to-connect messages, although the cluster reports the total number of nodes I provisioned.
Log example from the node that failed to initialize the shard:
{"type": "server", "timestamp": "2024-01-12T11:36:50,304Z", "level": "ERROR", "component": "o.e.x.s.a.e.NativeUsersStore", "cluster.name": "infra-qa-us-east-1-elasticsearch-test", "node.name": "elasticsearch-test-518cc74fc5", "message": "failed to retrieve user [exporter]", "cluster.uuid": "6tlIvIzORe21KZTLT0RsDQ", "node.id": "QTI5JWpgR4qbVkWipmRtRQ" ,
And on another node there is a failure to connect:
"message": "recovery of [.ds-.logs-deprecation.elasticsearch-default-2024.01.11-000001][0] from [{elasticsearch-test-518cc74fc5}{QTI5JWpgR4qbVkWipmRtRQ}{X9jMKU3jRB-zeeETzp1sAw}{elasticsearch-test-518cc74fc5}{10.5.33.173}{10.5.33.173:9300}{cdfhilrstw}{8.11.1}{7000099-8500003}{ml.allocated_processors=2, warm=true, cold=false, xpack.installed=true, ml.max_jvm_size=1073741824, transform.config_version=10.0.0, ml.allocated_processors_double=2.0, hot=false, aws_availability_zone=us-east-1c, ml.machine_memory=4064804864, ml.config_version=11.0.0}] interrupted by network disconnect, will retry in [5s]; cause: [[elasticsearch-test-518cc74fc5][10.5.33.173:9300] Node not connected]",
"cluster.uuid": "6tlIvIzORe21KZTLT0RsDQ",
"node.id": "qosTMvHnShWwDuwh3sFrOg"
}
How can I debug this situation?