Unable to initialize central management for Elastic Agents

I am new to Elasticsearch and have been trying out Fleet and Elastic Agent. Everything was working fine initially; I managed to pipe logs into ES, etc. But now, when I try to navigate to the Fleet page, I get this error:

Unable to initialize central management for Elastic Agents
search_phase_execution_exception: [no_shard_available_action_exception] Reason: null

I ran this command:

GET _cat/shards?v=true&h=index,shard,prirep,state,node,unassigned.reason&s=state

Result:

index                                                             shard prirep state      node   unassigned.reason
.ds-.fleet-actions-results-2022.05.24-000001                      0     r      UNASSIGNED        CLUSTER_RECOVERED
.fleet-policies-leader-7                                          0     p      UNASSIGNED        ALLOCATION_FAILED
.fleet-servers-7                                                  0     p      UNASSIGNED        ALLOCATION_FAILED
.ds-.fleet-actions-results-2022.06.25-000002                      0     r      UNASSIGNED        CLUSTER_RECOVERED
.fleet-actions-7                                                  0     p      STARTED    node-1 
.ds-logs-windows.powershell-default-2022.05.24-000001             0     p      STARTED    node-1 
.ds-metrics-windows.perfmon-default-2022.05.24-000001             0     p      STARTED    node-1 
.metrics-endpoint.metadata_united_default                         0     p      STARTED    node-1 
.kibana-event-log-7.17.3-000001                                   0     p      STARTED    node-1 
.kibana_7.17.3_001                                                0     p      STARTED    node-1 
.ds-metrics-elastic_agent.filebeat-default-2022.05.24-000001      0     p      STARTED    node-1 
.transform-internal-007                                           0     p      STARTED    node-1 
.lists-default-000001                                             0     p      STARTED    node-1 
.ds-logs-elastic_agent-default-2022.06.25-000002                  0     p      STARTED    node-1 
.ds-.logs-deprecation.elasticsearch-default-2022.06.21-000002     0     p      STARTED    node-1 
.ds-logs-windows.powershell-default-2022.06.25-000002             0     p      STARTED    node-1 
.ds-metrics-elastic_agent.elastic_agent-default-2022.05.24-000001 0     p      STARTED    node-1 
.kibana_task_manager_7.17.3_001                                   0     p      STARTED    node-1 
.apm-agent-configuration                                          0     p      STARTED    node-1 
.ds-.logs-deprecation.elasticsearch-default-2022.05.22-000001     0     p      STARTED    node-1 
metrics-endpoint.metadata_current_default                         0     p      STARTED    node-1 
.ds-metrics-elastic_agent.filebeat-default-2022.06.25-000002      0     p      STARTED    node-1 
.ds-metrics-windows.service-default-2022.06.25-000002             0     p      STARTED    node-1 
.fleet-enrollment-api-keys-7                                      0     p      STARTED    node-1 
.ds-logs-elastic_agent.fleet_server-default-2022.06.25-000002     0     p      STARTED    node-1 
.ds-logs-elastic_agent.metricbeat-default-2022.06.25-000002       0     p      STARTED    node-1 
.fleet-agents-7                                                   0     p      STARTED    node-1 
logstash-2022.05.22-000001                                        0     p      STARTED    node-1 
.ds-metrics-windows.perfmon-default-2022.06.25-000002             0     p      STARTED    node-1 
.ds-metrics-elastic_agent.fleet_server-default-2022.06.25-000002  0     p      STARTED    node-1 
.ds-winlogbeat-8.2.0-2022.05.23-000001                            0     p      STARTED    node-1 
.ds-metrics-elastic_agent.metricbeat-default-2022.05.24-000001    0     p      STARTED    node-1 
.apm-custom-link                                                  0     p      STARTED    node-1 
.ds-metrics-elastic_agent.elastic_agent-default-2022.06.25-000002 0     p      STARTED    node-1 
.ds-winlogbeat-8.2.0-2022.06.22-000002                            0     p      STARTED    node-1 
.async-search                                                     0     p      STARTED    node-1 
.security-7                                                       0     p      STARTED    node-1 
sigma-index                                                       0     p      STARTED    node-1 
.ds-.fleet-actions-results-2022.05.24-000001                      0     p      STARTED    node-1 
.ds-metrics-elastic_agent.metricbeat-default-2022.06.25-000002    0     p      STARTED    node-1 
.ds-logs-elastic_agent.metricbeat-default-2022.05.24-000001       0     p      STARTED    node-1 
.ds-metrics-windows.service-default-2022.05.24-000001             0     p      STARTED    node-1 
.ds-logs-windows.powershell_operational-default-2022.05.30-000001 0     p      STARTED    node-1 
.kibana_security_session_1                                        0     p      STARTED    node-1 
.tasks                                                            0     p      STARTED    node-1 
.geoip_databases                                                  0     p      STARTED    node-1 
.ds-logs-elastic_agent.filebeat-default-2022.06.25-000002         0     p      STARTED    node-1 
.ds-ilm-history-5-2022.06.21-000002                               0     p      STARTED    node-1 
.transform-notifications-000002                                   0     p      STARTED    node-1 
.kibana-event-log-7.17.3-000002                                   0     p      STARTED    node-1 
.ds-ilm-history-5-2022.05.22-000001                               0     p      STARTED    node-1 
.ds-.fleet-actions-results-2022.06.25-000002                      0     p      STARTED    node-1 
.ds-logs-windows.sysmon_operational-default-2022.05.31-000001     0     p      STARTED    node-1 
logstash-2022.06.21-000002                                        0     p      STARTED    node-1 
.items-default-000001                                             0     p      STARTED    node-1 
.ds-logs-elastic_agent-default-2022.05.24-000001                  0     p      STARTED    node-1 
.ds-logs-elastic_agent.filebeat-default-2022.05.24-000001         0     p      STARTED    node-1 
.fleet-policies-7                                                 0     p      STARTED    node-1 
.ds-logs-elastic_agent.fleet_server-default-2022.05.24-000001     0     p      STARTED    node-1 
.ds-metrics-elastic_agent.fleet_server-default-2022.05.24-000001  0     p      STARTED    node-1 
.siem-signals-default-000001                                      0     p      STARTED    node-1

I also ran this:

GET _cluster/allocation/explain
{
  "index": ".fleet-servers-7", 
  "shard": 0, 
  "primary": true 
}

Result:

{
  "index" : ".fleet-servers-7",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2022-06-27T11:11:28.248Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [GF2tTENwT8S7UiAEPK0WaQ]: shard failure, reason [failed to recover from translog], failure EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog from source [/var/lib/elasticsearch/nodes/0/indices/HI5E9uMpTL6FCuwtkERw-A/0/translog/translog-13.tlog] is corrupted, operation size must be at least 4 but was: 0]; ",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "yes",
  "allocate_explanation" : "can allocate the shard",
  "target_node" : {
    "id" : "GF2tTENwT8S7UiAEPK0WaQ",
    "name" : "node-1",
    "transport_address" : "x.x.x.x:9300",
    "attributes" : {
      "ml.machine_memory" : "8312123392",
      "xpack.installed" : "true",
      "transform.node" : "true",
      "ml.max_open_jobs" : "512",
      "ml.max_jvm_size" : "4294967296"
    }
  },
  "allocation_id" : "7ERQMHPRRnCg9FUR8-d5ew",
  "node_allocation_decisions" : [
    {
      "node_id" : "GF2tTENwT8S7UiAEPK0WaQ",
      "node_name" : "node-1",
      "transport_address" : "x.x.x.x:9300",
      "node_attributes" : {
        "ml.machine_memory" : "8312123392",
        "xpack.installed" : "true",
        "transform.node" : "true",
        "ml.max_open_jobs" : "512",
        "ml.max_jvm_size" : "4294967296"
      },
      "node_decision" : "yes",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "7ERQMHPRRnCg9FUR8-d5ew"
      }
    }
  ]
}

I have not been able to rectify this issue. Since I am pretty new, I am not sure what settings to change and where. Hope someone is able to help. Thank you!

Welcome to our community! :smiley:

This would suggest that your storage had (or still has) issues that are causing data corruption, so definitely check that.

However, you could try a request to _cluster/reroute?retry_failed=true to see if that helps.
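
For example, something like this from Dev Tools (a minimal sketch; retry_failed just asks the cluster to retry shard allocations that previously hit the failure limit):

POST _cluster/reroute?retry_failed=true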

Hi, thanks for the info. I tried running _cluster/reroute?retry_failed=true but nothing changed.

I think I managed to resolve the corruption by running this:

bin/elasticsearch-shard remove-corrupted-data
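
For anyone following along, the invocation would look roughly like this (a sketch only, assuming the node is stopped first and using the index/shard from the allocation explain output above):

bin/elasticsearch-shard remove-corrupted-data --index .fleet-servers-7 --shard-id 0

If I remember right, the tool reports how much data will be discarded and asks for confirmation before truncating the corrupted translog.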

After that, I ran this:

POST /_cluster/reroute
{
  "commands" : [
    {
      "allocate_stale_primary" : {
        "index" : ".fleet-servers-7",
        "shard" : 0,
        "node" : "GF2tTENwT8S7UiAEPK0WaQ",
        "accept_data_loss" : true
      }
    }
  ]
}

And when I ran this command again to check:

GET _cluster/allocation/explain
{
  "index": ".fleet-servers-7", 
  "shard": 0, 
  "primary": true 
}

It seems the corrupted translog is somehow fixed, as the translog corruption message is no longer shown. But it looks like I have run into another issue: no valid shard copy. This is the output:

{
  "index" : ".fleet-servers-7",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2022-06-28T07:10:56.731Z",
    "failed_allocation_attempts" : 1,
    "details" : "failed shard on node [GF2tTENwT8S7UiAEPK0WaQ]: failed recovery, failure RecoveryFailedException[[.fleet-servers-7][0]: Recovery failed on {node-1}{GF2tTENwT8S7UiAEPK0WaQ}{_8Fdn17rQTyOjua5SlNH8w}{192.168.31.131}{192.168.31.131:9300}{cdfhilmrstw}{ml.machine_memory=8312123392, xpack.installed=true, transform.node=true, ml.max_open_jobs=512, ml.max_jvm_size=4294967296}]; nested: IndexShardRecoveryException[failed recovery]; nested: TranslogException[failed to create new translog file]; nested: AccessDeniedException[/var/lib/elasticsearch/nodes/0/indices/HI5E9uMpTL6FCuwtkERw-A/0/translog/translog.ckp]; ",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt",
  "node_allocation_decisions" : [
    {
      "node_id" : "GF2tTENwT8S7UiAEPK0WaQ",
      "node_name" : "node-1",
      "transport_address" : "x.x.x.x:9300",
      "node_attributes" : {
        "ml.machine_memory" : "8312123392",
        "xpack.installed" : "true",
        "transform.node" : "true",
        "ml.max_open_jobs" : "512",
        "ml.max_jvm_size" : "4294967296"
      },
      "node_decision" : "no",
      "store" : {
        "in_sync" : false,
        "allocation_id" : "bUOXbAKoRLCnl0_ucqiVfA"
      }
    }
  ]
}

Did I perhaps do something wrong by removing the corrupted data?

Do you have a backup of the data? If not, you will probably need to delete the index entirely.

No, unfortunately I do not have a backup. But I am fine with deleting it, since there isn't much data. So once I delete it (e.g. with DELETE /my-index-000001), do I just have to recreate the index again for Fleet? How do I proceed after deleting?
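
To be concrete, this is the kind of request I had in mind for the affected index (just a sketch; I'm not sure yet whether deleting a system index like this is safe):

DELETE /.fleet-servers-7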

And as seen in the shard listing above, there are 4 unassigned indices related to Fleet, so would I just have to delete those indices?

Yeah, but I don't know what the impact of deleting those would be, as they are system indices.

@JLeysens sorry for the direct ping, but I'm hoping you can provide some advice here for us.

I am not an expert in Fleet, but I know that all .fleet-* indices are used by both Kibana and Fleet Server. Additionally, Kibana stores its own state outside of those indices (in .kibana* indices).

You may end up in an inconsistent state if you only delete the .fleet-* indices; you may need to start with a completely clean slate :grimacing:

When you say a completely clean slate, do you mean literally starting the ELK stack from the ground up by reinstalling it? Also, is this an Elasticsearch or Kibana problem, and is there any particular reason you think it happens? I ask because everything was perfectly fine and then this issue appeared out of nowhere.

The root cause of the issue seems to be the translog corruption reported in your allocation explain output.

It sounds like something has happened at the data storage level. If we cannot recover to an earlier state, then yes, that data is lost.

From what I've seen, I'm not sure the issue lies in either Elasticsearch or Kibana, since it started with translog corruption being detected.
