Hi All,
I've been facing an issue with an Elasticsearch cluster that failed to allocate an index, first out of the blue and later again after a restart.
The cluster currently has a single node and replication is disabled.
When I check allocation status I see:
GET {{elastic}}:9200/_cluster/allocation/explain?pretty&include_yes_decisions=true
> {
> "index": "logstash-2019.10.04",
> "shard": 0,
> "primary": true,
> "current_state": "unassigned",
> "unassigned_info": {
> "reason": "ALLOCATION_FAILED",
> "at": "2019-10-15T12:25:27.278Z",
> "failed_allocation_attempts": 5,
> "details": "failed shard on node [gjegRSM1Rbi22HYJOFYINw]: failed recovery, failure RecoveryFailedException[[logstash-2019.10.04][0]: Recovery failed on {3730edbdef97}{gjegRSM1Rbi22HYJOFYINw}{QP5gwwEgTKm44DzJisNfaQ}{xxx.17.0.2}{xxx.17.0.2:9300}{ml.machine_memory=4294967296, xpack.installed=true, ml.max_open_jobs=20}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/usr/share/elasticsearch/data/nodes/0/indices/np3g96ylRuGrsJKZ4Zo2LA/0/index/_17i.fdt]; ",
> "last_allocation_status": "no"
> },
> "can_allocate": "no",
> "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
> "node_allocation_decisions": [
> {
> "node_id": "gjegRSM1Rbi22HYJOFYINw",
> "node_name": "3730edbdef97",
> "transport_address": "xxx.17.0.2:9300",
> "node_attributes": {
> "ml.machine_memory": "4294967296",
> "xpack.installed": "true",
> "ml.max_open_jobs": "20"
> },
> "node_decision": "no",
> "store": {
> "in_sync": true,
> "allocation_id": "Xs0xcNPyQdORV_hS5JUkHg"
> },
> "deciders": [
> {
> "decider": "max_retry",
> "decision": "NO",
> "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-10-15T12:25:27.278Z], failed_attempts[5], delayed=false, details[failed shard on node [gjegRSM1Rbi22HYJOFYINw]: failed recovery, failure RecoveryFailedException[[logstash-2019.10.04][0]: Recovery failed on {3730edbdef97}{gjegRSM1Rbi22HYJOFYINw}{QP5gwwEgTKm44DzJisNfaQ}{xxx.17.0.2}{xxx.17.0.2:9300}{ml.machine_memory=4294967296, xpack.installed=true, ml.max_open_jobs=20}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/usr/share/elasticsearch/data/nodes/0/indices/np3g96ylRuGrsJKZ4Zo2LA/0/index/_17i.fdt]; ], allocation_status[deciders_no]]]"
> },
> {
> "decider": "replica_after_primary_active",
> "decision": "YES",
> "explanation": "shard is primary and can be allocated"
> },
> {
> "decider": "enable",
> "decision": "YES",
> "explanation": "all allocations are allowed"
> },
> {
> "decider": "node_version",
> "decision": "YES",
> "explanation": "the primary shard is new or already existed on the node"
> },
> {
> "decider": "snapshot_in_progress",
> "decision": "YES",
> "explanation": "no snapshots are currently running"
> },
> {
> "decider": "restore_in_progress",
> "decision": "YES",
> "explanation": "ignored as shard is not being recovered from a snapshot"
> },
> {
> "decider": "filter",
> "decision": "YES",
> "explanation": "node passes include/exclude/require filters"
> },
> {
> "decider": "same_shard",
> "decision": "YES",
> "explanation": "the shard does not exist on the same node"
> },
> {
> "decider": "disk_threshold",
> "decision": "YES",
> "explanation": "there is only a single data node present"
> },
> {
> "decider": "throttling",
> "decision": "YES",
> "explanation": "below primary recovery limit of [4]"
> },
> {
> "decider": "shards_limit",
> "decision": "YES",
> "explanation": "total shard limits are disabled: [index: -1, cluster: -1] <= 0"
> },
> {
> "decider": "awareness",
> "decision": "YES",
> "explanation": "allocation awareness is not enabled, set cluster setting [cluster.routing.allocation.awareness.attributes] to enable it"
> }
> ]
> }
> ]
> }
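Since the NoSuchFileException points at a specific segment file, one sanity check I still plan to do is to list that shard's index directory directly on the data node and see whether the _17i files are actually missing, something along these lines (path taken verbatim from the exception above):
# list the Lucene files of the affected shard and filter for the segment named in the exception
ls -l /usr/share/elasticsearch/data/nodes/0/indices/np3g96ylRuGrsJKZ4Zo2LA/0/index/ | grep _17i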
I tried running:
POST {{elastic}}:9200/_cluster/reroute?retry_failed
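For the record, the explicit form suggested by the max_retry decider message above, which as I understand it resets the failed-allocation counter before retrying, would be:
POST {{elastic}}:9200/_cluster/reroute?retry_failed=true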
And when I run:
{{elastic}}:9200/_cat/shards
I see the shards as INITIALIZING for a bit, but then they change back to UNASSIGNED again.
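In case it helps, the same _cat/shards call can also print the unassigned reason and details columns, so I can check whether it keeps failing with the same NoSuchFileException (column names as I understand them from the _cat docs):
GET {{elastic}}:9200/_cat/shards/logstash-2019.10.04?v&h=index,shard,prirep,state,unassigned.reason,unassigned.details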
I also tried using Lucene's CheckIndex to fix the index, like this:
/usr/share/elasticsearch/jdk/bin/java -cp "*" -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /usr/share/elasticsearch/data/nodes/0/indices/np3g96ylRuGrsJKZ4Zo2LA/0/index/
It runs and reports no problems.
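From what I've read, CheckIndex only verifies the index unless it's run with -exorcise, which drops whatever it cannot read (and can therefore lose documents), and newer Elasticsearch versions also bundle an elasticsearch-shard tool for the same purpose. I haven't dared to run either against this shard yet, so these are just the forms I'd expect rather than something I've verified, with Elasticsearch stopped first in both cases:
# destructive Lucene repair of the same shard directory
/usr/share/elasticsearch/jdk/bin/java -cp "/usr/share/elasticsearch/lib/*" -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /usr/share/elasticsearch/data/nodes/0/indices/np3g96ylRuGrsJKZ4Zo2LA/0/index/ -exorcise
# or the bundled tool
/usr/share/elasticsearch/bin/elasticsearch-shard remove-corrupted-data --index logstash-2019.10.04 --shard-id 0
If I understand the docs correctly, after elasticsearch-shard finishes it prints a _cluster/reroute command with allocate_stale_primary and accept_data_loss=true that still has to be run to bring the shard back.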
Any ideas?
Thanks in advance.