Unable to allocate index

Hi All,

I've been facing an issue with an Elasticsearch cluster that first failed to allocate an index out of the blue, and then again after a restart.
The cluster has a single node at the moment and replication is disabled.
When I check the allocation status I see:

GET {{elastic}}:9200/_cluster/allocation/explain?pretty&include_yes_decisions=true
> {
>   "index": "logstash-2019.10.04",
>   "shard": 0,
>   "primary": true,
>   "current_state": "unassigned",
>   "unassigned_info": {
>     "reason": "ALLOCATION_FAILED",
>     "at": "2019-10-15T12:25:27.278Z",
>     "failed_allocation_attempts": 5,
>     "details": "failed shard on node [gjegRSM1Rbi22HYJOFYINw]: failed recovery, failure RecoveryFailedException[[logstash-2019.10.04][0]: Recovery failed on {3730edbdef97}{gjegRSM1Rbi22HYJOFYINw}{QP5gwwEgTKm44DzJisNfaQ}{xxx.17.0.2}{xxx.17.0.2:9300}{ml.machine_memory=4294967296, xpack.installed=true, ml.max_open_jobs=20}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/usr/share/elasticsearch/data/nodes/0/indices/np3g96ylRuGrsJKZ4Zo2LA/0/index/_17i.fdt]; ",
>     "last_allocation_status": "no"
>   },
>   "can_allocate": "no",
>   "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
>   "node_allocation_decisions": [
>     {
>       "node_id": "gjegRSM1Rbi22HYJOFYINw",
>       "node_name": "3730edbdef97",
>       "transport_address": "xxx.17.0.2:9300",
>       "node_attributes": {
>         "ml.machine_memory": "4294967296",
>         "xpack.installed": "true",
>         "ml.max_open_jobs": "20"
>       },
>       "node_decision": "no",
>       "store": {
>         "in_sync": true,
>         "allocation_id": "Xs0xcNPyQdORV_hS5JUkHg"
>       },
>       "deciders": [
>         {
>           "decider": "max_retry",
>           "decision": "NO",
>           "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-10-15T12:25:27.278Z], failed_attempts[5], delayed=false, details[failed shard on node [gjegRSM1Rbi22HYJOFYINw]: failed recovery, failure RecoveryFailedException[[logstash-2019.10.04][0]: Recovery failed on {3730edbdef97}{gjegRSM1Rbi22HYJOFYINw}{QP5gwwEgTKm44DzJisNfaQ}{xxx.17.0.2}{xxx.17.0.2:9300}{ml.machine_memory=4294967296, xpack.installed=true, ml.max_open_jobs=20}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/usr/share/elasticsearch/data/nodes/0/indices/np3g96ylRuGrsJKZ4Zo2LA/0/index/_17i.fdt]; ], allocation_status[deciders_no]]]"
>         },
>         {
>           "decider": "replica_after_primary_active",
>           "decision": "YES",
>           "explanation": "shard is primary and can be allocated"
>         },
>         {
>           "decider": "enable",
>           "decision": "YES",
>           "explanation": "all allocations are allowed"
>         },
>         {
>           "decider": "node_version",
>           "decision": "YES",
>           "explanation": "the primary shard is new or already existed on the node"
>         },
>         {
>           "decider": "snapshot_in_progress",
>           "decision": "YES",
>           "explanation": "no snapshots are currently running"
>         },
>         {
>           "decider": "restore_in_progress",
>           "decision": "YES",
>           "explanation": "ignored as shard is not being recovered from a snapshot"
>         },
>         {
>           "decider": "filter",
>           "decision": "YES",
>           "explanation": "node passes include/exclude/require filters"
>         },
>         {
>           "decider": "same_shard",
>           "decision": "YES",
>           "explanation": "the shard does not exist on the same node"
>         },
>         {
>           "decider": "disk_threshold",
>           "decision": "YES",
>           "explanation": "there is only a single data node present"
>         },
>         {
>           "decider": "throttling",
>           "decision": "YES",
>           "explanation": "below primary recovery limit of [4]"
>         },
>         {
>           "decider": "shards_limit",
>           "decision": "YES",
>           "explanation": "total shard limits are disabled: [index: -1, cluster: -1] <= 0"
>         },
>         {
>           "decider": "awareness",
>           "decision": "YES",
>           "explanation": "allocation awareness is not enabled, set cluster setting [cluster.routing.allocation.awareness.attributes] to enable it"
>         }
>       ]
>     }
>   ]
> }

I tried running:

POST {{elastic}}:9200/_cluster/reroute?retry_failed
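
which, as far as I understand, is the same as the explicit form the explain output asks for:

POST {{elastic}}:9200/_cluster/reroute?retry_failed=true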

And when I run:

{{elastic}}:9200/_cat/shards

I see the indices as initializing for a bit, but then they change back to unassigned again.
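
To watch just those shards I've been running something like this (the column list is simply what seemed useful to me):

GET {{elastic}}:9200/_cat/shards/logstash-*?v&h=index,shard,prirep,state,unassigned.reason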

I also tried running Lucene's CheckIndex tool against the index like this:

/usr/share/elasticsearch/jdk/bin/java -cp "*" -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /usr/share/elasticsearch/data/nodes/0/indices/np3g96ylRuGrsJKZ4Zo2LA/0/index/

It runs and reports no problems.

Any ideas?

Thanks in advance.

In future please use the </> button to format fixed-width text. It's basically impossible to read otherwise. I've fixed the unreadable bit of your post above for you.

Is the file that Elasticsearch claims is missing actually missing? I.e. does .../nodes/0/indices/np3g96ylRuGrsJKZ4Zo2LA/0/index/_17i.fdt exist or not?
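
You could check straight from the shell on that node, e.g. something like:

ls -l /usr/share/elasticsearch/data/nodes/0/indices/np3g96ylRuGrsJKZ4Zo2LA/0/index/_17i.fdt

and see whether the filesystem can see the file at all.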

Did anything untoward happen to this node in the recent past? E.g. a power outage or other sudden shutdown?

Do you have a snapshot of this index from which you can restore a good copy?
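
If you do, restoring just that index would look roughly like this (my_repository and my_snapshot are placeholders for whatever you have registered, and you'd need to delete or close the broken copy first):

POST {{elastic}}:9200/_snapshot/my_repository/my_snapshot/_restore
{ "indices": "logstash-2019.10.04" }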

Thank you for the formatting!
The first time it happened, it was out of the blue; the second time was right after I restarted Elasticsearch.
The file is there but seems to be broken, and we do not have snapshots:

Ok, this seems bad, and unrelated to Elasticsearch if ls also reports No such file or directory on a file that seems to be present. This looks like a filesystem issue. Can you fsck this filesystem?
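
That would mean stopping Elasticsearch and unmounting the data filesystem first, then something along these lines (the device name is only an example):

umount /dev/sdb1
fsck -y /dev/sdb1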

Also I think there will be a full stack trace of the NoSuchFileException in your server logs. Can you share that please?

Sadly I cannot run fsck.
The Elasticsearch logs have rotated, so all I have related to this error is:

Out of desperation, I deleted both offending indices using:
DELETE {{elastic}}:9200/logstash-2019.10.04
and
DELETE {{elastic}}:9200/logstash-2019.10.03

After that I started seeing the following log over and over again, until it stopped a few minutes ago:

{"type": "server", "timestamp": "2019-10-16T17:32:55,192+0000", "level": "WARN", "component": "o.e.i.IndicesService", "cluster.name": "docker-cluster", "node.name": "3730edbdef97", "cluster.uuid": "69cIMsbUTUOiK3MOdNggow", "node.id": "gjegRSM1Rbi22HYJOFYINw",  "message": "[logstash-2019.10.03/Qcz9ztHhQaCjSvUFuF3VlQ] still pending deletes present for shards [[[logstash-2019.10.03/Qcz9ztHhQaCjSvUFuF3VlQ]]] - retrying"  }
{"type": "server", "timestamp": "2019-10-16T17:32:55,610+0000", "level": "WARN", "component": "o.e.i.IndicesService", "cluster.name": "docker-cluster", "node.name": "3730edbdef97", "cluster.uuid": "69cIMsbUTUOiK3MOdNggow", "node.id": "gjegRSM1Rbi22HYJOFYINw",  "message": "[logstash-2019.10.04/np3g96ylRuGrsJKZ4Zo2LA] still pending deletes present for shards [[[logstash-2019.10.04/np3g96ylRuGrsJKZ4Zo2LA]]] - retrying"  }

Ok, interesting: we're hitting this exception while trying to delete these files, which at least means they don't contain any data that isn't held elsewhere.

However, I think this filesystem is in an inconsistent state and I wouldn't recommend trusting it with anything important. It's probably a good idea to take a snapshot ASAP in case it starts behaving even worse. You must at least run fsck to try and fix any inconsistencies, although this might lose some other data. Alternatively, move all your data onto a known-good filesystem and wipe this one.
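
If you don't have a snapshot repository registered yet, a minimal shared-filesystem repository would be something like this (my_backup and the location are only examples, the location must be listed under path.repo in elasticsearch.yml, and it should ideally point at a different, known-good filesystem):

PUT {{elastic}}:9200/_snapshot/my_backup
{ "type": "fs", "settings": { "location": "/mnt/backups/my_backup" } }

PUT {{elastic}}:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true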

Ok, I will try that. Thank you, David!
