Unable to allocate index

Merlin_Nunez · October 15, 2019, 12:57pm

Hi All,

I've been facing an issue with an Elasticsearch cluster that failed to allocate an index, first out of the blue and later again after a restart.
The cluster has a single node at the moment and replication is disabled.
When I check allocation status I see:

GET {{elastic}}:9200/_cluster/allocation/explain?pretty&include_yes_decisions=true
> {
>   "index": "logstash-2019.10.04",
>   "shard": 0,
>   "primary": true,
>   "current_state": "unassigned",
>   "unassigned_info": {
>     "reason": "ALLOCATION_FAILED",
>     "at": "2019-10-15T12:25:27.278Z",
>     "failed_allocation_attempts": 5,
>     "details": "failed shard on node [gjegRSM1Rbi22HYJOFYINw]: failed recovery, failure RecoveryFailedException[[logstash-2019.10.04][0]: Recovery failed on {3730edbdef97}{gjegRSM1Rbi22HYJOFYINw}{QP5gwwEgTKm44DzJisNfaQ}{xxx.17.0.2}{xxx.17.0.2:9300}{ml.machine_memory=4294967296, xpack.installed=true, ml.max_open_jobs=20}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/usr/share/elasticsearch/data/nodes/0/indices/np3g96ylRuGrsJKZ4Zo2LA/0/index/_17i.fdt]; ",
>     "last_allocation_status": "no"
>   },
>   "can_allocate": "no",
>   "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
>   "node_allocation_decisions": [
>     {
>       "node_id": "gjegRSM1Rbi22HYJOFYINw",
>       "node_name": "3730edbdef97",
>       "transport_address": "xxx.17.0.2:9300",
>       "node_attributes": {
>         "ml.machine_memory": "4294967296",
>         "xpack.installed": "true",
>         "ml.max_open_jobs": "20"
>       },
>       "node_decision": "no",
>       "store": {
>         "in_sync": true,
>         "allocation_id": "Xs0xcNPyQdORV_hS5JUkHg"
>       },
>       "deciders": [
>         {
>           "decider": "max_retry",
>           "decision": "NO",
>           "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-10-15T12:25:27.278Z], failed_attempts[5], delayed=false, details[failed shard on node [gjegRSM1Rbi22HYJOFYINw]: failed recovery, failure RecoveryFailedException[[logstash-2019.10.04][0]: Recovery failed on {3730edbdef97}{gjegRSM1Rbi22HYJOFYINw}{QP5gwwEgTKm44DzJisNfaQ}{xxx.17.0.2}{xxx.17.0.2:9300}{ml.machine_memory=4294967296, xpack.installed=true, ml.max_open_jobs=20}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: NoSuchFileException[/usr/share/elasticsearch/data/nodes/0/indices/np3g96ylRuGrsJKZ4Zo2LA/0/index/_17i.fdt]; ], allocation_status[deciders_no]]]"
>         },
>         {
>           "decider": "replica_after_primary_active",
>           "decision": "YES",
>           "explanation": "shard is primary and can be allocated"
>         },
>         {
>           "decider": "enable",
>           "decision": "YES",
>           "explanation": "all allocations are allowed"
>         },
>         {
>           "decider": "node_version",
>           "decision": "YES",
>           "explanation": "the primary shard is new or already existed on the node"
>         },
>         {
>           "decider": "snapshot_in_progress",
>           "decision": "YES",
>           "explanation": "no snapshots are currently running"
>         },
>         {
>           "decider": "restore_in_progress",
>           "decision": "YES",
>           "explanation": "ignored as shard is not being recovered from a snapshot"
>         },
>         {
>           "decider": "filter",
>           "decision": "YES",
>           "explanation": "node passes include/exclude/require filters"
>         },
>         {
>           "decider": "same_shard",
>           "decision": "YES",
>           "explanation": "the shard does not exist on the same node"
>         },
>         {
>           "decider": "disk_threshold",
>           "decision": "YES",
>           "explanation": "there is only a single data node present"
>         },
>         {
>           "decider": "throttling",
>           "decision": "YES",
>           "explanation": "below primary recovery limit of [4]"
>         },
>         {
>           "decider": "shards_limit",
>           "decision": "YES",
>           "explanation": "total shard limits are disabled: [index: -1, cluster: -1] <= 0"
>         },
>         {
>           "decider": "awareness",
>           "decision": "YES",
>           "explanation": "allocation awareness is not enabled, set cluster setting [cluster.routing.allocation.awareness.attributes] to enable it"
>         }
>       ]
>     }
>   ]
> }
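
For readers skimming a response like the one above: every decider says YES except `max_retry`, and the real cause is buried in its explanation text. A minimal sketch, assuming the response has already been fetched and parsed as JSON, of pulling out only the deciders that said NO. The function name `blocking_deciders` is made up for illustration, and the sample is a trimmed copy of the response above.

```python
import json


def blocking_deciders(explain):
    """Return (node_name, decider, explanation) for every decider that said NO."""
    blocked = []
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                blocked.append(
                    (node.get("node_name"), decider.get("decider"), decider.get("explanation"))
                )
    return blocked


# Trimmed copy of the allocation explain response from this post.
sample = json.loads("""
{
  "index": "logstash-2019.10.04",
  "shard": 0,
  "primary": true,
  "can_allocate": "no",
  "node_allocation_decisions": [
    {
      "node_name": "3730edbdef97",
      "node_decision": "no",
      "deciders": [
        {"decider": "max_retry", "decision": "NO",
         "explanation": "shard has exceeded the maximum number of retries [5] on failed allocation attempts"},
        {"decider": "enable", "decision": "YES",
         "explanation": "all allocations are allowed"}
      ]
    }
  ]
}
""")

for node_name, decider, explanation in blocking_deciders(sample):
    print(f"{node_name}: {decider} -> {explanation}")
```

Filtering this way makes it obvious that `max_retry` is only the messenger; the nested `NoSuchFileException` in the explanation is what needs investigating.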

I tried running:

POST {{elastic}}:9200/_cluster/reroute?retry_failed

And when I run:

{{elastic}}:9200/_cat/shards

I see the indices as initializing for a bit, but then they change to unassigned again.
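
To watch that flip from initializing back to unassigned without eyeballing the output, something like the following could summarize the `_cat/shards` text. It is only a sketch: it assumes the first four columns are index, shard, prirep and state (the default order, which can also be pinned with `?h=index,shard,prirep,state`), and the sample rows, including the doc count and store size, are made up for illustration.

```python
from collections import Counter


def shard_states(cat_shards_text):
    """Count shards per (index, state) from plain-text _cat/shards output."""
    counts = Counter()
    for line in cat_shards_text.splitlines():
        cols = line.split()
        if len(cols) < 4:
            continue  # skip blank or truncated lines
        index, _shard, _prirep, state = cols[:4]
        counts[(index, state)] += 1
    return counts


def unassigned_indices(cat_shards_text):
    """Sorted names of indices that have at least one UNASSIGNED shard."""
    states = shard_states(cat_shards_text)
    return sorted({index for (index, state) in states if state == "UNASSIGNED"})


# Illustrative rows only; unassigned shards have no docs/store/ip/node columns.
SAMPLE = """\
logstash-2019.10.04 0 p UNASSIGNED
logstash-2019.10.05 0 p STARTED 1200 1.1mb xxx.17.0.2 3730edbdef97
"""

print(unassigned_indices(SAMPLE))
```

Running this against the output before and after `POST /_cluster/reroute?retry_failed` shows whether the retry changed anything.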

I also tried running Lucene's CheckIndex on the index like this:

/usr/share/elasticsearch/jdk/bin/java -cp "*" -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /usr/share/elasticsearch/data/nodes/0/indices/np3g96ylRuGrsJKZ4Zo2LA/0/index/

And it runs reporting no problems.

Any ideas?

Thanks in advance.

DavidTurner · October 15, 2019, 1:54pm

In future please use the </> button to format fixed-width text. It's basically impossible to read otherwise. I've fixed the unreadable bit of your post above for you.

Is the file that Elasticsearch claims is missing actually missing? I.e. does .../nodes/0/indices/np3g96ylRuGrsJKZ4Zo2LA/0/index/_17i.fdt exist or not?

Did anything untoward happen to this node in the recent past? E.g. a power outage or other sudden shutdown?

Do you have a snapshot of this index from which you can restore a good copy?

Merlin_Nunez · October 15, 2019, 5:35pm

Thank you for the formatting!
The first time it happened, it was out of the blue, the second time was right after I restarted elasticsearch.
The file is there but seems to be broken, and we do not have snapshots:

DavidTurner · October 15, 2019, 5:39pm

Ok, this seems bad, and unrelated to Elasticsearch if ls also reports No such file or directory on a file that seems to be present. This looks like a filesystem issue. Can you fsck this filesystem?
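
The symptom described here, a name that shows up in the directory listing but cannot be opened, can be checked for a whole shard directory at once. A rough sketch, assuming it is run with read access to the data path; the function name `listed_but_unreadable` is invented for this example. Note that it follows symlinks, so a dangling symlink is reported too.

```python
import os


def listed_but_unreadable(directory):
    """Return (name, error) for entries that are listed but cannot be stat'ed."""
    broken = []
    for name in sorted(os.listdir(directory)):
        try:
            # os.stat follows symlinks, so dangling links are reported as well.
            os.stat(os.path.join(directory, name))
        except OSError as exc:
            broken.append((name, exc.strerror))
    return broken
```

Pointing it at `/usr/share/elasticsearch/data/nodes/0/indices/np3g96ylRuGrsJKZ4Zo2LA/0/index` would show whether `_17i.fdt` is the only affected file. An empty list does not prove the filesystem is healthy; it only means every listed entry could be stat'ed.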

DavidTurner · October 15, 2019, 5:40pm

Also I think there will be a full stack trace of the NoSuchFileException in your server logs. Can you share that please?

Merlin_Nunez · October 16, 2019, 5:41pm

Sadly I cannot fsck.
The logs for elasticsearch rotated, so all I have related to this error is:

Out of desperation, I deleted both offending indices using:
DELETE {{elastic}}:9200/logstash-2019.10.04
and
DELETE {{elastic}}:9200/logstash-2019.10.03

After that I started seeing the following log over and over again, until it stopped a few minutes ago:

{"type": "server", "timestamp": "2019-10-16T17:32:55,192+0000", "level": "WARN", "component": "o.e.i.IndicesService", "cluster.name": "docker-cluster", "node.name": "3730edbdef97", "cluster.uuid": "69cIMsbUTUOiK3MOdNggow", "node.id": "gjegRSM1Rbi22HYJOFYINw",  "message": "[logstash-2019.10.03/Qcz9ztHhQaCjSvUFuF3VlQ] still pending deletes present for shards [[[logstash-2019.10.03/Qcz9ztHhQaCjSvUFuF3VlQ]]] - retrying"  }
{"type": "server", "timestamp": "2019-10-16T17:32:55,610+0000", "level": "WARN", "component": "o.e.i.IndicesService", "cluster.name": "docker-cluster", "node.name": "3730edbdef97", "cluster.uuid": "69cIMsbUTUOiK3MOdNggow", "node.id": "gjegRSM1Rbi22HYJOFYINw",  "message": "[logstash-2019.10.04/np3g96ylRuGrsJKZ4Zo2LA] still pending deletes present for shards [[[logstash-2019.10.04/np3g96ylRuGrsJKZ4Zo2LA]]] - retrying"  }

DavidTurner · October 17, 2019, 7:38am

Ok, interesting, we're hitting this exception trying to delete these files which at least means they don't contain any data that isn't held elsewhere.

However I think this filesystem is in an inconsistent state and I wouldn't recommend trusting it with anything important. It's probably a good idea to take a snapshot ASAP in case it starts behaving even worse. You must at least run a fsck to try and fix any inconsistencies, although this might lose some other data. Alternatively move all your data onto a known-good filesystem and wipe this one.
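
For the snapshot step, a sketch of the two REST calls involved, built as plain data so nothing is sent by accident. The repository name `rescue_repo`, the location `/mnt/known-good/es-snapshots` and the snapshot name `pre-fsck` are hypothetical. It assumes a shared-filesystem (`fs`) repository whose location is on a different, known-good filesystem and is listed under `path.repo` in `elasticsearch.yml`.

```python
import json


def snapshot_requests(repo, location, snapshot):
    """Build (method, path, body) for registering an fs repository and snapshotting."""
    register = (
        "PUT",
        f"/_snapshot/{repo}",
        {"type": "fs", "settings": {"location": location}},
    )
    take = (
        "PUT",
        f"/_snapshot/{repo}/{snapshot}?wait_for_completion=true",
        None,  # no body: snapshot all indices with default options
    )
    return [register, take]


# Hypothetical names; adjust to your environment before sending anything.
for method, path, body in snapshot_requests(
    "rescue_repo", "/mnt/known-good/es-snapshots", "pre-fsck"
):
    print(method, path)
    if body is not None:
        print(json.dumps(body, indent=2))
```

The printed requests can be sent with the same client used elsewhere in this thread, prefixing the path with `{{elastic}}:9200`.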

Merlin_Nunez · October 17, 2019, 12:02pm

OK, I will try that. Thank you, David!

system · November 14, 2019, 12:02pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
