UNASSIGNED ALLOCATION_FAILED

I have an Elasticsearch 6.2.3 setup (single node). The system had been running smoothly for a while. I needed to expand the hard drive, since disk space was about to run out. After the expansion, Elasticsearch came up, but after some time the health status turned red:
curl -s 'localhost:9200/_cluster/health?pretty'
{
  "cluster_name" : "ClusterX",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 880,
  "active_shards" : 880,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 15,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 98.32402234636871
}

The reason is UNASSIGNED ALLOCATION_FAILED:
curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED
indexA-2020.03.12 1 p UNASSIGNED ALLOCATION_FAILED
indexA-2020.03.12 4 p UNASSIGNED ALLOCATION_FAILED
...
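As a side note, the unassigned shards can be summarized per index with a short pipeline; the sample file below just mimics the truncated listing above, and against a live cluster you would pipe the curl output instead:

```shell
# Summarize unassigned shards per index from _cat/shards-style output.
# shards.txt mimics the listing above; on a live cluster you would use:
#   curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason'
cat <<'EOF' > shards.txt
indexA-2020.03.12 1 p UNASSIGNED ALLOCATION_FAILED
indexA-2020.03.12 4 p UNASSIGNED ALLOCATION_FAILED
indexB-2020.03.14 2 p STARTED
EOF
awk '$4 == "UNASSIGNED" { count[$1]++ } END { for (i in count) print i, count[i] }' shards.txt
# prints: indexA-2020.03.12 2
```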
The _cluster/allocation/explain call says that too many files are open:
curl -s 'localhost:9200/_cluster/allocation/explain?pretty'
{
  "index" : "indexA-2020.03.12",
  "shard" : 4,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2020-04-22T10:40:10.827Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [MXMUEqXRTxKfiSQCeQQvTg]: failed recovery, failure RecoveryFailedException[[indexA-2020.03.12][4]: Recovery failed on {MXMUEqX}{MXMUEqXRTxKfiSQCeQQvTg}{wsB9s7OWSRq7zLXX26d5NQ}{127.0.0.1}{127.0.0.1:9300}{rack_id=r1}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: FileSystemException[/opt/elasticsearch/nodes/0/indices/2G7F7nCQT7y_yMCybVV9xQ/4/translog/translog-251.ckp: Too many open files]; ",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  "node_allocation_decisions" : [
    {
      "node_id" : "MXMUEqXRTxKfiSQCeQQvTg",
      "node_name" : "MXMUEqX",
      "transport_address" : "127.0.0.1:9300",
      "node_attributes" : {
        "rack_id" : "r1"
      },
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "_YsZ8lYWT069lYzts-Eg2A"
      },
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-04-22T10:40:10.827Z], failed_attempts[5], delayed=false, details[failed shard on node [MXMUEqXRTxKfiSQCeQQvTg]: failed recovery, failure RecoveryFailedException[[indexA-2020.03.12][4]: Recovery failed on {MXMUEqX}{MXMUEqXRTxKfiSQCeQQvTg}{wsB9s7OWSRq7zLXX26d5NQ}{127.0.0.1}{127.0.0.1:9300}{rack_id=r1}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: FileSystemException[/opt/elasticsearch/nodes/0/indices/2G7F7nCQT7y_yMCybVV9xQ/4/translog/translog-251.ckp: Too many open files]; ], allocation_status[deciders_no]]]"
        }
      ]
    }
  ]
}
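The max_retry decider above names the fix itself: once the underlying file-descriptor problem is solved, the failed allocations must be retried manually. A minimal sketch, assuming the same localhost cluster:

```shell
# The decider gives up after 5 failed attempts. After raising the
# file-descriptor limit, ask the cluster to retry those allocations:
ES='localhost:9200'
echo "POST ${ES}/_cluster/reroute?retry_failed=true"
# Against the live cluster:
#   curl -XPOST "${ES}/_cluster/reroute?retry_failed=true&pretty"
```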

Once I truncate the translog with the elasticsearch-translog truncate tool, the system comes up, but after a while the same issue occurs again, with an index from a different date.
All those dates are before the date when the partition was expanded.
Any idea how to fix this permanently?

Thanks

Welcome!

You probably have too many shards per node.

May I suggest you look at the following resources about sizing:

https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing

And https://www.elastic.co/webinars/using-rally-to-get-your-elasticsearch-cluster-size-right

Also, this message:

Too many open files

seems to indicate that as well. Make sure you have enough file descriptors available. See https://www.elastic.co/guide/en/elasticsearch/reference/current/file-descriptors.html
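On the node itself, the effective limits can be checked from a shell. The config lines in the comments are the usual places to raise the limit (an assumption about how the node is started, not something stated in this thread):

```shell
# Soft and hard file-descriptor limits of the current shell:
ulimit -Sn
ulimit -Hn
# Typical places to raise the limit for the Elasticsearch process
# (assumptions about the setup; adjust to how the node is started):
#   /etc/security/limits.conf:  elasticsearch - nofile 65536
#   systemd unit override:      LimitNOFILE=65536
```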

I have 65536 fd configured.
curl -X GET "localhost:9200/_nodes/stats/process?filter_path=**.max_file_descriptors&pretty"
{
  "nodes" : {
    "U45GBT3kRo-Z1pj8mxDr5Q" : {
      "process" : {
        "max_file_descriptors" : 65536
      }
    }
  }
}
and I have 6 shards per index.
The really weird part is that it only started occurring after the disk expansion. I also don't understand why it creeps up while Elasticsearch is running.
Remember, ES functions for a while (returns queries), and then at some point the status switches to red. I could understand this during startup, when ES initializes the shards, but not during normal operation.
Do you think that 6 shards are too many?

You don't have 6 shards, but 880 active + 15 unassigned, so 895 shards allocated to one single node.

Sorry, I meant 6 shards per index.

Are 1000+ shards per node an issue?

It depends. But if you look at the resources I shared with you, you will see that we don't recommend having more than 20 shards per GB of heap.
And as we don't recommend using a heap bigger than 30 GB, 600 shards should be the maximum.

How much HEAP do you have?

The heap size is set to 24 GB. Given that, ~900 shards for a single node are too many.
Is it just a coincidence that the issue appeared after the disk expansion? From what you are saying, it almost seems like it.
Is there an easy way to reduce the shards per index?
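The arithmetic behind that verdict can be sketched in shell (the 20-shards-per-GB figure is the rule of thumb quoted above, not an exact limit):

```shell
# Rule of thumb from above: at most ~20 shards per GB of heap.
heap_gb=24
budget=$((heap_gb * 20))
actual=895                  # 880 active + 15 unassigned shards on this node
echo "budget=${budget} actual=${actual}"   # prints: budget=480 actual=895
```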

There's the shrink API.
Otherwise you need to reindex. The reindex API might help.
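A sketch of what a shrink could look like on this 6.x setup. The index name is only an example, and the target shard count must evenly divide the source count, so a 6-shard index can shrink to 3, 2, or 1 shards; the live commands are left as comments since they assume a running cluster:

```shell
# The shrink target's shard count must evenly divide the source count:
src_shards=6
for t in 3 2 1; do
  [ $((src_shards % t)) -eq 0 ] && echo "6 -> ${t} shards is a valid shrink"
done
# Against the live cluster (hypothetical index name):
#   1) make the index read-only so it can be shrunk:
#      curl -XPUT 'localhost:9200/app_3-2020.04.12/_settings' \
#        -H 'Content-Type: application/json' \
#        -d '{"index.blocks.write": true}'
#   2) shrink it into a single-shard copy:
#      curl -XPOST 'localhost:9200/app_3-2020.04.12/_shrink/app_3-2020.04.12-shrunk' \
#        -H 'Content-Type: application/json' \
#        -d '{"settings": {"index.number_of_shards": 1}}'
```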

What is the output of:

GET /
GET /_cat/indices?v

If some outputs are too big, please share them on gist.github.com and link them here.

Once I get the info, I will share them at gist.github.com.

The interesting part is that I have a much smaller setup (only 4 GB of heap), and that system has 1500 shards with no issues:
curl -XGET localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "ClusterX",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 1501,
  "active_shards" : 1501,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

Granted, in this setup the data stored in ES is really small compared to the system that has the issue:
Small setup:
curl -XGET 'http://localhost:9200/_cat/allocation?v'
shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
1501 1.8gb 16.9gb 91gb 107.9gb 15 127.0.0.1 127.0.0.1 DB0h-QU

I don't have the disk usage of the other system.

Could the amount of data cause the issue?

Thanks

Have a look at this blog post.

See the output of the ES queries:

Thanks

You have a lot of "small" indices with 6 shards. Let's take this one as an example:

green  open   app_3-2020.04.12                           c4Q2Ik3YTDSYXbWkiXMBSw   6   0     736972            0     44.2mb         44.2mb

6 shards for 44.2mb is a waste of resources. It's like running 6 MySQL databases on the same physical machine to hold 44 MB of data...

The biggest one seems to be:

green  open   app_2-2020.04.10                           PPTEt1atTG6MlG3Pl7jqpg   6   0  407957499            0     33.3gb         33.3gb

I think that 1 or 2 shards should be enough to manage them.

So in general, if you reduce all indices from 6 shards to 1 shard, you will end up with around 250 shards for your node, which will probably be fine.

Hope this helps.

Thanks David.
yes, I agree there are a lot of indices, where we can improve. That helps.
One more question: Having an index with a size of about 300 gb. Are 6 shards fine for those?

The ideal maximum shard size is 20 to 50 GB, depending on your own tests.
It could work, but maybe you'll need more.

The issue was that there were indices with a size of a couple of MB, but a lot of files (3k+) were created for those. There was even one index which had 2300 documents but 3700 files. After shrinking this index to just one shard, the number of files dropped to fewer than 200.
How is it possible that so many files were created for such a small index?

Lucene needs some files to run. Also, Lucene has this concept of segments, which can be merged when needed, or force-merged if you call the Force Merge API.

Plus the transaction log...

All that can lead to many files.
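For completeness, a sketch of a force merge call on an index that is no longer written to (hypothetical index name; merging down to one segment is generally only advisable for read-only indices):

```shell
ES='localhost:9200'
INDEX='indexA-2020.03.12'    # hypothetical index that is no longer updated
echo "POST ${ES}/${INDEX}/_forcemerge?max_num_segments=1"
# Against the live cluster:
#   curl -XPOST "${ES}/${INDEX}/_forcemerge?max_num_segments=1&pretty"
```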

One reason might be that you are creating a lot of small segments by frequently updating the same document, or perhaps scroll queries are interfering with merging. How do you use that index?

Those indices get updated only for 24 hours. Once an hour, a series of queries is performed against specific keys, and for each key one document of the index is updated with the retrieved aggregation. So in that specific example, we only had 2300 documents, each updated once an hour.

Hi David,
I think it is a very good idea to use the _forcemerge API for indices which are not updated anymore. That could save a lot of I/O resources in the long run, especially if there is a requirement to keep data for a very long time (365+ days) for historical reasons.