I have an Elasticsearch 6.2.3 setup (single node). The system had been running smoothly for some time, until the disk was about to run out of space and I had to expand the hard drive. After the expansion, Elasticsearch came up, but after a while the health status turned red:
curl -s 'localhost:9200/_cluster/health?pretty'
{
"cluster_name" : "ClusterX",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 880,
"active_shards" : 880,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 15,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 98.32402234636871
}
The affected shards are UNASSIGNED with reason ALLOCATION_FAILED:
curl -XGET localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason| grep UNASSIGNED
indexA-2020.03.12 1 p UNASSIGNED ALLOCATION_FAILED
indexA-2020.03.12 4 p UNASSIGNED ALLOCATION_FAILED
...
The _cluster/allocation/explain call says that too many files are open:
curl -XGET localhost:9200/_cluster/allocation/explain?pretty
{
"index" : "indexA-2020.03.12",
"shard" : 4,
"primary" : true,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "ALLOCATION_FAILED",
"at" : "2020-04-22T10:40:10.827Z",
"failed_allocation_attempts" : 5,
"details" : "failed shard on node [MXMUEqXRTxKfiSQCeQQvTg]: failed recovery, failure RecoveryFailedException[[indexA-2020.03.12][4]: Recovery failed on {MXMUEqX}{MXMUEqXRTxKfiSQCeQQvTg}{wsB9s7OWSRq7zLXX26d5NQ}{127.0.0.1}{127.0.0.1:9300}{rack_id=r1}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: FileSystemException[/opt/elasticsearch/nodes/0/indices/2G7F7nCQT7y_yMCybVV9xQ/4/translog/translog-251.ckp: Too many open files]; ",
"last_allocation_status" : "no"
},
"can_allocate" : "no",
"allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
"node_allocation_decisions" : [
{
"node_id" : "MXMUEqXRTxKfiSQCeQQvTg",
"node_name" : "MXMUEqX",
"transport_address" : "127.0.0.1:9300",
"node_attributes" : {
"rack_id" : "r1"
},
"node_decision" : "no",
"store" : {
"in_sync" : true,
"allocation_id" : "_YsZ8lYWT069lYzts-Eg2A"
},
"deciders" : [
{
"decider" : "max_retry",
"decision" : "NO",
"explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-04-22T10:40:10.827Z], failed_attempts[5], delayed=false, details[failed shard on node [MXMUEqXRTxKfiSQCeQQvTg]: failed recovery, failure RecoveryFailedException[[indexA-2020.03.12][4]: Recovery failed on {MXMUEqX}{MXMUEqXRTxKfiSQCeQQvTg}{wsB9s7OWSRq7zLXX26d5NQ}{127.0.0.1}{127.0.0.1:9300}{rack_id=r1}]; nested: IndexShardRecoveryException[failed to recover from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: FileSystemException[/opt/elasticsearch/nodes/0/indices/2G7F7nCQT7y_yMCybVV9xQ/4/translog/translog-251.ckp: Too many open files]; ], allocation_status[deciders_no]]]"
}
]
}
]
}
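Since the underlying failure is "Too many open files", I also look at the node's file descriptor usage versus its limit, roughly like this (the filter_path is only there to trim the output):
curl -s 'localhost:9200/_nodes/stats/process?filter_path=**.open_file_descriptors,**.max_file_descriptors&pretty'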
Once I truncate the translog with the elasticsearch-translog truncate tool, the system comes back up, but after a while the same issue occurs again, with an index from a different date.
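For reference, the recovery each time looks roughly like this (Elasticsearch stopped first; the translog directory is the one from the failure message above, and the reroute retry is what the max_retry decider asks for):
bin/elasticsearch-translog truncate -d /opt/elasticsearch/nodes/0/indices/2G7F7nCQT7y_yMCybVV9xQ/4/translog/
curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true'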
All of the affected indices are dated before the day the partition was expanded.
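My guess is that the permanent fix is to raise the open-file limit for the Elasticsearch process instead of repeatedly truncating translogs. Assuming Elasticsearch runs under systemd (which is an assumption on my part), something like this drop-in, followed by a daemon-reload and restart, should do it (65536 is the minimum the Elasticsearch docs recommend):
# /etc/systemd/system/elasticsearch.service.d/override.conf
[Service]
LimitNOFILE=65536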
Any idea how to fix this permanently?
Thanks