Hi,
We are running Elasticsearch 6.5.4 with 3 master, 2 data, and 5 client pods in a Kubernetes environment.
We can see 155 unassigned shards; when we check the allocation, it shows the error below:
[root@csf-cint-control-01 cloud-user]# curl -k 'https://elasticsearch.paas:9200/_cluster/allocation/explain?pretty' -H "x-forwarded-for: XXXX" -H "x-forwarded-by: XXXX" -H "x-proxy-user: admin" -H "x-proxy-roles: admin"
{
"index" : "log-cmdb-log-2019.07.27",
"shard" : 4,
"primary" : false,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "ALLOCATION_FAILED",
"at" : "2019-07-27T12:30:18.073Z",
"failed_allocation_attempts" : 5,
"details" : "failed shard on node [nGWyEuPITxWkCic6YbQesA]: failed recovery, failure RecoveryFailedException[[log-cmdb-log-2019.07.27][4]: Recovery failed from {csf-compaas-cluster-belk-elasticsearch-data-1}{46hhOMTVT0eZ5Bc6AZ4Emg}{HM0PBBboQ7-c8i38JVm9IA}{192.168.51.123}{192.168.51.123:9300} into {csf-compaas-cluster-belk-elasticsearch-data-0}{nGWyEuPITxWkCic6YbQesA}{roFKY35nSTap2WwRya5VyA}{192.168.85.68}{192.168.85.68:9300}]; nested: RemoteTransportException[[csf-compaas-cluster-belk-elasticsearch-data-1][192.168.51.123:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[csf-compaas-cluster-belk-elasticsearch-data-0][192.168.85.68:9300][internal:index/shard/recovery/prepare_translog]]; nested: EngineCreationFailureException[failed to create engine]; nested: FileSystemException[/data/data/nodes/0/indices/toVMYLWeTh-VkUuvhHmhHg/4/translog/translog.ckp: Too many open files in system]; ",
"last_allocation_status" : "no_attempt"
},
"can_allocate" : "no",
"allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
"node_allocation_decisions" : [
{
"node_id" : "46hhOMTVT0eZ5Bc6AZ4Emg",
"node_name" : "csf-compaas-cluster-belk-elasticsearch-data-1",
"transport_address" : "192.168.147.139:9300",
"node_decision" : "no",
"deciders" : [
{
"decider" : "max_retry",
"decision" : "NO",
"explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-07-27T12:30:18.073Z], failed_attempts[5], delayed=false, details[failed shard on node [nGWyEuPITxWkCic6YbQesA]: failed recovery, failure RecoveryFailedException[[log-cmdb-log-2019.07.27][4]: Recovery failed from {csf-compaas-cluster-belk-elasticsearch-data-1}{46hhOMTVT0eZ5Bc6AZ4Emg}{HM0PBBboQ7-c8i38JVm9IA}{192.168.51.123}{192.168.51.123:9300} into {csf-compaas-cluster-belk-elasticsearch-data-0}{nGWyEuPITxWkCic6YbQesA}{roFKY35nSTap2WwRya5VyA}{192.168.85.68}{192.168.85.68:9300}]; nested: RemoteTransportException[[csf-compaas-cluster-belk-elasticsearch-data-1][192.168.51.123:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[csf-compaas-cluster-belk-elasticsearch-data-0][192.168.85.68:9300][internal:index/shard/recovery/prepare_translog]]; nested: EngineCreationFailureException[failed to create engine]; nested: FileSystemException[/data/data/nodes/0/indices/toVMYLWeTh-VkUuvhHmhHg/4/translog/translog.ckp: Too many open files in system]; ], allocation_status[no_attempt]]]"
},
{
"decider" : "same_shard",
"decision" : "NO",
"explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[log-cmdb-log-2019.07.27][4], node[46hhOMTVT0eZ5Bc6AZ4Emg], [P], s[STARTED], a[id=S8H3uyquSUC9FyIegXh9HA]]"
}
]
},
{
"node_id" : "nGWyEuPITxWkCic6YbQesA",
"node_name" : "csf-compaas-cluster-belk-elasticsearch-data-0",
"transport_address" : "192.168.73.142:9300",
"node_decision" : "no",
"deciders" : [
{
"decider" : "max_retry",
"decision" : "NO",
"explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-07-27T12:30:18.073Z], failed_attempts[5], delayed=false, details[failed shard on node [nGWyEuPITxWkCic6YbQesA]: failed recovery, failure RecoveryFailedException[[log-cmdb-log-2019.07.27][4]: Recovery failed from {csf-compaas-cluster-belk-elasticsearch-data-1}{46hhOMTVT0eZ5Bc6AZ4Emg}{HM0PBBboQ7-c8i38JVm9IA}{192.168.51.123}{192.168.51.123:9300} into {csf-compaas-cluster-belk-elasticsearch-data-0}{nGWyEuPITxWkCic6YbQesA}{roFKY35nSTap2WwRya5VyA}{192.168.85.68}{192.168.85.68:9300}]; nested: RemoteTransportException[[csf-compaas-cluster-belk-elasticsearch-data-1][192.168.51.123:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[csf-compaas-cluster-belk-elasticsearch-data-0][192.168.85.68:9300][internal:index/shard/recovery/prepare_translog]]; nested: EngineCreationFailureException[failed to create engine]; nested: FileSystemException[/data/data/nodes/0/indices/toVMYLWeTh-VkUuvhHmhHg/4/translog/translog.ckp: Too many open files in system]; ], allocation_status[no_attempt]]]"
}
]
}
]
}
What does the error "Too many open files in system" mean here?
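We also noticed that the explanation above says to manually call /_cluster/reroute?retry_failed=true once the underlying cause is fixed; presumably that would look something like the following (with the same headers as the first command):

# Retry allocations that exceeded the max_retry limit, after the root cause is resolved
curl -k -XPOST 'https://elasticsearch.paas:9200/_cluster/reroute?retry_failed=true' \
  -H "x-forwarded-for: XXXX" -H "x-forwarded-by: XXXX" \
  -H "x-proxy-user: admin" -H "x-proxy-roles: admin"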
We read in the linked thread (Too many open files - #4 by onthefloorr) that we can check the "nofile" setting in /etc/security/limits.conf for the maximum number of file descriptors, and that it can be changed with superuser privileges.
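From inside the pod we can only read the current limits without root, for example (assuming a standard Linux /proc layout):

# Per-process soft limit on open files for the current shell
ulimit -n
# System-wide maximum and current usage of file handles
cat /proc/sys/fs/file-max
cat /proc/sys/fs/file-nr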
Since Elasticsearch runs inside a pod, we do not have root privileges to modify such system-level parameters on a running cluster. Can you please suggest how to resolve this without rebuilding our Docker image?
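For illustration, commands like the following are what we would normally use to raise the limits, but they need privileges we do not have inside the pod (the values are only examples):

# Raise the per-process open-files limit (requires root / CAP_SYS_RESOURCE)
ulimit -n 2097152
# Raise the system-wide file-handle limit (requires root; fs.file-max applies to the whole node)
sysctl -w fs.file-max=2097152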
When we check via _nodes/stats/process, the "max_file_descriptors" value is already 1048576.
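For reference, this can be checked with something like the command below (same headers as above; the filter_path parameter just trims the response to the relevant fields):

# Show current and maximum file descriptors per node
curl -k 'https://elasticsearch.paas:9200/_nodes/stats/process?pretty&filter_path=nodes.*.name,nodes.*.process.open_file_descriptors,nodes.*.process.max_file_descriptors' \
  -H "x-forwarded-for: XXXX" -H "x-forwarded-by: XXXX" \
  -H "x-proxy-user: admin" -H "x-proxy-roles: admin"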
Could you please help us with this issue?
Thanks.