Dear Community,
For a few days now I have had unassigned shards due to allocation failures, so the cluster goes yellow and sometimes even red when the Elasticsearch service goes down.
I have the following architecture:
1 master node
2 data nodes (the master is one of them)
2 client nodes
4 Elasticsearch nodes in total.
1.7 k indices
1.9 b documents
2.9 TB of data
The indices on the data nodes are configured with 5 primary shards + 1 complete replica each.
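For reference, I believe those per-index settings can be confirmed with something like this (host and index name are only examples, taken from the allocation explain output below):

curl -X GET "http://elastic-xx.domain.local:9200/winlogbeat-2018.08.20/_settings/index.number_of_*?pretty"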
[root@elastic-xx ~]# curl -X GET http://elastic-xx.domain.local:9200/_cluster/allocation/explain?pretty
{
"index" : "winlogbeat-2018.08.20",
"shard" : 2,
"primary" : false,
"current_state" : "unassigned",
"unassigned_info" : {
"reason" : "ALLOCATION_FAILED",
"at" : "2018-08-20T18:59:53.330Z",
"failed_allocation_attempts" : 5,
"details" : "failed shard on node [uq3vVKvkRPKC94OOgfyrCA]: failed recovery, failure RecoveryFailedException[[winlogbeat-2018.08.20][2]: Recovery failed from {elastic-02}{QrlS-a6EThqug57OnWPdmg}{oOMM9zM_QESlyThKrefNGA}{xxx.xxx.xxx.xxx}{xxx.xxx.xxx.xxx:9300} into {elastic-01}{uq3vVKvkRPKC94OOgfyrCA}{2anUChAIQF-xHvw4kUzBRA}{xxx.xxx.xxx.xxx}{xxx.xxx.xxx.xxx:9300}]; nested: RemoteTransportException[[elastic-xx][xxx.xxx.xxx.xxx:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[elastic-xx][xxx.xxx.xxx.xxx:9300][internal:index/shard/recovery/prepare_translog]]; nested: EngineCreationFailureException[failed to create engine]; nested: FileSystemException[/elk/elasticsearch/nodes/0/indices/8w4QBGSVTRq7jopP-n7L-w/2/translog/translog-3886.ckp: Too many open files]; ",
"last_allocation_status" : "no_attempt"
},
"can_allocate" : "no",
"allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
"node_allocation_decisions" : [
{
"node_id" : "QrlS-a6EThqug57OnWPdmg",
"node_name" : "elastic-xx",
"transport_address" : "xxx.xxx.xxx.xxx:9300",
"node_decision" : "no",
"deciders" : [
{
"decider" : "max_retry",
"decision" : "NO",
"explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2018-08-20T18:59:53.330Z], failed_attempts[5], delayed=false, details[failed shard on node [uq3vVKvkRPKC94OOgfyrCA]: failed recovery, failure RecoveryFailedException[[winlogbeat-2018.08.20][2]: Recovery failed from {elastic-02}{QrlS-a6EThqug57OnWPdmg}{oOMM9zM_QESlyThKrefNGA}{10.10.68.24}{10.10.68.24:9300} into {elastic-01}{uq3vVKvkRPKC94OOgfyrCA}{2anUChAIQF-xHvw4kUzBRA}{xxx.xxx.xxx.xxx}{xxx.xxx.xxx.xxx:9300}]; nested: RemoteTransportException[[elastic-xx][xxx.xxx.xxx.xxx:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[elastic-xx][xxx.xxx.xxx.xxx:9300][internal:index/shard/recovery/prepare_translog]]; nested: EngineCreationFailureException[failed to create engine]; nested: FileSystemException[/elk/elasticsearch/nodes/0/indices/8w4QBGSVTRq7jopP-n7L-w/2/translog/translog-3886.ckp: Too many open files]; ], allocation_status[no_attempt]]]"
},
{
"decider" : "same_shard",
"decision" : "NO",
"explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[winlogbeat-2018.08.20][2], node[QrlS-a6EThqug57OnWPdmg], [P], s[STARTED], a[id=EFCtso3bSP2dDU9VbbDfOw]]"
}
]
},
{
"node_id" : "uq3vVKvkRPKC94OOgfyrCA",
"node_name" : "elastic-xx",
"transport_address" : "xxx.xxx.xxx.xxx:9300",
"node_decision" : "no",
"deciders" : [
{
"decider" : "max_retry",
"decision" : "NO",
"explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2018-08-20T18:59:53.330Z], failed_attempts[5], delayed=false, details[failed shard on node [uq3vVKvkRPKC94OOgfyrCA]: failed recovery, failure RecoveryFailedException[[winlogbeat-2018.08.20][2]: Recovery failed from {elastic-02}{QrlS-a6EThqug57OnWPdmg}{oOMM9zM_QESlyThKrefNGA}{xxx.xxx.xxx.xxx}{xxx.xxx.xxx.xxx:9300} into {elastic-01}{uq3vVKvkRPKC94OOgfyrCA}{2anUChAIQF-xHvw4kUzBRA}{xxx.xxx.xxx.xxx}{xxx.xxx.xxx.xxx:9300}]; nested: RemoteTransportException[[elastic-xx][xxx.xxx.xxx.xxx:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[elastic-01][xxx.xxx.xxx.xxx:9300][internal:index/shard/recovery/prepare_translog]]; nested: EngineCreationFailureException[failed to create engine]; nested: FileSystemException[/elk/elasticsearch/nodes/0/indices/8w4QBGSVTRq7jopP-n7L-w/2/translog/translog-3886.ckp: Too many open files]; ], allocation_status[no_attempt]]]"
}
]
}
]
}
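The explanation above says to manually call /_cluster/reroute?retry_failed=true once the maximum number of failed allocation attempts is exceeded. If I understand correctly, that would be something like (host name as an example):

curl -X POST "http://elastic-xx.domain.local:9200/_cluster/reroute?retry_failed=true&pretty"

but I suppose this only helps once the underlying "Too many open files" problem is solved.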
All the nodes in the cluster have max_file_descriptors set to 65536:
{
"nodes": {
"DIyPbW4WQoSHFrtAYc0gmA": {
"process": {
"max_file_descriptors": 65536
}
},
"QrlS-a6EThqug57OnWPdmg": {
"process": {
"max_file_descriptors": 65536
}
},
"uq3vVKvkRPKC94OOgfyrCA": {
"process": {
"max_file_descriptors": 65536
}
},
"zxSDt-cRTeOIQwNWzJgZWA": {
"process": {
"max_file_descriptors": 65536
}
}
}
}
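(For completeness, I gathered that output roughly like this, filtering the node stats down to the file descriptor limit:)

curl -X GET "http://elastic-xx.domain.local:9200/_nodes/stats/process?filter_path=**.max_file_descriptors&pretty"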
I thought about adding a data node, but I do not want more replicas or to consume much more storage.
Do you have any recommendations on how to optimise the cluster and fix this issue?
Thanks a lot for your help.
Best Regards, Edouard Fazenda.