Too many opened files

Hi,
I have too many files error issue in ES cluster. We are getting below error.

 2017-03-30 13:51:34,079][WARN ][cluster.action.shard     ] [data-node-1] [system1-2017.03.30][2] received shard failed for [system1-2017.03.30][2], node[sBLY1t2qS_uNv-OYbzD98w], [R], s[INITIALIZING], indexUUID [ouR2g5KuQ6-62iq8O64PXQ], reason [shard failure [failed recovery][RecoveryFailedException[[system1-2017.03.30][2]: Recovery failed from [data-node-2][FkubAmgDSjm93WiKOrnnTg][search02][inet[/xxx.xxx.xx.xx:9300]]{master=false} into [data-node-1][sBLY1t2qS_uNv-OYbzD98w][LogSearch01][inet[/xxx.xxx.xx.xx:9300]]{master=true}]; nested: RemoteTransportException[[data-node-2][inet[/xxx.xxx.xx.xx:9300]][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[[system1-2017.03.30][2] Phase[2] Execution failed]; nested: RemoteTransportException[[data-node-1][inet[/xxx.xxx.xx.xx:9300]][internal:index/shard/recovery/prepare_translog]]; nested: EngineCreationFailureException[[system1-2017.03.30][2] failed to open reader on writer]; nested: FileSystemException[/log/elasticsearch/prod-elasticsearch/nodes/0/indices/system1-2017.03.30/2/index/_u_Lucene41_0.tim: Too many open files]; ]]
    [2017-03-30 13:51:34,081][WARN ][indices.cluster          ] [data-node-1] [[system1-2017.03.30][2]] marking and sending shard failed due to [failed recovery]
    org.elasticsearch.indices.recovery.RecoveryFailedException: [system1-2017.03.30][2]: Recovery failed from [data-node-2][FkubAmgDSjm93WiKOrnnTg][search02][inet[/xxx.xxx.xx.xx:9300]]{master=false} into [data-node-1][sBLY1t2qS_uNv-OYbzD98w][LogSearch01][inet[/xxx.xxx.xx.xx:9300]]{master=true}
        at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:274)
        at org.elasticsearch.indices.recovery.RecoveryTarget.access$700(RecoveryTarget.java:69)
        at org.elasticsearch.indices.recovery.RecoveryTarget$RecoveryRunner.doRun(RecoveryTarget.java:550)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: org.elasticsearch.transport.RemoteTransportException: [data-node-2][inet[/xxx.xxx.xx.xx:9300]][internal:index/shard/recovery/start_recovery]
    Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [system1-2017.03.30][2] Phase[2] Execution failed
        at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:861)
        at org.elasticsearch.index.shard.IndexShard.recover(IndexShard.java:699)
        at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:125)
        at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:49)
        at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:146)
        at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:132)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

We have below cluster configuration:

  1. There are 3 ES nodes with 16 GB memory on each server and 9 GB allocated to ES service on each node. One is master+Data, other 2 nodes are only data nodes.
  2. We have time based indexing which creates indexes everyday. Example, system1-2017-04-03 and system1-2017-04-04, system1-2017-04-05 and so on........
  3. We have around 20 such systems. Logs from those many systems are stored in new index, every day.
  4. We have 3 replicas and 3 shards settings for every index.
  5. cat /proc/sys/fs/file-max
    512000
    ulimit -n
    65535

I want to know,
Whether time based indexing is correct or wrong?

is the shard and replica setting creating too many shards? Should we reduce the replica and shards? What is trade off analysis for shards and replica for best performance in case of 3 nodes?

Is there any documentation which clearly explains, young, old and survivor space?

If I allocate 9 GB to ES, then how that 9 GB space will be allocated to each one of old, young and survivor space?

Br,
Sunil.

How many indices you have in total?

Lots of small shards and indices are very inefficient and wastes resources. A good general guideline is to aim for average shard sizes of between a few GB to a few tens of GB. As you have many applications and indices, you may therefore want to consolidate these into fewer shared indices and/or reduce the number of primary shards per index. If you have a long retention period you may also want to consider weekly or monthly indices instead of daily ones in order to get the shard size up, but this depends on your daily data volumes.

This is not related to your issue BTW.

Total indices: 1910
Total shards: 16631
Document count: 348,151,431

It's 5500 shards per node. A way too much!

You definitely need to reduce the pressure on your nodes.

2 choices:

  • increase the number of nodes
  • reduce the number of shards

What is the typical size (disk space) of a single shard?

1 Like

That depends on the index(system1).
Some has shard size of 1.4 GB and some may be more than that and few shards having size in only KBs

How did you conclude that this is too much pressure on the node.
Is there any trade-off analysis to decide how many shards a cluster can have?
Is there any relation between no of shards, file-descriptors and ulimit?

Thanks Christian Dahlqvist.

Currently our cluster in running. If I want to change the shards and replica strategy, can I do that in running cluster?
Will there be any risk for older indices (more shards and more replicas) while searching?
Will my cluster be stable after this change?
Does it need any restart?

Yes.

Well. To give you a sort of comparison it's like to me running 5000 Database instances on the same physical machine. I would never do that! :slight_smile:

Hi,
Can you please explain that relation?

If you are on Elasticsearch 5.x you should be able to use the shrink API to reduce the primary shard count for all indices to 1. If you have a lot of very small indices, it is probably better to consolidate these by reindexing the data into monthly or weekly indices.

There is no exact limit for the number of shards a node can handle. If you can get the shard sizes up to the levels I described earlier, you will be able to hold more data per node without running into problems. For logging use cases with nodes having ~30GB heap I generally recommend having hundreds of shards per node, not thousands. As you have a significantly smaller heap I would expect your nodes to have fewer shards than that, but this is a simple rule of thumb.

https://www.elastic.co/guide/en/elasticsearch/reference/current/file-descriptors.html

Elasticsearch uses a lot of file descriptors or file handles. Running out of file descriptors can be disastrous and will most probably lead to data loss.

Each shard is a Lucene instance. Each Lucene instance uses a lot of file descriptors. If you have a lot of shards per node, you will probably overload that.

Hi David,
Thanks.
Now I am close to the solution of this.

If I reduce the no of primary shards, and replica, what will be effect on the previous indexes which has more shards and replicas? Will those be participating normally in searching?

If I close some indexes (Not Delete), then will that reduce, the no active shards?

No effect on existing indices. That said when the situation is stabilized I'd recommend using reindex API to reindex your data in smaller indices.

Closing helps a lot but be aware that they are not managed anymore which means no replication of shards when a node goes down.

1 Like

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.