Hi,
I am running into a "Too many open files" issue in our ES cluster. We are getting the error below:
[2017-03-30 13:51:34,079][WARN ][cluster.action.shard ] [data-node-1] [system1-2017.03.30][2] received shard failed for [system1-2017.03.30][2], node[sBLY1t2qS_uNv-OYbzD98w], [R], s[INITIALIZING], indexUUID [ouR2g5KuQ6-62iq8O64PXQ], reason [shard failure [failed recovery][RecoveryFailedException[[system1-2017.03.30][2]: Recovery failed from [data-node-2][FkubAmgDSjm93WiKOrnnTg][search02][inet[/xxx.xxx.xx.xx:9300]]{master=false} into [data-node-1][sBLY1t2qS_uNv-OYbzD98w][LogSearch01][inet[/xxx.xxx.xx.xx:9300]]{master=true}]; nested: RemoteTransportException[[data-node-2][inet[/xxx.xxx.xx.xx:9300]][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[[system1-2017.03.30][2] Phase[2] Execution failed]; nested: RemoteTransportException[[data-node-1][inet[/xxx.xxx.xx.xx:9300]][internal:index/shard/recovery/prepare_translog]]; nested: EngineCreationFailureException[[system1-2017.03.30][2] failed to open reader on writer]; nested: FileSystemException[/log/elasticsearch/prod-elasticsearch/nodes/0/indices/system1-2017.03.30/2/index/_u_Lucene41_0.tim: Too many open files]; ]]
[2017-03-30 13:51:34,081][WARN ][indices.cluster ] [data-node-1] [[system1-2017.03.30][2]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [system1-2017.03.30][2]: Recovery failed from [data-node-2][FkubAmgDSjm93WiKOrnnTg][search02][inet[/xxx.xxx.xx.xx:9300]]{master=false} into [data-node-1][sBLY1t2qS_uNv-OYbzD98w][LogSearch01][inet[/xxx.xxx.xx.xx:9300]]{master=true}
at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:274)
at org.elasticsearch.indices.recovery.RecoveryTarget.access$700(RecoveryTarget.java:69)
at org.elasticsearch.indices.recovery.RecoveryTarget$RecoveryRunner.doRun(RecoveryTarget.java:550)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.RemoteTransportException: [data-node-2][inet[/xxx.xxx.xx.xx:9300]][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [system1-2017.03.30][2] Phase[2] Execution failed
at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:861)
at org.elasticsearch.index.shard.IndexShard.recover(IndexShard.java:699)
at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:125)
at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:49)
at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:146)
at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:132)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
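For reference, this is how I have been checking descriptor usage against the limit on each node (a minimal sketch; it assumes ES answers on localhost:9200 and that pgrep can find the ES process):

# File descriptors currently open, as reported by each node
curl -s 'localhost:9200/_nodes/stats/process?pretty'    # see process.open_file_descriptors

# The limit the running ES process actually sees
curl -s 'localhost:9200/_nodes/process?pretty'          # see process.max_file_descriptors

# Cross-check directly against the OS
lsof -p $(pgrep -f org.elasticsearch.bootstrap.Elasticsearch) | wc -l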
Our cluster configuration is as follows:
- There are 3 ES nodes, each with 16 GB of RAM and 9 GB of heap allocated to the ES service. One node is master+data; the other two are data-only nodes.
- We use time-based indexing, which creates a new index every day, e.g. system1-2017.04.03, system1-2017.04.04, system1-2017.04.05, and so on.
- We have around 20 such systems, and each system's logs are written to its own new index every day.
- Every index is created with 3 shards and 3 replicas (see the shard arithmetic after this list).
- cat /proc/sys/fs/file-max returns 512000
- ulimit -n returns 65535
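Back-of-the-envelope arithmetic for what this configuration creates per day (my own math, so please correct me if it is off):

# 20 systems x 1 new index/day x 3 primary shards x (1 primary + 3 replicas)
#   = 20 x 3 x 4 = 240 shard copies per day, spread over only 3 nodes
# (ES never puts two copies of the same shard on one node, so with 4
#  copies and 3 nodes I assume one replica per shard stays unassigned.)
curl -s 'localhost:9200/_cat/shards' | wc -l     # total shard copies in the cluster
curl -s 'localhost:9200/_cat/indices?v'          # per-index shard/replica counts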
I want to know:
Is time-based indexing correct here, or is it the wrong approach for this setup?
Is our shard and replica setting creating too many shards? Should we reduce the replicas and shards, and what is the trade-off between shard and replica counts for best performance on 3 nodes? (I have put a sketch of how I would reduce replicas after these questions.)
Is there any documentation that clearly explains the young, old, and survivor spaces?
If I allocate 9 GB to ES, how is that 9 GB divided among the young, survivor, and old spaces? (The jstat commands I am using to inspect this are also below.)
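In case it helps with the shard/replica question, this is how I would reduce the replica count if that turns out to be the recommendation (a sketch only; the index pattern and the value 1 are my assumptions, nothing we have applied yet):

# Lower replicas on the existing daily indices (pattern is an assumption)
curl -XPUT 'localhost:9200/system1-*/_settings' -d '{
  "index": { "number_of_replicas": 1 }
}'
# New daily indices would need the same value in their index template.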
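For the heap questions, these are the jstat commands I am using to look at the generation sizes (jstat ships with the JDK; the pid lookup is an assumption about our install):

ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
jstat -gccapacity $ES_PID     # young (NGC, EC, S0C, S1C) and old (OGC, OC) capacities, in KB
jstat -gcutil $ES_PID 5000    # percentage used per space, sampled every 5 seconds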
Br,
Sunil.