Too many files open, recovering cluster

Felipe_Santos · November 19, 2015, 12:24pm

3 x master node 8GB 2vCPU
3 x data node 30GB 8vCPU

I am recovering a cluster and I am getting this error:

[2015-11-18 02:45:20,374][WARN ][action.bulk ] [Bruiser] failed to perform indices:data/write/bulk[s] on remote replica [Douglas Birely][qTqgsv_STVG3je5Fn7tEeg][zupme-1b-elasticsearch003.aws.zup.com.br][inet[/***]][events-vivo-2-20151118][1]
org.elasticsearch.transport.RemoteTransportException: [Douglas Birely][inet[/*****]][indices:data/write/bulk[s][r]]
Caused by: org.elasticsearch.index.engine.CreateFailedEngineException: [events-vivo-2-20151118][1] Create failed for [events#AVEY6UeqCpZT8FxkSXyC]
    at org.elasticsearch.index.engine.InternalEngine.create(InternalEngine.java:264)
    at org.elasticsearch.index.shard.IndexShard.create(IndexShard.java:483)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:569)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$ReplicaOperationTransportHandler.messageReceived(TransportShardReplicationOperationAction.java:250)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$ReplicaOperationTransportHandler.messageReceived(TransportShardReplicationOperationAction.java:229)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: /data/elasticsearch/zupme/nodes/0/indices/events-vivo-2-20151118/1/index/_gb.fdt (Too many open files)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
    .....

The file descriptor limits are already set to high values:

{
  "cluster_name" : "zupme",
  "nodes" : {
    "BnoAILz0Q3KQjSWtE2KNKw" : {
      "name" : "Sebastian Shaw",
      "transport_address" : "inet[*****]",
      "host" : "*****",
      "ip" : "****",
      "version" : "1.7.2",
      "build" : "e43676b",
      "http_address" : "inet[****]",
      "attributes" : { "master" : "false" },
      "process" : { "refresh_interval_in_millis" : 1000, "id" : 18701, "max_file_descriptors" : 131072, "mlockall" : true }
    },
    "o__dvgL7QfyIM-jRwlJlHg" : {
      "name" : "Milan",
      "transport_address" : "inet[****]",
      "host" : "*****",
      "ip" : "*****",
      "version" : "1.7.2",
      "build" : "e43676b",
      "http_address" : "inet[******]",
      "attributes" : { "data" : "false", "master" : "true" },
      "process" : { "refresh_interval_in_millis" : 1000, "id" : 16397, "max_file_descriptors" : 65536, "mlockall" : true }
    },
    "C11qTS23R5aX2t6TTSCGSA" : {
      "name" : "Seeker",
      "transport_address" : "inet[*****]",
      "host" : "*****",
      "ip" : "*****",
      "version" : "1.7.2",
      "build" : "e43676b",
      "http_address" : "inet[****]",
      "attributes" : { "master" : "false" },
      "process" : { "refresh_interval_in_millis" : 1000, "id" : 7885, "max_file_descriptors" : 131072, "mlockall" : true }
    },
    "o_5zCidLQpiXsjJGsBPbXw" : {
      "name" : "Phantom Eagle",
      "transport_address" : "inet[*****]",
      "host" : "*****",
      "ip" : "*****",
      "version" : "1.7.2",
      "build" : "e43676b",
      "http_address" : "inet[****]",
      "attributes" : { "data" : "false", "master" : "true" },
      "process" : { "refresh_interval_in_millis" : 1000, "id" : 16777, "max_file_descriptors" : 65536, "mlockall" : true }
    },
    "QbbbxsqmTlWpSRtWzdIhgg" : {
      "name" : "Lilith, the Daughter of Dracula ",
      "transport_address" : "inet[****]",
      "host" : "****",
      "ip" : "****",
      "version" : "1.7.2",
      "build" : "e43676b",
      "http_address" : "inet[****]",
      "attributes" : { "master" : "false" },
      "process" : { "refresh_interval_in_millis" : 1000, "id" : 25335, "max_file_descriptors" : 131072, "mlockall" : true }
    }
  }
}

If I have many indices that are not being used right now, neither for search nor for indexing, do they still keep file descriptors open?

Felipe_Santos · November 19, 2015, 1:39pm

All the machines are mostly idle

jpountz · November 19, 2015, 2:03pm

Yes. Indices need to be closed in order to use fewer file descriptors.
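As a rough illustration of what closing unused indices could look like for date-suffixed indices such as `events-vivo-2-20151118`, here is a minimal sketch. It assumes index names end in a `YYYYMMDD` suffix and that anything older than a chosen cutoff is safe to close; the helper names `indices_to_close` and `close_request` and the base URL are hypothetical. It only builds the `POST /{index}/_close` requests and does not send them.

```python
from datetime import date, datetime


def indices_to_close(index_names, cutoff):
    """Return index names whose trailing YYYYMMDD suffix is older than cutoff.

    Names without a valid 8-digit date suffix are left alone.
    """
    stale = []
    for name in index_names:
        suffix = name.rsplit("-", 1)[-1]
        if len(suffix) != 8 or not suffix.isdigit():
            continue  # no date suffix, do not touch this index
        try:
            day = datetime.strptime(suffix, "%Y%m%d").date()
        except ValueError:
            continue  # eight digits, but not a real calendar date
        if day < cutoff:
            stale.append(name)
    return stale


def close_request(base_url, index_name):
    """Build the (method, url) pair for the close index API call."""
    return ("POST", f"{base_url.rstrip('/')}/{index_name}/_close")


if __name__ == "__main__":
    # Hypothetical index list and cutoff, for illustration only.
    names = ["events-vivo-2-20151118", "events-vivo-2-20151001", "kibana"]
    for name in indices_to_close(names, date(2015, 11, 1)):
        print(*close_request("http://localhost:9200", name))
```

A closed index keeps its data on disk but cannot be searched or written to until it is opened again, so the cutoff should reflect how far back queries actually reach.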

Felipe_Santos · November 19, 2015, 2:10pm

Could you explain why the files remain open if the index is not being used?

jpountz · November 19, 2015, 2:16pm

This is how databases work in general. Opening files is a costly operation, so Elasticsearch opens files when opening the index and then keeps them open.

Felipe_Santos · November 19, 2015, 2:53pm

I am recovering the cluster. This is the current cluster health:

{
"cluster_name" : "zupme",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 6,
"number_of_data_nodes" : 3,
"active_primary_shards" : 15125,
"active_shards" : 24647,
"relocating_shards" : 0,
"initializing_shards" : 6,
"unassigned_shards" : 5599,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 6845,
"number_of_in_flight_fetch" : 0
}

When unassigned_shards drops below 700, the nodes raise "too many open files" and recovery stops at that number (CPU is mostly idle). Is closing indices the only way to solve this? The master has one CPU at 100% while its other CPUs are idle, and recovery is very slow.

jpountz · November 19, 2015, 3:00pm

This is a lot of shards for only 3 data nodes; you should try to have fewer indices and/or fewer shards per index. I'm afraid open files are just the first thing that breaks: even if you were able to fix this issue, e.g. by letting the OS allocate more open files, something else would break.
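To put numbers on this, here is a back-of-the-envelope sketch using the cluster health output posted above and the `max_file_descriptors` value of 131072 reported by the data nodes. It assumes shards end up spread evenly across the data nodes and ignores descriptors used for sockets and logs; the function name `shard_load` is made up for this example.

```python
def shard_load(health, max_fds_per_node):
    """Estimate shards per data node and the descriptor budget per shard."""
    total_shards = (
        health["active_shards"]
        + health["initializing_shards"]
        + health["unassigned_shards"]
    )
    # Assumption: shards are balanced evenly over the data nodes.
    shards_per_node = total_shards // health["number_of_data_nodes"]
    # How many open files each shard may hold before the node hits its limit.
    fds_per_shard = max_fds_per_node // shards_per_node
    return total_shards, shards_per_node, fds_per_shard


# Values copied from the cluster health output in this thread.
health = {
    "number_of_data_nodes": 3,
    "active_shards": 24647,
    "initializing_shards": 6,
    "unassigned_shards": 5599,
}

total, per_node, budget = shard_load(health, 131072)
print(f"{total} shards in total, about {per_node} per data node")
print(f"about {budget} file descriptors available per shard")
```

Under these assumptions that is roughly 10,000 shards per data node and about a dozen descriptors per shard. Each shard is a Lucene index made of several segments with several files each, so a budget that small runs out before every shard is assigned.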

Felipe_Santos · November 19, 2015, 3:01pm

Is there a document with a formula for choosing the number of shards? I don't think we can have fewer indices.

Felipe_Santos · November 19, 2015, 3:06pm

And why is reallocating the unassigned shards so slow, about 1 shard per second, with no errors in the logs?

jpountz · November 19, 2015, 3:07pm

There isn't really a formula; this would depend on the hardware, mappings, etc., but hundreds of shards per node is already a lot.

Why can't you have fewer indices? Sometimes you can share data, e.g. for several users in the same index. See e.g. https://vimeo.com/44716955 from 13'45.
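One way to share an index between users, offered here as an assumption and not as a summary of the linked talk, is a filtered alias with routing: each user gets an alias that filters on a user field and routes to a single shard. The sketch below only builds the request body for `POST /_aliases`; the field name `user_id`, the alias naming scheme, and the helper names are hypothetical.

```python
import json


def user_alias_action(index, user_id, field="user_id"):
    """Build one 'add' action for a per-user filtered alias with routing."""
    return {
        "add": {
            "index": index,
            "alias": f"{index}-user-{user_id}",  # hypothetical naming scheme
            "filter": {"term": {field: user_id}},  # only this user's documents
            "routing": str(user_id),  # keep one user's documents on one shard
        }
    }


def aliases_body(actions):
    """Serialize the actions into the JSON body of POST /_aliases."""
    return json.dumps({"actions": actions}, indent=2)


if __name__ == "__main__":
    # Hypothetical shared index and user ids, for illustration only.
    actions = [user_alias_action("events-shared", uid) for uid in ("vivo", "acme")]
    print(aliases_body(actions))
```

Searching and indexing through such an alias touches one shard of the shared index per user, which is the usual argument for why sharing an index does not have to make per-user queries slower.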

Felipe_Santos · November 19, 2015, 3:10pm

Because a single user could have millions of events per day, so search and indexing would be slow. I will take a look at the video.

Thanks a lot

Felipe_Santos · November 19, 2015, 3:50pm

Any tips on why cluster recovery is so slow? The CPUs are mostly idle. Is it because of the number of shards?
