Too many files open, recovering cluster


(Felipe Santos) #1

3 x master node 8GB 2vCPU
3 x data node 30GB 8vCPU

I am recovering a cluster and I am getting this error:

[2015-11-18 02:45:20,374][WARN ][action.bulk ] [Bruiser] failed to perform indices:data/write/bulk[s] on remote replica [Douglas Birely][qTqgsv_STVG3je5Fn7tEeg][zupme-1b-elasticsearch003.aws.zup.com.br][inet[/***]][events-vivo-2-20151118][1]
org.elasticsearch.transport.RemoteTransportException: [Douglas Birely][inet[/*****]][indices:data/write/bulk[s][r]]
Caused by: org.elasticsearch.index.engine.CreateFailedEngineException: [events-vivo-2-20151118][1] Create failed for [events#AVEY6UeqCpZT8FxkSXyC]
    at org.elasticsearch.index.engine.InternalEngine.create(InternalEngine.java:264)
    at org.elasticsearch.index.shard.IndexShard.create(IndexShard.java:483)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:569)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$ReplicaOperationTransportHandler.messageReceived(TransportShardReplicationOperationAction.java:250)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$ReplicaOperationTransportHandler.messageReceived(TransportShardReplicationOperationAction.java:229)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: /data/elasticsearch/zupme/nodes/0/indices/events-vivo-2-20151118/1/index/_gb.fdt (Too many open files)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
    .....

File descriptors are set to high values.

{
  "cluster_name" : "zupme",
  "nodes" : {
    "BnoAILz0Q3KQjSWtE2KNKw" : {
      "name" : "Sebastian Shaw",
      "transport_address" : "inet[*****]",
      "host" : "*****",
      "ip" : "****",
      "version" : "1.7.2",
      "build" : "e43676b",
      "http_address" : "inet[****]",
      "attributes" : { "master" : "false" },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 18701,
        "max_file_descriptors" : 131072,
        "mlockall" : true
      }
    },
    "o__dvgL7QfyIM-jRwlJlHg" : {
      "name" : "Milan",
      "transport_address" : "inet[****]",
      "host" : "*****",
      "ip" : "*****",
      "version" : "1.7.2",
      "build" : "e43676b",
      "http_address" : "inet[******]",
      "attributes" : { "data" : "false", "master" : "true" },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 16397,
        "max_file_descriptors" : 65536,
        "mlockall" : true
      }
    },
    "C11qTS23R5aX2t6TTSCGSA" : {
      "name" : "Seeker",
      "transport_address" : "inet[*****]",
      "host" : "*****",
      "ip" : "*****",
      "version" : "1.7.2",
      "build" : "e43676b",
      "http_address" : "inet[****]",
      "attributes" : { "master" : "false" },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 7885,
        "max_file_descriptors" : 131072,
        "mlockall" : true
      }
    },
    "o_5zCidLQpiXsjJGsBPbXw" : {
      "name" : "Phantom Eagle",
      "transport_address" : "inet[*****]",
      "host" : "*****",
      "ip" : "*****",
      "version" : "1.7.2",
      "build" : "e43676b",
      "http_address" : "inet[****]",
      "attributes" : { "data" : "false", "master" : "true" },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 16777,
        "max_file_descriptors" : 65536,
        "mlockall" : true
      }
    },
    "QbbbxsqmTlWpSRtWzdIhgg" : {
      "name" : "Lilith, the Daughter of Dracula ",
      "transport_address" : "inet[****]",
      "host" : "****",
      "ip" : "****",
      "version" : "1.7.2",
      "build" : "e43676b",
      "http_address" : "inet[****]",
      "attributes" : { "master" : "false" },
      "process" : {
        "refresh_interval_in_millis" : 1000,
        "id" : 25335,
        "max_file_descriptors" : 131072,
        "mlockall" : true
      }
    }
  }
}

If I have many indices that are not currently being used, neither for search nor for indexing, do they still keep file descriptors open?
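
The per-node limits in the nodes info output above can be tabulated with a short script. This is only a minimal sketch: it assumes the response has already been loaded into a Python dict, and the function name and the trimmed sample data (two of the nodes shown above) are illustrative.

```python
def max_fds_by_node(nodes_info):
    """Map node name -> max_file_descriptors from a nodes info response dict."""
    return {
        node["name"]: node["process"]["max_file_descriptors"]
        for node in nodes_info["nodes"].values()
    }


# Trimmed sample in the shape of the output above (illustration only).
sample = {
    "cluster_name": "zupme",
    "nodes": {
        "BnoAILz0Q3KQjSWtE2KNKw": {
            "name": "Sebastian Shaw",
            "process": {"max_file_descriptors": 131072},
        },
        "o__dvgL7QfyIM-jRwlJlHg": {
            "name": "Milan",
            "process": {"max_file_descriptors": 65536},
        },
    },
}

limits = max_fds_by_node(sample)
for name, limit in sorted(limits.items()):
    print(f"{name}: {limit}")
```

In the output posted above, the nodes without the "data": "false" attribute report a limit of 131072 and the master-only nodes report 65536.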


(Felipe Santos) #2

All the machines are mostly idle


(Adrien Grand) #3

Yes. Indices need to be closed in order to use fewer file descriptors.


(Felipe Santos) #4

Could you explain why they remain open if they are not being used?


(Adrien Grand) #5

This is how databases work in general. Opening files is a costly operation, so elasticsearch opens files when opening the index and then keeps them open.
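
Since files stay open for as long as the index is open, an index that is no longer read or written has to be closed explicitly to release them. A hedged sketch of building a request for the close index API (`POST /<index>/_close`) follows; the base URL and the index name are assumptions chosen for illustration, and the request is only constructed here, not sent.

```python
import urllib.request


def build_close_request(base_url, index_name):
    """Build (but do not send) a POST request for the close index API."""
    url = f"{base_url.rstrip('/')}/{index_name}/_close"
    return urllib.request.Request(url, data=b"", method="POST")


# Hypothetical node address and index name, for illustration only.
req = build_close_request("http://localhost:9200", "events-vivo-2-20151001")
print(req.get_method(), req.full_url)

# To actually close the index, the request would be sent with:
# urllib.request.urlopen(req)
```

A closed index cannot be searched or written to until it is reopened, so this only suits indices that are genuinely idle.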


(Felipe Santos) #6

I am recovering the cluster:

{
"cluster_name" : "zupme",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 6,
"number_of_data_nodes" : 3,
"active_primary_shards" : 15125,
"active_shards" : 24647,
"relocating_shards" : 0,
"initializing_shards" : 6,
"unassigned_shards" : 5599,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 6845,
"number_of_in_flight_fetch" : 0
}

When unassigned_shards drops below 700, the "too many open files" error is raised and recovery stalls at that number (CPU is mostly idle). Is closing indices the only way to solve this? The master has one CPU at 100% while the other CPUs are idle, and recovery is too slow.


(Adrien Grand) #7

This is a lot of shards for only 3 data nodes, you should try to have fewer indices and/or fewer shards per index. I'm afraid open files are just the first thing that breaks, but even if you were able to fix this issue eg. by letting the OS allocate more open files, something else would break.
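
To put "a lot of shards" in numbers, the cluster health output from post #6 can be combined with the data-node descriptor limit from post #1. This is simple arithmetic on the posted figures, assuming shards end up spread evenly across the 3 data nodes; it is not a sizing formula.

```python
# Figures copied from the cluster health output in post #6.
active_shards = 24647
initializing_shards = 6
unassigned_shards = 5599
data_nodes = 3

# Data-node limit from the nodes info output in post #1.
max_file_descriptors = 131072

total_shards = active_shards + initializing_shards + unassigned_shards
shards_per_node = total_shards // data_nodes          # once fully recovered
fds_per_shard = max_file_descriptors // shards_per_node

print(total_shards)      # 30252
print(shards_per_node)   # 10084
print(fds_per_shard)     # 12
```

Each shard is a Lucene index made up of several segment files, so a budget of roughly 12 descriptors per shard (before counting network sockets) leaves very little headroom.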


(Felipe Santos) #8

Is there a document with a formula for choosing the number of shards? I don't think we can have fewer indices.


(Felipe Santos) #9

And why is reallocating unassigned_shards so slow, at 1 shard per second, with no errors in the logs?


(Adrien Grand) #10

There isn't really a formula, this would depend on the hardware, mappings, etc. but hundreds of shards per node is already a lot.

Why can't you have fewer indices? Sometimes you can share data eg. for several users in the same index. See eg. https://vimeo.com/44716955 from 13'45
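
One common way to share an index between several users is a filtered alias with routing, so that each user still appears to have their own index. The sketch below only builds the request body for the `_aliases` API; the shared index name, the alias naming scheme, and the `user_id` field are assumptions for illustration, and whether this fits depends on the mappings and query patterns.

```python
import json


def user_alias_action(shared_index, user_id, user_field="user_id"):
    """Build an 'add' alias action that filters and routes by one user."""
    return {
        "add": {
            "index": shared_index,
            "alias": f"user-{user_id}",
            # Searches through the alias only see this user's documents.
            "filter": {"term": {user_field: user_id}},
            # Routing keeps one user's documents together on a single shard.
            "routing": str(user_id),
        }
    }


# Hypothetical shared index and user id.
body = {"actions": [user_alias_action("events-shared", "vivo-2")]}
print(json.dumps(body, indent=2))
```

The resulting body would be sent with `POST /_aliases`; the application then indexes and searches through `user-vivo-2` as if it were a dedicated index.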


(Felipe Santos) #11

Because a user could have millions of events per day, so search and indexing would be slow. I will take a look at the video.

Thanks a lot


(Felipe Santos) #12

Any tips on why cluster recovery is so slow while the CPUs are mostly idle? Is it because of the number of shards?

