I have followed the guidelines for setting the maximum number of open file descriptors to 65k and verified that the limit is in place across my cluster. However, after a crash last night in which we lost a master node, the cluster will not recover and is throwing a flood of "Too many open files" errors.
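For reference, this is roughly how I verified the limit on each node (a sketch; the pgrep pattern is an assumption about how our nodes are started, not something from the logs below):

# limit actually applied to the running Elasticsearch process
grep 'Max open files' /proc/$(pgrep -f org.elasticsearch)/limits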
Example error:
[2016-11-01 15:24:38,779][WARN ][cluster.action.shard ] [NODE] [INDEX1][0] received shard failed for [INDEX1][0], node[7ASpd2rMT1iORDwmSE_7Ug], [R], v[82], s[STARTED], a[id=G2r6AvbyQkiE9T9Qo5kAtA], indexUUID [T9woubwETvGmWAuLQn9xpA], message [failed to perform indices:data/write/bulk[s] on replica on node {NODE}{7ASpd2rMT1iORDwmSE_7Ug}{IP}{IP:9300}{master=false}], failure [RemoteTransportException[[NODE][IP:9300][indices:data/write/bulk[s][r]]]; nested: ElasticsearchException[failed to sync translog]; nested: NotSerializableExceptionWrapper[PATH/translog/translog.ckp: Too many open files]; ]
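If it helps with diagnosis, this is roughly how I can break down what that node actually has open (a sketch only, assuming the Elasticsearch PID; not output I have captured yet):

lsof -p $(pgrep -f org.elasticsearch) | wc -l                                   # total open descriptors
lsof -p $(pgrep -f org.elasticsearch) | awk '{print $5}' | sort | uniq -c | sort -rn   # grouped by descriptor type (REG, IPv4, ...)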
Example node stats:
"7ASpd2rMT1iORDwmSE_7Ug" : {
"timestamp" : 1478013569336,
"name" : "NODE_NAME",
"transport_address" : "IP:9300",
"host" : "NODE_NAME",
"ip" : [ "IP:9300", "NONE" ],
"attributes" : {
"master" : "false"
},
"process" : {
"timestamp" : 1478013569336,
"open_file_descriptors" : 64858,
"max_file_descriptors" : 65000,
"cpu" : {
"percent" : 97,
"total_in_millis" : 1470798750
},
"mem" : {
"total_virtual_in_bytes" : 27855085568
}
}
}
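Those stats come from the nodes stats API, roughly like this (assuming a node reachable on localhost:9200):

curl -s 'localhost:9200/_nodes/7ASpd2rMT1iORDwmSE_7Ug/stats/process?pretty'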
What kind of change does this indicate we need to make to the cluster for stability? Do we need more nodes? Fewer indices and shards?
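For sizing context, this is roughly how I can pull the current shard counts per node (again assuming localhost:9200):

curl -s 'localhost:9200/_cat/allocation?v'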
Is there an API call to raise the maximum number of file descriptors on a live Elasticsearch process? Or is the only option to change the system settings and then restart Elasticsearch? I'd like to avoid restarting data nodes if possible.
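The only OS-level workaround I can think of is raising the limit on the running process with prlimit, something like the sketch below (the PID lookup and the new limit value are assumptions, and I don't know whether Elasticsearch copes well with the limit being changed out from under it):

sudo prlimit --pid $(pgrep -f org.elasticsearch) --nofile=131072:131072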
Help appreciated... Thank you!