We recently upgraded to 6.1.3, which is working well and has shown minor performance gains over 5.6.3.
Today our cluster had a brief outage during a segment merge; the logs for that period report "too many open files". I double-checked the OS limit and verified ES honors it via `_nodes/stats` (`max_file_descriptors: 65535`). By "outage" I mean the cluster reported RED and failed to respond to health checks (which set off pagers).
Average open file descriptors right now are between 2,000 and 3,000 across the cluster. In Kibana I can see that, during the outage, the segment count for the index actively ingesting data dropped from around 1,000 to around 500 and then slowly began climbing again.
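For anyone wanting to compare numbers: per-node descriptor usage can be pulled from the nodes stats API (the `filter_path` below is just to trim the response to the relevant process stats fields):

```
GET _nodes/stats/process?filter_path=**.open_file_descriptors,**.max_file_descriptors
```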
I expect there is a setting that controls merge intensity and could prevent a repeat outage. The only non-default setting I've applied that might relate to this is:
"index.merge.scheduler.max_thread_count" : 6
Suggestions appreciated, thanks for the help.