From my observations, the above is not the key measure.
Here is what I am seeing:
[root@localhost ~]# curl -XGET 'localhost:9600/_node/stats/process?pretty'
{
  "host" : "localhost",
  "version" : "5.6.4",
  "http_address" : "127.0.0.1:9600",
  "id" : "82369e7f-0a82-4717-8944-5b9731d958ce",
  "name" : "logstashPROD",
  "process" : {
    "open_file_descriptors" : 14000,
    "peak_open_file_descriptors" : 14000,
    "max_file_descriptors" : 16384,
    "mem" : {
      "total_virtual_in_bytes" : 4886056960
    },
    "cpu" : {
      "total_in_millis" : 10504830,
      "percent" : 0,
      "load_average" : {
        "1m" : 0.0,
        "5m" : 0.01,
        "15m" : 0.0
      }
    }
  }
}
Note "open_file_descriptors" and "peak_open_file_descriptors": we see those spike over 1000, when they are normally between 100 and 500. When that measure reaches four or five digits, Logstash is on its way down (or is already down).
Oddly, when that counter is at five digits, "lsof -p <LogstashPID> | wc -l" still returns a much smaller number (under 400).
For example, in the case above, Logstash was down by the time the counter hit 10000 (probably much earlier).
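In case it helps anyone reproduce the pattern, here is a minimal watchdog sketch for that counter, assuming Python 3 on the box and the default monitoring API on localhost:9600 (the 1000 threshold is just my pick based on the normal 100-500 range, not anything official):

```python
import json
import urllib.request

# Same endpoint the curl command above hits.
STATS_URL = "http://localhost:9600/_node/stats/process"

def fd_alert(stats, threshold=1000):
    """Return True when open_file_descriptors exceeds the threshold."""
    return stats["process"]["open_file_descriptors"] > threshold

def fetch_stats(url=STATS_URL):
    """Fetch the same JSON document shown in the curl output above."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# Usage (with Logstash running), e.g. from a cron job:
#   if fd_alert(fetch_stats()):
#       print("open_file_descriptors past threshold -- Logstash may be dying")
```

This only watches the number the API reports, so it flags the symptom early; it does not explain the gap versus lsof.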
But I cannot yet figure out what to do from here. It looks like a Logstash bug to me, but I have not been able to get any traction with Elastic on it.