Logstash failing with "ERROR::EMFILE- Too many open files"

Hi. I'm having issues running Logstash with large sets of data. I'm trying to push data to Elasticsearch via Logstash, with both applications hosted on CentOS servers. I'm trying to send 1000 files per minute (each 2 KB in size) for 1 hour, but after sending exactly 3992 files (i.e., after about four minutes) I hit the following issue:

[2019-07-16T06:19:36,398][DEBUG][logstash.instrument.periodicpoller.cgroup] Error, cannot retrieve cgroups information {:exception=>"Errno::EMFILE", :message=>"Too many open files - Too many open files"}

I've tried increasing the system limits too. I've increased the pipeline workers to 6 and the batch size to 1000. I don't have any other idea of what else to change, or whether I've made a mistake in the configuration. Thanks in advance.
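(For reference: if the limit was raised with `ulimit` in an interactive shell, a systemd-managed Logstash service will not inherit it. On CentOS 7 the usual approach is a unit drop-in; this is a sketch, where the unit name `logstash.service` and the value 65536 are assumptions to adapt:)

```ini
# /etc/systemd/system/logstash.service.d/limits.conf
[Service]
LimitNOFILE=65536
```

Then run `systemctl daemon-reload` and restart Logstash. `cat /proc/<logstash-pid>/limits` shows the limit the running process actually has.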

What does your input configuration look like?

input {
  file {
    path => ["/home/stats/archive/*.csv"]
    start_position => "beginning"
    max_open_files => 40000
#    sincedb_path => "/tmp/.productsSince15.db"
  }
}
filter {
  csv {
    separator => ","
    skip_header => true
    columns => ["mitigation_id","time","total_in_packets","total_out_packets","total_dropped_packets","total_in_bytes","total_out_bytes","total_dropped_bytes","invalid_pkts_in_packets","invalid_pkts_out_packets","invalid_pkts_dropped_packets","invalid_pkts_in_bytes","invalid_pkts_out_bytes","invalid_pkts_dropped_bytes","acl_in_packets","acl_out_packets","acl_dropped_packets","acl_in_bytes","acl_out_bytes","acl_dropped_bytes","zombie_detection_in_packets","zombie_detection_out_packets","zombie_detection_dropped_packets","zombie_detection_in_bytes","zombie_detection_out_bytes","zombie_detection_dropped_bytes","geo_ip_in_packets","geo_ip_out_packets","geo_ip_dropped_packets","geo_ip_in_bytes","geo_ip_out_bytes","geo_ip_dropped_bytes","tcpsyn_in_packets","tcpsyn_out_packets","tcpsyn_dropped_packets","tcpsyn_in_bytes","tcpsyn_out_bytes","tcpsyn_dropped_bytes","payload_in_packets","payload_out_packets","payload_dropped_packets","payload_in_bytes","payload_out_bytes","payload_dropped_bytes", "http_mlfd_in_packets","http_mlfd_out_packets","http_mlfd_dropped_packets","http_mlfd_in_bytes","http_mlfd_out_bytes","http_mlfd_dropped_bytes","dns_mlfd_in_packets","dns_mlfd_out_packets","dns_mlfd_dropped_packets","dns_mlfd_in_bytes","dns_mlfd_out_bytes","dns_mlfd_dropped_bytes","traffic_management_in_packets","traffic_management_out_packets","traffic_management_dropped_packets","traffic_management_in_bytes","traffic_management_out_bytes","traffic_management_dropped_bytes"]
  }
  date {
    match => [ "time", "YYYY-MM-dd HH:mm:ss" ]
    target => "timestamp"
  }

  mutate {
    remove_field => [ "message" ]
  }

mutate {
  convert => { 
               "mitigation_id" => "integer_eu"
               "total_in_packets" => "integer_eu"
               "total_out_packets" => "integer_eu"
               "total_dropped_packets" => "integer_eu"
               "total_in_bytes" => "integer_eu"
               "total_out_bytes" => "integer_eu"
               "total_dropped_bytes" => "integer_eu"
               "invalid_pkts_in_packets" => "integer_eu"
               "invalid_pkts_out_packets" => "integer_eu"
               "invalid_pkts_dropped_packets" => "integer_eu"
               "invalid_pkts_in_bytes" => "integer_eu"
               "invalid_pkts_out_bytes" => "integer_eu"
               "invalid_pkts_dropped_bytes" => "integer_eu"
               "acl_in_packets" => "integer_eu"
               "acl_out_packets" => "integer_eu"
               "acl_dropped_packets" => "integer_eu"
               "acl_in_bytes" => "integer_eu"
               "acl_out_bytes" => "integer_eu"
               "acl_dropped_bytes" => "integer_eu"
               "zombie_detection_in_packets" => "integer_eu"
               "zombie_detection_out_packets" => "integer_eu"
               "zombie_detection_dropped_packets" => "integer_eu"
               "zombie_detection_in_bytes" => "integer_eu"
               "zombie_detection_out_bytes" => "integer_eu"
               "zombie_detection_dropped_bytes" => "integer_eu"
               "geo_ip_in_packets" => "integer_eu"
               "geo_ip_out_packets" => "integer_eu"
               "geo_ip_dropped_packets" => "integer_eu"
               "geo_ip_in_bytes" => "integer_eu"
               "geo_ip_out_bytes" => "integer_eu"
               "geo_ip_dropped_bytes" => "integer_eu"
               "tcpsyn_in_packets" => "integer_eu"
               "tcpsyn_out_packets" => "integer_eu"
               "tcpsyn_dropped_packets" => "integer_eu"
               "tcpsyn_in_bytes" => "integer_eu"
               "tcpsyn_out_bytes" => "integer_eu"
               "tcpsyn_dropped_bytes" => "integer_eu"
               "payload_in_packets" => "integer_eu"
               "payload_out_packets" => "integer_eu"
               "payload_dropped_packets" => "integer_eu"
               "payload_in_bytes" => "integer_eu"
               "payload_out_bytes" => "integer_eu"
               "payload_dropped_bytes" => "integer_eu"
               "http_mlfd_in_packets" => "integer_eu"
               "http_mlfd_out_packets" => "integer_eu"
               "http_mlfd_dropped_packets" => "integer_eu"
               "http_mlfd_in_bytes" => "integer_eu"
               "http_mlfd_out_bytes" => "integer_eu"
               "http_mlfd_dropped_bytes" => "integer_eu"
               "dns_mlfd_in_packets" => "integer_eu"
               "dns_mlfd_out_packets" => "integer_eu"
               "dns_mlfd_dropped_packets" => "integer_eu"
               "dns_mlfd_in_bytes" => "integer_eu"
               "dns_mlfd_out_bytes" => "integer_eu"
               "dns_mlfd_dropped_bytes" => "integer_eu"
               "traffic_management_in_packets" => "integer_eu"
               "traffic_management_out_packets" => "integer_eu"
               "traffic_management_dropped_packets" => "integer_eu"
               "traffic_management_in_bytes" => "integer_eu"
               "traffic_management_out_bytes" => "integer_eu"
               "traffic_management_dropped_bytes" => "integer_eu"   
  }
}
}
output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "typestats-%{+YYYY-MM-dd_HH_mm}"
  }
}

I'm trying to index .csv files into Elasticsearch via Logstash.

It sounds like you have not increased the open file limit for the process, so I would look into that. Also, you might want to significantly lower the value of close_older on the file input, and also consider switching to read mode instead of tail mode.
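(For reference, a read-mode variant of the input above might look like this — a sketch, where `file_completed_action => "log"` and its log path are assumptions to adapt. In read mode each file is closed once it has been read to EOF, so `close_older` mainly matters in tail mode.)

```
input {
  file {
    path => ["/home/stats/archive/*.csv"]
    mode => "read"
    # each file is read to EOF and then closed, so far fewer
    # descriptors stay open at once than in tail mode
    file_completed_action => "log"
    file_completed_log_path => "/var/log/logstash/completed.log"
  }
}
```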

Hi, sorry for the late reply, and thanks for the heads-up! I made the changes you suggested:

  • switching to read mode
  • increasing the max open files limit (10,000)
  • decreasing close_older (3 minutes)

This setup worked fine for 4-5 hours, but then Logstash stopped processing the incoming data. While there were 288,000 files and counting, Logstash processed only 165,900 of them and got stuck there. These are the debug logs:

Is there anything I'm missing out?

In tail mode I can see why you would want to tail a large number of files, but in read mode Logstash is going to open a file, read the whole file, and process the whole file through the pipeline. That's basically going to keep 1 CPU busy. I would not set the open file limit much above the number of CPUs that you have. If you increase it further, if anything I would expect throughput to get worse.

Are you deleting files and then adding more files? That can result in inode reuse.
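To make the re-use mechanism concrete, here is a minimal shell sketch (the filenames are hypothetical, and whether the inode number actually repeats depends on the filesystem):

```shell
# Create a file, note its inode, delete it, create another file.
# On many filesystems the freed inode is handed straight to the new
# file. The file input's sincedb keys on the inode, so the old read
# offset would then be applied to the brand-new file, which makes it
# look "already read".
dir=$(mktemp -d)
touch "$dir/first.csv"
stat -c 'first.csv  inode=%i' "$dir/first.csv"
rm "$dir/first.csv"
touch "$dir/second.csv"
stat -c 'second.csv inode=%i' "$dir/second.csv"
rm -r "$dir"
```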

There are still plenty of unused inodes.
Does inode reuse only happen when files are deleted/removed?

Actually, I'm not deleting any files as of now.

Then the next step would be '--log.level trace' (not debug) which will cause the filewatch code to log voluminously. There will be millions of lines of logs, but you just need to find the lines related to one file that does not get read.
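(For reference, besides the `--log.level trace` command-line flag, the same level can be set persistently in `logstash.yml`:)

```yaml
# config/logstash.yml — persistent alternative to the --log.level flag
log.level: trace
```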

Will try that too and let you know how it works out. Thank you for your valuable time :smiley::smiley:

I tried debugging the issue by setting the log level to "trace".



Is there any issue with the SincedbCollection?

The logs are being shipped with a significant delay. The first 0.15 million (150,000) log files shipped without any issue, but everything after that is taking much longer.

In total, 5 million files were created over 3 days, but only 0.7 million (700,000) have shipped so far. Logstash didn't stop, but it processed them with considerable delay.

The fact that you are seeing bytes_read = bytes_unread = 0 with a non-zero file size definitely suggests inode re-use to me.

Pick an inode, such as 1812511751, and search for every reference to it. I suspect you will find it being used for two different files.
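A quick way to do that search (the log path assumes a default RPM install; adjust to your setup):

```shell
# list every trace line mentioning the inode, with line numbers;
# two different file paths next to the same inode confirm re-use
grep -n '1812511751' /var/log/logstash/logstash-plain.log
```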

If it's inode re-use, what is the suggested configuration?

I do not have a good answer for that.

Thanks anyway!