Filebeat - Log Processing Issues/Delay/Data Loss

Hi,

We are currently experiencing significant challenges with log processing on three of our hosts. Each of these hosts runs nine services, generating between 30,000 and 72,000 events per minute per log file. Each host has a 16-core CPU and 62 GB of memory.

We have observed a processing delay of approximately 20 minutes. In addition, when the log files are rotated and renamed at the top of each hour, Filebeat does not finish reading the previous files, which results in the loss of the last 10 to 20 minutes of log data.

In comparison, our other nine hosts (the application uses 3 + 9 hosts in total; data collection for our other applications also works fine), which have only three to five log files configured, are processing data without any issues.

To diagnose the problem, I executed the command GET _cat/thread_pool/bulk?v, which did not show any rejected requests, indicating that the bulk thread pool is functioning as expected.
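Note that in 8.x the bulk thread pool is named write, so the equivalent check would be:

GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected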

A] Filebeat/ELK version: 8.3.1 (12-node Platinum license)

B] Event rate per service (events per minute):

Service1 - 60k events/min
Service2 - 15k events/min
Service3 - 500 events/min
Service4 - 17k events/min
Service5 - 15.5k events/min
Service6 - 40k events/min
Service7 - 100 events/min
Service8 - 78k events/min
Service9 - 160k events/min
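
Taken together, the nine services generate roughly 386k events per minute.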

C] The relevant parts of filebeat.yml:

filebeat.config.modules:
  enabled: true
  path: ${path.config}/modules.d/*.yml
  reload.enabled: true
  reload.period: 10s

queue.mem:
  events: 20000            # tried 8192 as well
  flush.min_events: 2048

filebeat.registry.flush: 5s

output.logstash:
  hosts: [ "10.40.7.40:5045", "10.40.7.41:5044", "10.40.7.42:5045", "10.40.7.40:5044", "10.40.7.41:5045", "10.40.7.42:5044" ]
  loadbalance: true
  workers: 12              # tried 8 as well
  bulk_max_size: 20000     # tried 15000 as well
  flush_interval: 1s       # tried 500ms as well

D] modules.d/app.yml
Note: every hour, the active file is renamed to "app1-2024-10-16-19-1.log", where 19 is the previous hour (see the illustrative glob after the config below).

- module: app-1
  microservice:
    enabled: true
    var.paths:
      - "/path/to/app1.log"
    input:
      scan_frequency: 3s  
      close_renamed: false
      close_inactive: 30m
      ignore_older: 24h
      clean_inactive: 48h
      close_timeout: 2h
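
To make the hourly rename concrete: the config above only points at the live file, and a glob such as the one below would also match the renamed files. This is just an illustrative sketch (it assumes the rotated files stay in the same directory), not what we currently run:

    var.paths:
      - "/path/to/app1*.log"   # matches app1.log as well as app1-2024-10-16-19-1.log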

E] I have three Logstash hosts, each with a 12-core CPU and 23 GB of memory. pipeline.batch.size is set to 2048.
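
For completeness, the relevant Logstash settings look roughly like this (a sketch of logstash.yml; pipeline.workers is shown at its default, which is the number of CPU cores):

pipeline.workers: 12       # defaults to the number of CPU cores
pipeline.batch.size: 2048  # default is 125
pipeline.batch.delay: 50   # ms, default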

F] Sample application log line (a response entry). The application logs the request, the response, and the intermediate steps of each transaction, all in the same format.

2024-06-13 17:50:20:815 [app4] [app-services] [INFO ] [https-jsse-nio-9443-exec-38] [c7dd483c-2461-4ee5-8d67-bd5076129460] [858f223f1ffe4b8c] [858f223f1ffe4b8c] [ResponseLogFilter:100] - TraceId= [c7dd483c-d5076129460] Timestamp= [1718281219341] ClientId= [abcd] AuthMethod= [JOSE] RequestMethod= [POST] ContentType= [application/jose] AcceptType= [application/jose] RequestUri= [/abcd/efg/create] RequestIp= [0.0.0.0] ResponseBody= [{"error_type":"api_validation_error","error_code":"T3","message":"testing test","status":422}] KeyId= [bdbhsbsa-bsbbsbwh8QaCek] Algorithm= [null] ServerAuthorization= [null] Status= [422] ResponseDate= [2024-06-13T17:50:20+0530] ErrorType= [api_validation_error] ErrorCode= [T3] ErrorMessage= [Testing Testing] Latency= [27]

G] Filebeat log (a 30-second monitoring metrics sample)

{"log.level":"info","@timestamp":"2024-10-16T20:23:08.455+0530","log.logger":"monitoring","log.origin":{"file.name":"log/log.go","file.line":185},"message":"Non-zero metrics in the last 30s","service.name":"filebeat","monitoring":{"metrics":{"beat":{"cgroup":{"cpuacct":{"total":{"ns":299461410874}},"memory":{"mem":{"usage":{"bytes":13615104}}}},"cpu":{"system":{"ticks":1140,"time":{"ms":360}},"total":{"ticks":20510,"time":{"ms":7210},"value":0},"user":{"ticks":19370,"time":{"ms":6850}}},"info":{"ephemeral_id":"9dadf2d8-7ad2-43c5-9726-5e33fbb455c8","uptime":{"ms":90114},"version":"8.3.1"},"memstats":{"gc_next":215038376,"memory_alloc":157361792,"memory_sys":4194304,"memory_total":3315695936,"rss":335130624},"runtime":{"goroutines":116}},"filebeat":{"events":{"active":-4,"added":110716,"done":110720},"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":9},"scans":3},"output":{"events":{"acked":102528,"active":12288,"batches":55,"total":110720},"read":{"bytes":336},"write":{"bytes":22499231}},"pipeline":{"clients":9,"events":{"active":20025,"published":110720,"total":110716},"queue":{"acked":110720}}},"registrar":{"states":{"current":0}},"system":{"load":{"1":13.72,"15":15.51,"5":14.25,"norm":{"1":0.8575,"15":0.9694,"5":0.8906}}}},"ecs.version":"1.6.0"}}

I would appreciate any suggestions or recommendations you may have to help resolve these issues and improve log processing on the affected hosts.


Thanks.