Losing logs in ELK

  • Setup: Kubernetes cluster with filebeat-7.12.0 -> logstash-7.9. -> elasticsearch-7.9.1 -> kibana-7.9.1

  • Problem: we often lose logs, sometimes massively: we stop seeing logs from one or more pods (a fact confirmed by other logging/alerting mechanisms), and we don't know where or how. There is no particular event that we can relate to logs from some k8s pods no longer appearing in ES; it's not tied to a particular server, a particular type of pod, or anything else we can isolate or reproduce.

  • Settings and resource utilization:

    • server-side: 3 ES nodes (in all master/ingest/data roles) and 3 Logstash pods on 3 dedicated servers (4 CPU, 16 GB RAM).
    • ES JVM: "-Xmx5g -Xms5g". Memory limit at 9Gi and no limit on CPU.
    • ES: CPU usage around 30%, disk usage less than 70% of a total 3.5Tx3 disks. Documents: ~3, 6, 9 billion. Heap usage between 2.3 - 3.7 GiB. "active_primary_shards" : 151, "active_shards" : 426.
    • Logstash: uses 0.02-0.4 CPU and 1 - 1.5 GB and its limits are 1CPU and 2GB.
    • Client-side: filebeat with no resource limits running on 20-25 servers and 2-10 pods on each host. No capacity issues detected (k8s CPU throttling, OOM, restarts etc). Each filebeat uses about 0.15 CPUs and ~ 330 MiB.
    • We generate about 30M logs per day.
  • Nothing I can see indicates a direct capacity/resource issue in ES (eg GET /_nodes/hot_threads etc) or in any other component.

  • I don't know if the issue is in filebeat not sending logs, logstash dropping them or somehow in ES. Looking at error logs:

    • Filebeat shows about 200 - 500 errors a day of these types:

      Harvester could not be started on new file: ... Err: error setting up harvester: Harvester setup failed. Unexpected file opening error: file info is not identical with opened file. Aborting harvesting and retrying file later again
      Harvester could not be started on new file: ... Err: error setting up harvester: Harvester setup failed. Unexpected file opening error: failed to stat source file
      [logstash]	logstash/async.go:280	Failed to publish events caused by: client is not connected
    • Logstash only shows one type of error, about expecting the log to be JSON but getting a concrete value; this is one application sending a bad log. Incidentally, this is hard to debug because logstash doesn't tell you what the source log is (not even the filebeat instance), so there's no way to know from this error where the issue is coming from. In any case, it's unrelated to the missing logs.

    • ES errors: they are about search errors.

  • I've searched for all the errors and I haven't found anything actionable or anything that matches our situation. A weak hypothesis is that if we get a lot of quick log rotations, we may lose some logs because k8s moves rotated logs out of reach of filebeat. Still, this wouldn't explain a massive number of logs going missing.

  • As for Logstash dropping logs, I'm not sure how to confirm it one way or the other; I don't have errors about that, and querying the stats I get almost the same number of events in/filtered/out. If it were dropping logs, I'd expect the numbers for in and out to differ.
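To make that in/out comparison repeatable, here is a minimal Python sketch. It assumes the Logstash node stats API (`GET /_node/stats/events`, port 9600 by default) is reachable and that its response contains an `events` object with `in`, `filtered` and `out` counters. The function names and the sample numbers are made up for illustration. A small gap at any single instant can simply be events still in flight, so the sketch compares two snapshots and looks at whether the gap keeps growing.

```python
def event_gap(stats: dict) -> int:
    """Events received but not (yet) emitted, from one stats snapshot."""
    events = stats["events"]
    return events["in"] - events["out"]


def gap_growth(earlier: dict, later: dict) -> int:
    """How much the in/out gap grew between two snapshots.

    A value that stays near zero suggests Logstash is emitting what it
    receives; a value that keeps growing would be worth investigating.
    """
    return event_gap(later) - event_gap(earlier)


# Hypothetical snapshots, shaped like the "events" section of the
# node stats response (the numbers are invented for this example).
snapshot_1 = {"events": {"in": 1000, "filtered": 998, "out": 998}}
snapshot_2 = {"events": {"in": 5000, "filtered": 4997, "out": 4997}}

print(event_gap(snapshot_1))               # gap in the first snapshot
print(gap_growth(snapshot_1, snapshot_2))  # growth between snapshots
```

In practice the two dictionaries would come from fetching the stats endpoint on each Logstash pod a few minutes apart.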

These seem pretty relevant.
Can you enable debug on one Filebeat pod and provide more logging?

Thanks (and sorry probably this would be more for the filebeat forum), I can try putting one filebeat in debug mode, although in K8s this is a bit of a PITA since they are DaemonSets (the same filebeat config running in each node), so I'll have to do a workaround like using taints in one node so that filebeat DaemonSet doesn't put a pod in it and separately and manually putting a filebeat pod in this host with debugging enabled.

I do have the filebeat pods with the stats endpoint enabled, so I can manually look at metrics, for example the harvester metrics in one pod:

kubectl exec filebeat-filebeat-fslqv -- curl -s -XGET 'localhost:5066/stats?pretty' | jq .filebeat.harvester
{
  "closed": 3,
  "open_files": 10,
  "running": 10,
  "skipped": 0,
  "started": 13
}

What are the guidelines for those metrics to be "good"? I imagine something like:

open_files = running
running + closed = started

Or what other metrics can I focus on that would indicate a problem (specifically with missing files/logs)? Maybe registrar.write.fail > 0, pipeline.events.failed > 0, or output.read|write.errors > 0?
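For reference, this is roughly how those checks could be scripted once the stats JSON from a pod has been loaded into a Python dictionary. It is only a sketch: the two harvester relations are my own guesses from above, not documented guidelines, and the counter paths are passed in as parameters because I'm not certain of the exact metric names or where they sit in the stats tree. The function names, the `libbeat.*` paths and the sample counter values are assumptions for illustration; the harvester numbers are the ones shown above.

```python
def harvester_mismatches(harvester: dict) -> list:
    """Return the guessed harvester relations that do NOT hold."""
    problems = []
    if harvester["open_files"] != harvester["running"]:
        problems.append("open_files != running")
    if harvester["running"] + harvester["closed"] != harvester["started"]:
        problems.append("running + closed != started")
    return problems


def lookup(stats: dict, dotted_path: str):
    """Follow a dotted path such as 'a.b.c'; return None if it is absent."""
    node = stats
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def nonzero_counters(stats: dict, paths: list) -> dict:
    """Return the counters (by path) that exist and are greater than zero."""
    found = {}
    for path in paths:
        value = lookup(stats, path)
        if isinstance(value, (int, float)) and value > 0:
            found[path] = value
    return found


# Harvester values from the pod above; the "libbeat" part is a
# hypothetical layout with invented values.
stats = {
    "filebeat": {"harvester": {"closed": 3, "open_files": 10,
                               "running": 10, "skipped": 0, "started": 13}},
    "libbeat": {"output": {"read": {"errors": 0}, "write": {"errors": 2}}},
}
suspect_paths = ["libbeat.output.read.errors",
                 "libbeat.output.write.errors",
                 "libbeat.pipeline.events.failed"]  # assumed names

print(harvester_mismatches(stats["filebeat"]["harvester"]))
print(nonzero_counters(stats, suspect_paths))
```

Paths that don't exist in the stats are silently skipped, so a wrong metric name shows up as a missing entry rather than an error.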