Filebeat losing logs

devopscanada · June 9, 2021, 9:38pm

I'm running filebeat 7.13.1 in Kubernetes (GKE) with this running configuration:

kubectl exec filebeat-filebeat-2rmhh -- filebeat export config
filebeat:
  inputs:
  - paths:
    - /var/log/containers/*.log*
    processors:
    - add_kubernetes_metadata:
        host: ${NODE_NAME}
        matchers:
        - logs_path:
            logs_path: /var/log/containers/
    - add_fields:
        fields:
          kubernetes:
            cluster: ${DEPLOYMENT}
        target: ""
    - drop_event:
        when:
          regexp:
            kubernetes:
              container:
                name: node-exporter|metadata-agent|discovery|metrics-server-nanny|sidecar-injector-webhook|mixer|metrics-server|prometheus|prometheus-operator|gke-metrics-agent
    - decode_json_fields:
        add_error_key: true
        fields:
        - message
        max_depth: 1
        overwrite_keys: true
        target: log
    type: container
output:
  logstash:
    hosts:
    - 172.30.0.10:5044
path:
  config: /usr/share/filebeat
  data: /usr/share/filebeat/data
  home: /usr/share/filebeat
  logs: /usr/share/filebeat/logs
setup:
  ilm:
    enabled: false
  template:
    enabled: true
    name: filebeat-${DEPLOYMENT}-%{[agent.version]}
    overwrite: true
    pattern: filebeat-${DEPLOYMENT}-%{[agent.version]}-*
    settings:
      index:
        max_docvalue_fields_search: 200
        number_of_replicas: 1

Everyday I get hundreds of Harvester errors. Today over 400 for ex. The most common error is:

ERROR	log/input.go:519	Harvester could not be started on new file: /var/log/containers/<container_name>.log, Err: error setting up harvester: Harvester setup failed. Unexpected file opening error: file info is not identical with opened file. Aborting harvesting and retrying file later again

Where <container_name> is the name of a container that is busy in terms of activity and log generation.

I checked that the file is accesible from filebeat, I get logs with:

kubectl exec filebeat-filebeat-srwk5 -- tail /var/log/containers/$log1.log

I don't know if those errors is the cause of missing logs.

There's no resource issue I can see in filebeat; CPU and memory usage are low and well below its k8s limits. There are no restarting events, no OOMs etc.

fluent-bit is also running in the nodes and it doesn't have this type of problem, so I'm looking into ditching filebeat and just using fluent bit or fluentd, this is one last attempt to try and make it work with filebeat.

The stats from one of the filebeats with errors (there are more than one):

kubectl exec filebeat-filebeat-srwk5 -- curl -s -XGET 'localhost:5066/stats?pretty'
{
  "beat": {
    "cgroup": {
      "cpu": {
        "cfs": {
          "period": {
            "us": 100000
          },
          "quota": {
            "us": 100000
          }
        },
        "id": "4f3d4a0e7fa51558a92d9ff8fa6787b6167dfc2c495c9eb65ba87558eb7136df",
        "stats": {
          "periods": 23264038,
          "throttled": {
            "ns": 221424133666942,
            "periods": 974278
          }
        }
      },
      "cpuacct": {
        "id": "4f3d4a0e7fa51558a92d9ff8fa6787b6167dfc2c495c9eb65ba87558eb7136df",
        "total": {
          "ns": 197977587272386
        }
      },
      "memory": {
        "id": "4f3d4a0e7fa51558a92d9ff8fa6787b6167dfc2c495c9eb65ba87558eb7136df",
        "mem": {
          "limit": {
            "bytes": 1073741824
          },
          "usage": {
            "bytes": 424185856
          }
        }
      }
    },
    "cpu": {
      "system": {
        "ticks": 29967790,
        "time": {
          "ms": 29967808
        }
      },
      "total": {
        "ticks": 134287140,
        "time": {
          "ms": 134287161
        },
        "value": 134287140
      },
      "user": {
        "ticks": 104319350,
        "time": {
          "ms": 104319353
        }
      }
    },
    "handles": {
      "limit": {
        "hard": 1048576,
        "soft": 1048576
      },
      "open": 2373
    },
    "info": {
      "ephemeral_id": "66cf83aa-b773-4654-9c46-319fb7f268b8",
      "uptime": {
        "ms": 3183263722
      }
    },
    "memstats": {
      "gc_next": 280547104,
      "memory_alloc": 240890288,
      "memory_sys": 570009656,
      "memory_total": 10514268862096,
      "rss": 432005120
    },
    "runtime": {
      "goroutines": 11857
    }
  },
  "filebeat": {
    "events": {
      "active": 6480,
      "added": 308484099,
      "done": 308477619
    },
    "harvester": {
      "closed": 38603,
      "open_files": 2360,
      "running": 2359,
      "skipped": 0,
      "started": 40962
    },
    "input": {
      "log": {
        "files": {
          "renamed": 0,
          "truncated": 2220
        }
      },
      "netflow": {
        "flows": 0,
        "packets": {
          "dropped": 0,
          "received": 0
        }
      }
    }
  },
  "libbeat": {
    "config": {
      "module": {
        "running": 0,
        "starts": 0,
        "stops": 0
      },
      "reloads": 0,
      "scans": 0
    },
    "output": {
      "events": {
        "acked": 307428984,
        "active": 4096,
        "batches": 239937,
        "dropped": 0,
        "duplicates": 0,
        "failed": 34329,
        "toomany": 0,
        "total": 307467409
      },
      "read": {
        "bytes": 4609260,
        "errors": 6
      },
      "type": "logstash",
      "write": {
        "bytes": 96881662917,
        "errors": 0
      }
    },
    "pipeline": {
      "clients": 1,
      "events": {
        "active": 4117,
        "dropped": 0,
        "failed": 0,
        "filtered": 1048640,
        "published": 307433100,
        "retry": 47306,
        "total": 308481741
      },
      "queue": {
        "acked": 307428984
      }
    }
  },
  "registrar": {
    "states": {
      "cleanup": 27228,
      "current": 2753,
      "update": 308477619
    },
    "writes": {
      "fail": 0,
      "success": 239970,
      "total": 239970
    }
  },
  "system": {
    "cpu": {
      "cores": 8
    },
    "load": {
      "1": 3.34,
      "15": 2.1,
      "5": 2.47,
      "norm": {
        "1": 0.4175,
        "15": 0.2625,
        "5": 0.3088
      }
    }
  }
}

I see in those metrics some possible issues but still don't know what changes to introduce.

I looked into rotating files less frequently (by waiting fo them to be bigger) but it's not easy in K8s and especially in a managed solution. For completeness, these are the settings at the host level:

# ps -ef | grep dockerd | tr ' ' '\n' | grep log
--log-level=warn
--log-driver=json-file
--log-opt=max-size=10m
--log-opt=max-file=5

Putting a filebeat in debug mode is not straight-forward in K8s where they are deployed as DaemonSet (they are all the same); I'd have to make a node special, taint it for filebeat, deploy a separate filebeat in it, make sure a chatty pod is in it etc.

Thanks

system · July 7, 2021, 11:39pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Filebeat adding incorrect kubernetes metadata to events going to logstash Beats filebeat	5	814	April 2, 2024
Filebeat unable obtain kubernetes metadata Beats docker , filebeat	2	896	September 23, 2020
Kubernetes - Filebeat stops sending/picking up logs Beats filebeat	17	6458	June 26, 2018
Filebeat makes too many API calls choking and bringing down the Kubernetes cluster Beats filebeat	3	575	February 8, 2022
Filebeats Getting Logs from Container in K8s Beats	3	1299	June 6, 2019

Filebeat losing logs

Related topics