Filebeat losing logs

Hello, see this question for context.

I'm running filebeat 7.13.1 in Kubernetes (GKE) with this running configuration:

kubectl exec filebeat-filebeat-2rmhh -- filebeat export config
filebeat:
  inputs:
  - paths:
    - /var/log/containers/*.log*
    processors:
    - add_kubernetes_metadata:
        host: ${NODE_NAME}
        matchers:
        - logs_path:
            logs_path: /var/log/containers/
    - add_fields:
        fields:
          kubernetes:
            cluster: ${DEPLOYMENT}
        target: ""
    - drop_event:
        when:
          regexp:
            kubernetes:
              container:
                name: node-exporter|metadata-agent|discovery|metrics-server-nanny|sidecar-injector-webhook|mixer|metrics-server|prometheus|prometheus-operator|gke-metrics-agent
    - decode_json_fields:
        add_error_key: true
        fields:
        - message
        max_depth: 1
        overwrite_keys: true
        target: log
    type: container
output:
  logstash:
    hosts:
    - 172.30.0.10:5044
path:
  config: /usr/share/filebeat
  data: /usr/share/filebeat/data
  home: /usr/share/filebeat
  logs: /usr/share/filebeat/logs
setup:
  ilm:
    enabled: false
  template:
    enabled: true
    name: filebeat-${DEPLOYMENT}-%{[agent.version]}
    overwrite: true
    pattern: filebeat-${DEPLOYMENT}-%{[agent.version]}-*
    settings:
      index:
        max_docvalue_fields_search: 200
        number_of_replicas: 1

Everyday I get hundreds of Harvester errors. Today over 400 for ex. The most common error is:

ERROR	log/input.go:519	Harvester could not be started on new file: /var/log/containers/<container_name>.log, Err: error setting up harvester: Harvester setup failed. Unexpected file opening error: file info is not identical with opened file. Aborting harvesting and retrying file later again

Where <container_name> is the name of a container that is busy in terms of activity and log generation.

I checked that the file is accesible from filebeat, I get logs with:

kubectl exec filebeat-filebeat-srwk5 -- tail /var/log/containers/$log1.log

I don't know if those errors is the cause of missing logs.

There's no resource issue I can see in filebeat; CPU and memory usage are low and well below its k8s limits. There are no restarting events, no OOMs etc.

fluent-bit is also running in the nodes and it doesn't have this type of problem, so I'm looking into ditching filebeat and just using fluent bit or fluentd, this is one last attempt to try and make it work with filebeat.

The stats from one of the filebeats with errors (there are more than one):

kubectl exec filebeat-filebeat-srwk5 -- curl -s -XGET 'localhost:5066/stats?pretty'
{
  "beat": {
    "cgroup": {
      "cpu": {
        "cfs": {
          "period": {
            "us": 100000
          },
          "quota": {
            "us": 100000
          }
        },
        "id": "4f3d4a0e7fa51558a92d9ff8fa6787b6167dfc2c495c9eb65ba87558eb7136df",
        "stats": {
          "periods": 23264038,
          "throttled": {
            "ns": 221424133666942,
            "periods": 974278
          }
        }
      },
      "cpuacct": {
        "id": "4f3d4a0e7fa51558a92d9ff8fa6787b6167dfc2c495c9eb65ba87558eb7136df",
        "total": {
          "ns": 197977587272386
        }
      },
      "memory": {
        "id": "4f3d4a0e7fa51558a92d9ff8fa6787b6167dfc2c495c9eb65ba87558eb7136df",
        "mem": {
          "limit": {
            "bytes": 1073741824
          },
          "usage": {
            "bytes": 424185856
          }
        }
      }
    },
    "cpu": {
      "system": {
        "ticks": 29967790,
        "time": {
          "ms": 29967808
        }
      },
      "total": {
        "ticks": 134287140,
        "time": {
          "ms": 134287161
        },
        "value": 134287140
      },
      "user": {
        "ticks": 104319350,
        "time": {
          "ms": 104319353
        }
      }
    },
    "handles": {
      "limit": {
        "hard": 1048576,
        "soft": 1048576
      },
      "open": 2373
    },
    "info": {
      "ephemeral_id": "66cf83aa-b773-4654-9c46-319fb7f268b8",
      "uptime": {
        "ms": 3183263722
      }
    },
    "memstats": {
      "gc_next": 280547104,
      "memory_alloc": 240890288,
      "memory_sys": 570009656,
      "memory_total": 10514268862096,
      "rss": 432005120
    },
    "runtime": {
      "goroutines": 11857
    }
  },
  "filebeat": {
    "events": {
      "active": 6480,
      "added": 308484099,
      "done": 308477619
    },
    "harvester": {
      "closed": 38603,
      "open_files": 2360,
      "running": 2359,
      "skipped": 0,
      "started": 40962
    },
    "input": {
      "log": {
        "files": {
          "renamed": 0,
          "truncated": 2220
        }
      },
      "netflow": {
        "flows": 0,
        "packets": {
          "dropped": 0,
          "received": 0
        }
      }
    }
  },
  "libbeat": {
    "config": {
      "module": {
        "running": 0,
        "starts": 0,
        "stops": 0
      },
      "reloads": 0,
      "scans": 0
    },
    "output": {
      "events": {
        "acked": 307428984,
        "active": 4096,
        "batches": 239937,
        "dropped": 0,
        "duplicates": 0,
        "failed": 34329,
        "toomany": 0,
        "total": 307467409
      },
      "read": {
        "bytes": 4609260,
        "errors": 6
      },
      "type": "logstash",
      "write": {
        "bytes": 96881662917,
        "errors": 0
      }
    },
    "pipeline": {
      "clients": 1,
      "events": {
        "active": 4117,
        "dropped": 0,
        "failed": 0,
        "filtered": 1048640,
        "published": 307433100,
        "retry": 47306,
        "total": 308481741
      },
      "queue": {
        "acked": 307428984
      }
    }
  },
  "registrar": {
    "states": {
      "cleanup": 27228,
      "current": 2753,
      "update": 308477619
    },
    "writes": {
      "fail": 0,
      "success": 239970,
      "total": 239970
    }
  },
  "system": {
    "cpu": {
      "cores": 8
    },
    "load": {
      "1": 3.34,
      "15": 2.1,
      "5": 2.47,
      "norm": {
        "1": 0.4175,
        "15": 0.2625,
        "5": 0.3088
      }
    }
  }
}

I see in those metrics some possible issues but still don't know what changes to introduce.

I looked into rotating files less frequently (by waiting fo them to be bigger) but it's not easy in K8s and especially in a managed solution. For completeness, these are the settings at the host level:

# ps -ef | grep dockerd | tr ' ' '\n' | grep log
--log-level=warn
--log-driver=json-file
--log-opt=max-size=10m
--log-opt=max-file=5

Putting a filebeat in debug mode is not straight-forward in K8s where they are deployed as DaemonSet (they are all the same); I'd have to make a node special, taint it for filebeat, deploy a separate filebeat in it, make sure a chatty pod is in it etc.

Thanks

2 Likes