Filebeat stops processing s3 input with no error

Using Filebeat 7.4.2 and stack monitoring, we can see the throughput of Filebeat drop to zero, and the only abnormal log message is:

2019-12-20T03:47:31.711Z INFO [s3] s3/input.go:293 Message visibility timeout updated to 300

2019-12-20T03:47:39.581Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":5770,"time":{"ms":16}},"total":{"ticks":108770,"time":{"ms":28},"value":108770},"user":{"ticks":103000,"time":{"ms":12}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":8},"info":{"ephemeral_id":"e868953c-5e67-4695-ac95-0ffd16ed8822","uptime":{"ms":8280035}},"memstats":{"gc_next":42940960,"memory_alloc":22161072,"memory_total":16862697496},"runtime":{"goroutines":40}},"filebeat":{"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":0}},"pipeline":{"clients":1,"events":{"active":0}}},"registrar":{"states":{"current":0}},"system":{"load":{"1":0.28,"15":0.84,"5":0.63,"norm":{"1":0.14,"15":0.42,"5":0.315}}}}}}

2019-12-20T03:48:09.581Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":5780,"time":{"ms":8}},"total":{"ticks":108780,"time":{"ms":12},"value":108780},"user":{"ticks":103000,"time":{"ms":4}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":7},"info":{"ephemeral_id":"e868953c-5e67-4695-ac95-0ffd16ed8822","uptime":{"ms":8310035}},"memstats":{"gc_next":42940960,"memory_alloc":23817144,"memory_total":16864353568},"runtime":{"goroutines":38}},"filebeat":{"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":0}},"pipeline":{"clients":1,"events":{"active":0}}},"registrar":{"states":{"current":0}},"system":{"load":{"1":0.17,"15":0.81,"5":0.57,"norm":{"1":0.085,"15":0.405,"5":0.285}}}}}}

The "Message visibility timeout updated to 300" log line does not appear in our other Filebeat instances that are correctly processing the same S3 bucket. From what I can tell this happens at random, and a simple restart of the process gets it processing events again.

Configuration:

    filebeat.inputs:
    - type: s3
      queue_url: https://sqs.us-east-1.amazonaws.com/xxxxxxx/queue
      processors:
      - decode_json_fields:
          fields: ["message"]
          process_array: true
          max_depth: 1
          target: ""
          overwrite_keys: true
      - rename:
          fields:
          - from: "log_level"
            to: "level"
          ignore_missing: true
    processors:
    - drop_fields:
        fields: ["agent.ephemeral_id", "agent.hostname", "agent.id", "agent.type", "agent.version", "host.name", "ecs.version", "input.type", "log.file.path", "log.offset", "stream", "suricata.eve.timestamp", "time", "timestamp", "kubernetes.container.image", "kubernetes.labels.pod-template-hash", "container.labels", "container.name", "container.image", "kubernetes.labels.ci_branch", "kubernetes.labels.ci_build_number", "kubernetes.labels.ci_dep_build_number", "kubernetes.labels.ci_dep_user", "kubernetes.labels.ci_user", "kubernetes.labels.generator", "kubernetes.labels.updated", "aws.s3.bucket.arn", "log.file.path"]
        ignore_missing: true
    cloud.id: "xxxxxxxxxx"
    cloud.auth: "xxxxxxxxx"
    output.elasticsearch:
      compression_level: 5
      worker: 10
      bulk_max_size: 4096
    monitoring:
      enabled: true
      cluster_uuid: xxxxxx
      elasticsearch:
        hosts: ["https://xxxxxxxxx.us-east-1.aws.found.io:9243"] 
        username: xxxxx
        password: xxxxx

Edit: I was able to find one instance that finally did output an error:

2019-12-20T12:03:09.533Z ERROR [s3] s3/input.go:291 change message visibility failed: InvalidParameterValue: Value AQEBK/v0lshsHIKTA2NI3HPr2EgFD8LQ73QAIalGrhp8q+l2LGJBxjbLivyVuytp8hbdXn+hXGp+b7WeBQEZiiKF3arYK/CToGiDN+ZgDBosW+YJhHetQu8pU4/rSkhpunsP1ywG1RvZ4O8IoKhqqjOHYsPZul2+ntjDNpOf5u16MA6ZbNUY0+ZDaxH6V61XcrYKsrUuIGBTMqp3EmRkROWgacE6XYYNTCuP0m1LfvRsAMHvKAUPu5O4fiyA1Wf+61EGVP6FDOc9+4Zk/rRt/IZh/y4YLMW3l7NyUFdVt6hjwovevHfEzlI9POWshBVZ3brrmHfaq08Lga1/M/auTM4ENImVH9H7cCQr36a/tlqJTemJwG8pCSJNzye8fZ2oO13UL6zGB+BaBDR4YcoNA+HRlrjBDegIBkDPkvqcM6Afk7g= for parameter ReceiptHandle is invalid. Reason: The receipt handle has expired.

status code: 400, request id: 12500646-eb3b-5276-a0a3-503051863927

2019-12-20T12:03:09.533Z INFO [s3] s3/input.go:293 Message visibility timeout updated to 300

2019-12-20T12:03:23.548Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":354180},"total":{"ticks":1403390,"time":{"ms":12},"value":1403390},"user":{"ticks":1049210,"time":{"ms":12}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":7},"info":{"ephemeral_id":"3f765775-ef09-448b-8c66-c6315cee5030","uptime":{"ms":1974180053}},"memstats":{"gc_next":10904960,"memory_alloc":6704520,"memory_total":118881164656,"rss":-81920},"runtime":{"goroutines":22}},"filebeat":{"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":0}},"pipeline":{"clients":1,"events":{"active":0}}},"registrar":{"states":{"current":0}},"system":{"load":{"1":0.14,"15":0.36,"5":0.29,"norm":{"1":0.07,"15":0.18,"5":0.145}}}}}}

2019-12-20T12:03:53.548Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":354180,"time":{"ms":4}},"total":{"ticks":1403400,"time":{"ms":12},"value":1403400},"user":{"ticks":1049220,"time":{"ms":8}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":6},"info":{"ephemeral_id":"3f765775-ef09-448b-8c66-c6315cee5030","uptime":{"ms":1974210054}},"memstats":{"gc_next":10885360,"memory_alloc":5532072,"memory_total":118881547792},"runtime":{"goroutines":20}},"filebeat":{"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":0}},"pipeline":{"clients":1,"events":{"active":0}}},"registrar":{"states":{"current":0}},"system":{"load":{"1":0.3,"15":0.37,"5":0.31,"norm":{"1":0.15,"15":0.185,"5":0.155}}}}}}

2019-12-20T12:04:23.549Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":354180},"total":{"ticks":1403400,"time":{"ms":4},"value":1403400},"user":{"ticks":1049220,"time":{"ms":4}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":6},"info":{"ephemeral_id":"3f765775-ef09-448b-8c66-c6315cee5030","uptime":{"ms":1974240054}},"memstats":{"gc_next":10885360,"memory_alloc":5864024,"memory_total":118881879744},"runtime":{"goroutines":20}},"filebeat":{"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":0}},"pipeline":{"clients":1,"events":{"active":0}}},"registrar":{"states":{"current":0}},"system":{"load":{"1":0.18,"15":0.35,"5":0.28,"norm":{"1":0.09,"15":0.175,"5":0.14}}}}}}

2019-12-20T12:04:53.548Z INFO [monitoring] log/log.go:145 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":354180},"total":{"ticks":1403410,"time":{"ms":4},"value":1403410},"user":{"ticks":1049230,"time":{"ms":4}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":6},"info":{"ephemeral_id":"3f765775-ef09-448b-8c66-c6315cee5030","uptime":{"ms":1974270053}},"memstats":{"gc_next":10885360,"memory_alloc":6078752,"memory_total":118882094472},"runtime":{"goroutines":20}},"filebeat":{"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":0}},"pipeline":{"clients":1,"events":{"active":0}}},"registrar":{"states":{"current":0}},"system":{"load":{"1":0.44,"15":0.37,"5":0.33,"norm":{"1":0.22,"15":0.185,"5":0.165}}}}}}

Thanks for posting your question here!

There are only two conditions under which Filebeat tries to update the visibility timeout:

  1. Message processing failed.
  2. It is taking too long to process the message from SQS (see the config sketch below).
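
For the second case, one thing that might help is raising the input's visibility_timeout so that slow objects get more headroom before the SQS receipt handle expires. This is only a sketch, assuming the visibility_timeout option documented for the s3 input (default 300 seconds), not a confirmed fix for the stall:

    filebeat.inputs:
    - type: s3
      queue_url: https://sqs.us-east-1.amazonaws.com/xxxxxxx/queue
      # Assumed option from the s3 input docs: how long SQS hides a message
      # after Filebeat receives it. 900s is an illustrative value, not a recommendation.
      visibility_timeout: 900s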

I will look into it and try to reproduce. In the meantime, could you by any chance try running it with the latest version of Filebeat (7.5.1)? I believe we had some bug fixes in 7.5 that were not cherry-picked into 7.4. Also, if you can change the logs to debug level, that will help. Thank you!
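
For reference, a minimal logging setup to capture s3-only debug output might look like the snippet below; logging.level and logging.selectors are standard Filebeat settings, and the "s3" selector matches the [s3] logger name in the log lines above:

    # Enable debug logging, restricted to the s3 input so the log stays readable.
    logging.level: debug
    logging.selectors: ["s3"]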

@Kaiyan_Sheng so we cannot update to 7.5.x because of this issue we also opened:

We do have one instance running 7.5.1 that is processing gzipped AWS VPC flow logs fine and has not run into this issue. However, it also has less load than the Filebeat instances that handle our normal logs. So, if we could get the above GitHub issue tackled, we could see whether this issue persists in 7.5.1.

Looking at the release notes for 7.5.0, this appears to be the only relevant ticket:


And from the phrasing it appears to be a hard crash, which I am not seeing in the logs. However, I will also switch to debug level and report back.

I just created a PR to fix the gzip issue. The bug fix related to the log error message I mentioned above is https://github.com/elastic/beats/pull/14113/files#diff-ede3f69796ae7b221405174f187a241dR260
Without this change, the error message won't show up in the log, hmmm.

Do you know which log file is having the issue here? If you can give me an example of what the log file looks like, that would help a lot. Thanks!
