Data loss prevention?

Hi,

Unfortunately I am having issues with my platform. This means that sometimes my Beats cannot send data to my Logstash/Elasticsearch anymore. They keep trying, of course, but this can take an hour or even two. We are searching for the cause of this, but in the meantime we also have another issue.

If you look at this:

[screenshot: number of reporting servers per minute]

you see the number of reporting servers every minute. It's an easy way for me to see whether all hosts are sending data. You can also see a big gap.

Now I am wondering: where does my data go when Filebeat cannot send (I see errors in the log)? I thought that after reconnecting, Filebeat would send the data anyway, but it seems like I keep having that gap.
Here's the Filebeat config I am using:

filebeat.inputs:
- type: log
  enabled: true

  paths:
    - /var/log/server/server.log

  exclude_files: ['\.gz$']

  multiline.pattern: '^ts:'
  multiline.negate: true
  multiline.match: after

  tags: [ "api-log", "apigateway", "asd"]

  ignore_older: 6h
  close_inactive: 5m
  close_removed: true
  clean_removed: true
  clean_inactive: 12h
  scan_frequency: 30s
  harvester_limit: 0

filebeat.config.modules:
  enabled: false

processors:
  - drop_fields:
      fields: ["host"]

fields:
  environment: production

queue.mem:
  events: 4096   # internal queue buffers at most 4096 events

output.logstash:
  enabled: true
  hosts: ["server1:5044","server2:5044","server3:5044","server4:5044"]

  loadbalance: true
  timeout: 1m
  slow_start: true
  worker: 4
  bulk_max_size: 4096

logging:
  level: info
  to_files: true
  to_syslog: false
  files:
    path: '/var/log/filebeat'
    name: 'filebeat'
    keepfiles: 3
    permissions: '0644'
  metrics:
    enabled: false

Does anyone know what I am doing wrong?

I'd suggest you read: https://www.elastic.co/guide/en/beats/filebeat/current/configuring-internal-queue.html. Your configuration only stores the last 4096 events before Filebeat starts dropping events that it can't send. This is probably why you are seeing event/data loss.
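For example, a minimal sketch of a larger memory queue (the numbers are placeholders; size them to your event rate and available RAM):

queue.mem:
  events: 65536            # max events buffered in memory (your config uses 4096)
  flush.min_events: 2048   # minimum batch size forwarded to the output
  flush.timeout: 1s        # send a smaller batch if this much time passes first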

What errors?

failed to publish because of connection reset

Uhm... I cannot set this to millions, I guess?

If you want high retention, I'd suggest using the disk queue instead of the memory queue, as it will allow for greater local data retention. (It's in beta, so it's subject to change.)
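A minimal sketch, assuming a Filebeat version where the disk queue is available (the size is a placeholder to adapt to your disks, and the path shown is the default location):

queue.disk:
  max_size: 10GB                   # upper bound on disk space the queue may use
  path: "${path.data}/diskqueue"   # where queue segments are stored (default)

Note that you configure either queue.mem or queue.disk, not both, so this would replace your current queue.mem block.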

Ehmm... but I guess this will be much, much slower. So if I want to have a disk queue AND performance, I suppose I should use a RAM disk?

I personally try to avoid RAM disks, as they can be weird in production environments. Generally, unless you're generating hundreds or thousands of events per second, even an HDD should be sufficient for the disk queue without much negative impact. (I've never done any benchmarking, just going off of past experience.)
