Filebeat lost data

Hi all,

We have experienced a data loss issue with filebeat > logstash.

Our Logstash server was down for 4 hours and some events have not been indexed.

The filebeat configuration is as follows:

scan_frequency: 30s
ignore_older: 10m
close_eof: true
close_removed: true
clean_removed: true
clean_inactive: 15m

The indexed files are read only once, because they are never updated after they are created.

I have identified some files that were not indexed, but I do not see them in the filebeat logs (whereas I can see all the other files there).

What is the exact retry policy of filebeat?

Can our configuration lead to data loss?

Thank you.

Filebeat uses at-least-once delivery semantics. Have you checked the file states in the registry?

A file might not be picked up if it was deleted via log rotation and filebeat was restarted or started afterwards, or if filebeat has closed the file and cannot pick it up anymore due to ignore_older.
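
If the registry file from that period is still available, it is worth checking what it recorded. On 5.x it is a JSON file (by default a file named registry in the filebeat data directory), and each file filebeat currently tracks has an entry roughly like this (Windows example, all values made up):

[
  {
    "source": "D:\\_data\\aaa\\bbb\\example.csv",
    "offset": 10240,
    "timestamp": "2017-06-20T10:00:00+02:00",
    "ttl": 900000000000,
    "FileStateOS": {
      "idxhi": 65536,
      "idxlo": 123456,
      "vol": 1234567890
    }
  }
]

offset is the last position acknowledged by the output, and ttl should reflect clean_inactive (here 15m expressed in nanoseconds). A file with no entry at all was never picked up by a harvester.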

What do you mean by at-least-once semantics?

The "missing" files are not in registry.
It seems that filebeat did not see these files, no trace in log file nor in the registry.

Could it be due to broken communication with logstash?

What do you mean by at-least-once semantics?

That is, on failure (e.g. a missing ACK from LS), filebeat will keep retrying -> the harvesters end up being blocked because the buffers in filebeat fill up.
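
To make the "blocked" part more concrete: in 5.x the harvesters feed an internal spooler, and once the output stops acknowledging, that buffer fills up and reading stops until events can be shipped again. A minimal sketch of the settings involved, with their default values (illustration only, not something you need to change):

filebeat.spool_size: 2048   # events buffered before they are sent to the output
filebeat.idle_timeout: 5s   # flush interval even if the spooler is not full

Events already in the buffer are not dropped; they are retried until Logstash acknowledges them.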

It seems that filebeat did not see these files at all: there is no trace of them in the log file nor in the registry.

Could it be due to broken communication with logstash?

Maybe you want to share the full configuration, logs, and registry file with us? Given the information I have so far, I'd assume it's due to ignore_older plus clean_inactive: clean_inactive removes entries from the registry, and due to ignore_older these old files are not picked up again...
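
For reference, the documentation requires clean_inactive to be greater than ignore_older + scan_frequency, and both windows have to be longer than the longest outage you want to survive. Purely as an illustration (the values are made up, not a recommendation):

scan_frequency: 30s
ignore_older: 5h      # must cover the longest expected Logstash outage
clean_inactive: 6h    # must stay greater than ignore_older + scan_frequency

With your current 10m / 15m settings, a 4-hour outage is far outside both windows.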

What version of filebeat are we talking about?

So when filebeat is in retry mode, the harvesters end up being blocked because the buffers are full.

And if this situation lasts long enough (longer than clean_inactive, which itself should be longer than ignore_older), new files can simply be ignored.

Is that a good summary of the situation?

@ruflin I am using filebeat 5.4.1

Not sure it's that simple, as filebeat's behavior also depends on some other settings in your configuration file. Please share your filebeat configuration.

Here is the configuration, but there is nothing more in it than what I previously shared:

filebeat.prospectors:
- input_type: log
  paths:
    - D:\_data\aaa\bbb\*.csv
  encoding: utf-8
  document_type: foo
  scan_frequency: 30s
  ignore_older: 10m
  close_eof: true
  close_removed: true
  clean_removed: true
  clean_inactive: 15m

output.logstash:
  hosts: ["logstash-val:5044"]

logging.level: info
logging.to_files: true
logging.files:
  path: D:\logs
  name: filebeat
  rotateeverybytes: 10485760 # = 10MB
  keepfiles: 10

Could you try to set close_removed to false?

The problem is that I cannot easily test this scenario again.
It occurred on a production environment while our logstash server was down for patching.
It is hard to reproduce that environment in our validation stage.
Why do you think that close_removed could change something? The missing files have not been removed.
I think @steffens' explanations are correct: the harvesters were blocked for too long and, as ignore_older is quite short (10 minutes), new files were ignored.

Sorry for the really late reply, this somehow slipped through the cracks. You are right: if the files are not removed, close_removed would not have an effect. As you haven't set a harvester_limit, I would expect the harvesters to still pick up the new files, which should prevent ignore_older from applying. I now wonder if close_inactive could apply, close the file, and then clean_inactive could kick in.
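
To narrow that down, it can help to set the relevant options explicitly instead of relying on the defaults. A sketch only, the values below are simply the 5.x defaults:

close_inactive: 5m    # default: a harvester is closed after 5m without new lines
harvester_limit: 0    # default: no limit on the number of parallel harvesters

Note that with close_eof: true in your configuration, a harvester also closes as soon as it reaches the end of a file.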

It would be really interesting to see the log file from when this happens, as it would show the steps and logic that filebeat applied.

No problem! Thank you for answering.
As I said, it is not easy to simulate this behavior again. There is no patching activity scheduled on our logstash server (and I cannot stop it! :wink:), so we just have to wait...
I have changed ignore_older to 2h, as I think this parameter comes into play in this issue. But maybe I'm wrong...
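
Roughly, the prospector would then look like this (just an illustration; clean_inactive has to go up as well, because filebeat requires clean_inactive to be greater than ignore_older + scan_frequency):

- input_type: log
  paths:
    - D:\_data\aaa\bbb\*.csv
  encoding: utf-8
  document_type: foo
  scan_frequency: 30s
  ignore_older: 2h
  clean_inactive: 3h   # example value only, must stay > ignore_older + scan_frequency
  close_eof: true
  close_removed: true
  clean_removed: true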

If the issue happens again, I will send you the logs!

@oguachard Thanks a lot, appreciate it.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.