Filebeat 5.0 doesn't work for large files

Hi,

I'm trying Filebeat 5.0 with Logstash 2.4. Here is my configuration.

filebeat.prospectors:
- input_type: log
  paths:
    - /home/spark/*.log
  ignore_older: 24h
  clean_inactive: 25h
  multiline.pattern: '^\[0-9]'
  multiline.negate: true
  multiline.match: after
  multiline.max_lines: 1000000
  multiline.timeout: 5m
output.logstash:
  hosts: ["localhost:5044"]
logging.files:
  keepfiles: 100

I found it works fine with newly created files, but there is an older file that was created two days ago, about 30 MB and 300,000 lines, and still being written to, and Filebeat can't handle it. Here is the beginning of the Filebeat log, which keeps repeating the same lines.

2016-11-23T08:59:27Z DBG  Check file for harvesting: /mnt/resource/hadoop/yarn/log/application_1479342877823_1077/container_e51_1479342877823_1077_01_000002/stderr
2016-11-23T08:59:27Z DBG  Start harvester for new file: /mnt/resource/hadoop/yarn/log/application_1479342877823_1077/container_e51_1479342877823_1077_01_000002/stderr
2016-11-23T08:59:27Z DBG  Setting offset for file based on seek: /mnt/resource/hadoop/yarn/log/application_1479342877823_1077/container_e51_1479342877823_1077_01_000002/stderr
2016-11-23T08:59:27Z DBG  Setting offset for file: /mnt/resource/hadoop/yarn/log/application_1479342877823_1077/container_e51_1479342877823_1077_01_000002/stderr. Offset: 0 
2016-11-23T08:59:27Z DBG  New state added for /mnt/resource/hadoop/yarn/log/application_1479342877823_1077/container_e51_1479342877823_1077_01_000002/stderr
2016-11-23T08:59:27Z DBG  Prospector states cleaned up. Before: 1, After: 1
2016-11-23T08:59:27Z INFO Harvester started for file: /mnt/resource/hadoop/yarn/log/application_1479342877823_1077/container_e51_1479342877823_1077_01_000002/stderr
2016-11-23T08:59:28Z DBG  End of file reached: /mnt/resource/hadoop/yarn/log/application_1479342877823_1077/container_e51_1479342877823_1077_01_000002/stderr; Backoff now.
2016-11-23T08:59:29Z DBG  End of file reached: /mnt/resource/hadoop/yarn/log/application_1479342877823_1077/container_e51_1479342877823_1077_01_000002/stderr; Backoff now.
2016-11-23T08:59:31Z DBG  End of file reached: /mnt/resource/hadoop/yarn/log/application_1479342877823_1077/container_e51_1479342877823_1077_01_000002/stderr; Backoff now.
2016-11-23T08:59:32Z DBG  Flushing spooler because of timeout. Events flushed: 1
2016-11-23T08:59:32Z DBG  No events to publish
2016-11-23T08:59:32Z DBG  Events sent: 1
2016-11-23T08:59:32Z DBG  Processing 1 events
2016-11-23T08:59:32Z DBG  New state added for /mnt/resource/hadoop/yarn/log/application_1479342877823_1077/container_e51_1479342877823_1077_01_000002/stderr
2016-11-23T08:59:32Z DBG  Registrar states cleaned up. Before: 1 , After: 1
2016-11-23T08:59:32Z DBG  Write registry file: /var/lib/filebeat/registry
2016-11-23T08:59:32Z DBG  Registry file updated. 1 states written.
2016-11-23T08:59:35Z DBG  End of file reached: /mnt/resource/hadoop/yarn/log/application_1479342877823_1077/container_e51_1479342877823_1077_01_000002/stderr; Backoff now.
2016-11-23T08:59:37Z DBG  Flushing spooler because of timeout. Events flushed: 0
2016-11-23T08:59:37Z DBG  Run prospector
2016-11-23T08:59:37Z DBG  Start next scan
2016-11-23T08:59:37Z DBG  Check file for harvesting: /mnt/resource/hadoop/yarn/log/application_1479342877823_1077/container_e51_1479342877823_1077_01_000002/stderr
2016-11-23T08:59:37Z DBG  Update existing file for harvesting: /mnt/resource/hadoop/yarn/log/application_1479342877823_1077/container_e51_1479342877823_1077_01_000002/stderr, offset: 0
2016-11-23T08:59:37Z DBG  Harvester for file is still running: /mnt/resource/hadoop/yarn/log/application_1479342877823_1077/container_e51_1479342877823_1077_01_000002/stderr
2016-11-23T08:59:37Z DBG  Prospector states cleaned up. Before: 1, After: 1
2016-11-23T08:59:42Z DBG  Flushing spooler because of timeout. Events flushed: 0
2016-11-23T08:59:43Z DBG  End of file reached: /mnt/resource/hadoop/yarn/log/application_1479342877823_1077/container_e51_1479342877823_1077_01_000002/stderr; Backoff now.
2016-11-23T08:59:47Z DBG  Run prospector

The offset in the registry file is always 0. I know the file is larger than max_bytes, which is 10 MiB by default, but under this condition shouldn't Filebeat still be able to emit this event and keep handling the new lines? How should I handle this situation with Filebeat?

Update: if I don't use multiline in Filebeat, it handles large files fine, but with multiline enabled it stops handling large files even though the logs show it keeps tracking them. Did I miss some multiline settings? Does anyone have an idea about this problem? Thanks.

Could you share the log file with us so we can try to reproduce this on our side? You mentioned max_bytes: is a single multiline event larger than max_bytes, or only the log file itself?

Sorry for the late reply.

After doing some experiments, I found that the real problem is not the file size but the multiline configuration.

As I understand it, a multiline event is sent when it accumulates more lines than multiline.max_lines, when its size exceeds max_bytes, or when no new lines arrive within multiline.timeout.
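
For reference, here is a minimal sketch of how those three thresholds could be set explicitly. The values and path are only placeholders, and max_bytes is a prospector-level option in Filebeat 5.0 rather than part of the multiline block:

filebeat.prospectors:
- input_type: log
  paths:
    - /home/spark/*.log
  # Cap the size of any single event; longer events are truncated (default is 10 MiB).
  max_bytes: 10485760
  multiline.pattern: '^[0-9]'
  multiline.negate: true
  multiline.match: after
  # Flush the accumulated event once either limit is hit,
  # or when no new line arrives within the timeout.
  multiline.max_lines: 500
  multiline.timeout: 10s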

In my case, some files already exist before the Filebeat service starts and are still being written to. If I set the three multiline thresholds relatively large, no message is sent for a long time, which doesn't meet my ETA. If I set the thresholds small, the send interval improves, but the first time the harvester sends an event it loses data: for example, when the file is already larger than max_bytes, it only sends the max_bytes worth of data and then updates the offset to the total size of the file as it was when discovered.

Is it possible to both meet the ETA and avoid data loss in my situation, or is this a scenario Filebeat can't support right now?

Not sure I 100% understand the problem. Is the problem that some of the multiline events that are written by your application are not written at once but over a longer period? Can you provide a heavily simplified example of such an event?

Actually, we're using Filebeat and Logstash to upload Spark logs. Some applications write very fast and keep writing to the log files continuously. Filebeat multiline is not that suitable for this kind of streaming scenario: it never flushes, because the timeout is never reached even when the timeout value is set very small. So we decided to handle the multiline issue on the Logstash side instead. Thanks a lot for your kind reply.
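
For anyone finding this later, handling it on the Logstash side could look roughly like the sketch below, using the multiline codec on the beats input. The pattern is only an assumption for lines that start with a digit, auto_flush_interval depends on the codec plugin version, and a single codec instance can interleave lines coming from several files or Filebeat instances, which is why doing multiline in the shipper is usually recommended.

input {
  beats {
    port => 5044
    # Join lines that do not start with a timestamp onto the previous line.
    codec => multiline {
      pattern => "^[0-9]"
      negate => true
      what => "previous"
      # Flush a pending event if no new line arrives for this many seconds.
      auto_flush_interval => 5
    }
  }
}

output {
  stdout { codec => rubydebug }
}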

Thanks for the details. I'm still trying to get my head around this. So Spark is not writing full multiline events into the log file at once but adds only part of each one? Or is it the other way around: Spark only updates the log file from time to time with a large number of lines but doesn't draw a clear border between the ends of multiline events? The reason I keep asking is that I would like to understand whether there is something we could improve in Filebeat so we could cover this case. Perhaps a link to the logging driver docs you are using could also be helpful. Thx.

For a Spark streaming application, the application keeps appending new lines to the log files, possibly several lines at a time. The problem is that the logs also contain Java stack traces. Initially we wanted to leverage Filebeat's multiline feature, but it doesn't have a flush interval we can configure, only a timeout, which varies quite a bit between applications. The Spark applications use log4j for logging; there's nothing special in the Spark logs.

I have one more question: when does Logstash send ACKs to Filebeat, as soon as it receives an event, or only after the event has been processed by the output?

It seems like your log4j logging is quite standard and should work with the Filebeat multiline feature. A multiline event is sent as soon as the next multiline event is detected or the end of the pattern is detected, depending on your config options. Timeouts only come into play for the last event in a log file, when no more log lines are being added. So if you have log lines flowing in constantly, multiline events should be sent off constantly. Perhaps I'm still missing something :wink: In case the timeouts differ for the log files at the end, you can use multiple prospectors with different timeouts.
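
For example, the multiple-prospectors suggestion could look roughly like this; the paths, pattern, and timeout values are placeholders to adapt to your own log groups:

filebeat.prospectors:
- input_type: log
  paths:
    - /home/spark/streaming/*.log   # fast, constantly written logs (hypothetical path)
  multiline.pattern: '^[0-9]'
  multiline.negate: true
  multiline.match: after
  multiline.timeout: 5s
- input_type: log
  paths:
    - /home/spark/batch/*.log       # slower, bursty logs (hypothetical path)
  multiline.pattern: '^[0-9]'
  multiline.negate: true
  multiline.match: after
  multiline.timeout: 1m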