Filebeat 5.0 with multiline: event data split into two events

Hi,

I am testing Filebeat sending directly to an Ingest node.
Filebeat reads multiline events.
Sometimes an event (a Java stack trace) gets cut and split into two events.
The first part matches the Grok pattern, while the second part does not, so it is directed to a special index for failed events.

The multiline feature is not supposed to let this happen, since there is no data-flow problem; Filebeat reads it all.
Could it be related to buffers or something like that?
How can I make sure that an event will not be split?

Thanks,

Ori

Can you provide the Filebeat config that you are using to collect the multiline events, and a sample stack trace that was split into two events?

The only setting I can think of that would cause a split is the timeout. This might be occurring if the app is buffering the log output before flushing. Is the log file on a local disk (no network mounts)?
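
For reference, the flush timeout is the multiline.timeout setting in the prospector configuration (it defaults to 5s in Filebeat 5.x): if no new line arrives within that window, the lines accumulated so far are flushed as one event. A minimal sketch, with a placeholder pattern rather than your real one:

multiline:
  pattern: '^[[:digit:]]{4}-'   # placeholder start-of-event pattern
  negate: true
  match: after
  timeout: 5s                   # default; pending lines are flushed after 5s without new input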

Hi Andre,

Here is the filebeat config:

filebeat:
  registry_file: /opt/sw/filebeat/registry/qa-app01/registry_applogs
  prospectors:
    # AppLogs
    -
      paths:
        - /mnt/qa-app01-logs/AppLogs/RT*.log*
      input_type: log
      multiline:
        pattern: ^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[[:space:]][[:digit:]]{1,2},[[:space:]][[:digit:]]{4}[[:space:]][[:digit:]]{1,2}:[[:digit:]]{2}:[[:digit:]]{2}[[:space:]](AM|PM)
        negate: true
        match: after
      exclude_lines: [ ^\* ]
      ignore_older: 2h
      close_inactive: 5m
      document_type: applogs
      harvester_buffer_size: 16384
      encoding: plain
      fields_under_root: true
      fields:
        app_env: E1
        type_id: 110
        time_to_keep: 43200
        hostname: QA-APP01

logging:
  to_files: true
  to_syslog: false
  files:
    path: /opt/sw/filebeat/logs/qa-app01
    name: filebeat_applogs.log
    rotateeverybytes: 2097152
    keepfiles: 10

output:
  elasticsearch:
    hosts: ["qaelk06:9200"]
    index: "application-logs"
    ## bulk_max_size default: 50
    bulk_max_size: 10000
    ## flush_interval default: unknown
    flush_interval: 60
    parameters.pipeline: "app-applogs-pipeline"

I am having a hard time finding which event was the one that got split...
The logs are on a network mount: Filebeat is installed on a Linux machine, while the logs are on a Windows machine, accessed using the CIFS protocol.

We have 22 different log types. At first Filebeat was installed on the Windows servers, but it caused high CPU usage, so we decided to test it on a dedicated server.
We are testing it with the new ingest mechanism of Elasticsearch version 5.

Thanks,

Ori

One more thing: I do not think it is a timeout.

If it were a timeout, I would expect to see many small events, one for each unrelated line of the stack trace.
Instead I see several lines grouped together as one event, which means they were combined, yet the event as a whole got split.

Ori

Good point.

So how about adding a file output so you can inspect the events generated by Filebeat, and enabling debug logging? Then the next time a stack trace is published to ES split across two events, you can check what the event looked like in the file output and find the log entries from when the stack trace was processed.
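
Roughly like this, merged into the existing output and logging sections (Filebeat 5.x allows several outputs to be enabled at once; the path and filename below are placeholders):

output:
  elasticsearch:
    # ... existing elasticsearch settings ...
  file:
    path: "/tmp/filebeat-debug"
    filename: filebeat-events

logging:
  level: debug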

Hi,

Still have not reproduced it...

Could it be related to the following?

Maybe the stack trace is being written in parts, so when the timeout hits, Filebeat sends the first part of it, and once the stack trace has finished being written, Filebeat sends the rest? Or maybe Filebeat knows it is in the middle of a multiline event, so it sends all the lines together?

Ori

OK, I made a little test, and it does seem related!

I have managed to reproduce it.
Filebeat just takes the remaining lines of the event and combines them into a single event.
So it is not, as I expected, many single-line events, but one event containing the rest of the lines. The ingest pipeline cannot handle it, since the beginning of the event is missing: it was already sent when the timeout occurred.

What would be the best option to avoid splitting the event?
Should I just set the timeout to -1?
It should affect only the current harvester, right?

Ori

You would have to set the timeout to a large value. I'm kind of surprised that a multiline event doesn't get flushed in one go; we even discussed internally that the timeout is not really needed except for the EOL flush.

What kind of system is creating this log file?

A managed server under WebLogic.
This is a Java application.

Ori

I wonder if the network mount could cause this issue.

I have managed to reproduce it on my desktop.
Please review my previous comments in this thread: it seems the stack trace is not written all at once, and the timeout kicks in.

Ori

Partially written events are normally caused by buffering effects. E.g. if one is using a BufferedWriter in Java, the buffer is only flushed once it becomes full or the writer is closed (there is no buffer timeout). This is quite common behavior in output implementations that use buffering to increase throughput; unfortunately it increases latency in the logging case. If no more events are logged, an event might sit partially stuck in the buffer for an undetermined amount of time (or might get lost if the application crashes).

Personally I consider this to be an issue in the application writing the logs. If something bad happens I'd prefer my logs to be flushed more 'nowish'.

One option is to set the timeout to a very large value or to -1 (disabling the timeout). The problem with disabling the timeout is that if the last event is a multiline event, we cannot tell whether it is finished until either another event is received or the file is closed (again increasing latency).
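
In config terms this means raising (or disabling) multiline.timeout in the prospector section. A minimal sketch, keeping the existing pattern/negate/match keys unchanged (60s is just an example value):

multiline:
  # ... existing pattern, negate, match settings ...
  timeout: 60s    # raise well above the application's flush interval,
                  # or set to -1 to disable the flush timeout entirely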

Hi Steffens,

Understood!
I have raised it to 10s, instead of the default 5s.
Is there a good way to overcome this in Filebeat itself?
Maybe if a line is not currently part of a multiline event, then ignore it?
Filebeat has conditions; are there any flags that can indicate this (whether we are in the middle of a multiline event, or waiting for the next line that matches the pattern)?

Thanks,

Ori

I think there is currently no good approach to overcome this in Filebeat, as it really depends on the application writing the log files. One possible approach would be a start and end pattern for multiline, but I think that will not really work for most multiline Java logs either.

What do you mean by "not currently part of a multiline event"?

Currently no such flags are published. One flag we are working on can be found here: https://github.com/elastic/beats/pull/2279, but that only lists the number of lines. An additional flag could be added for the case where an event was flushed because of the timeout?

There is currently no way of working around this, in Filebeat or in general. You need a lookahead of 1 to figure out whether you are done processing a multiline event or not. Even if Filebeat were to operate at the byte level instead of on lines, Java stack traces would pose the same parsing problem: the next symbol, which decides whether the multiline parse is complete, is still missing.

Adding a start-pattern condition to Filebeat's multiline support might help a little, as we could use it to temporarily disable multiline (publishing all non-multiline events immediately). But once you are inside a multiline event, delays due to buffering will occur, and there is no good workaround. Having an end pattern would require you to modify the output of stack traces a little (which needs control over the application) to make them matchable.

Talking about start/end patterns: assuming you have control over your application and Filebeat, but don't want to modify the logging libraries or strategies, an alternative is to log the stack trace immediately followed by an 'empty' event like ... INFO - stack trace end. This 'empty' event can be filtered out by Filebeat (exclude_lines), but it adds some nice bracketing for stack traces. With such bracketing in place, the timeout can safely be set to -1. Bracketing plus a start condition would be ideal in your case, I think; a rough sketch follows.
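
For illustration, assuming the application can be made to emit a literal marker line ending in stack trace end after every trace (the marker text is hypothetical), the Filebeat side could look roughly like this:

multiline:
  # ... existing pattern, negate, match settings ...
  timeout: -1                                 # safe once every trace is followed by a marker line
exclude_lines: [ '^\*', 'stack trace end' ]   # existing filter, plus dropping the marker events

Note that exclude_lines is applied after multiline combination, so the marker (which starts a new event by matching the timestamp pattern) is dropped as a whole event; pick a marker string distinctive enough not to appear inside real traces.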

Thanks Steffens!
Great idea!
I am just not too sure I will be able to do it in the application.

Ori

Here is an example of a second part of an event:

\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000oup$1$1474290082222$6223b249fbaf3b0788731ac45f4644c6edc04eb612afbaaf1bd1f2b4f7082564:1948417363-SessionsCounterNewCacheInfra:Touching session task : sessionSuffix=17469564 local sessions map=SessionsCounterNewCacheInfra_allSessions

What does the \u0000 stand for?
I have encountered the same with single-line events...
Can it be that the whole event is not written at once for single-line events as well?

Ori

Hm... \u0000 is the Unicode NULL character: it is literally a byte with value 0 written to the file. Can you check with a hex viewer whether there are 0-bytes in the log file? Normally it is used, like the C-style \0, to indicate the end of a string. But so many consecutive NULL characters...

Are any 'empty' buffers being flushed somewhere, or are buffer offsets increased and flushed before the content is put in place?

Is the log file using some binary formatter? The $ characters look like an additional separator, and the value 1474290082222 almost looks like a timestamp to me.

It is in Elasticsearch; this is what the ingest node receives from Filebeat.
So, can it be that the event is split after a timeout for single-line events as well?

The NULLs reminded me of this: https://stackoverflow.com/questions/6814404/java-inputstream-read-methods-returning-ascii-nul-characters-for-file-in-a-nfs Any chance that is happening in your case?