Multiple Multiline Options

Running into a problem caused by our developers not using consistent data/time format.

Jul 06, 2017 6:24:58 PM org.apache.catalina.startup.HostConfig undeploy
INFO: Undeploying context []
2017-07-06 18:24:59.103 [ContainerBackgroundProcessor[StandardEngine[Catalina]]] [] WARN o.s.b.f.s.DisposableBeanAdapter - Invocation of destroy method failed on bean with name 'ehcache': java.lang.NoClassDefFoundError: net/sf/ehcache/store/chm/SelectableConcurrentHashMap$KeySet
Jul 06, 2017 6:24:59 PM org.apache.catalina.loader.WebappClassLoaderBase clearReferencesThreads
SEVERE: The web application [] appears to have started a thread named [Statistics Thread] but has failed to stop it. This is very likely to create a memory leak.

Is there any "or" functionality in the multiline? If not, our filebeat deployment is probably DOA. I need to be able to match EITHER date/time format to detect the end of multiline entries.

Is this doable?

Want somthing like this:

default['filebeat']['prospectors']['catalina.out'] = '

Catalina.out log

  • input_type: log
    • /var/log/tomcat7/catalina.out
      multiline.pattern: '^[[0-9]{4}-[0-9]{2}-[0-9]{2}' or 'someuglyregexforenglishmonths'
      multiline.negate: true
      multiline.match: after
      exclude_lines: ["INFO: ", "WARN: ", "[] DEBUG"]

Seriously, different time formats in same log file?

See regular expression syntax docs:

For this use-case I'd use an or-expression (expr1)|(expr2).

Fortunately this is still a simple use-case, but feel free to open an enhancement request to configure multiple multiline state-machines per log-file (in case things become more complicated).

Ouch. You advice seems to work from the regex expression approach though the readability suffers:

multiline.pattern: '(^[0-9]{4}-(((0[13578]|(10|12))-(0[1-9]|[1-2][0-9]|3[0-1]))|(02-(0[1-9]|[1-2][0-9]))|((0[469]|11)-(0[1-9]|[1-2][0-9]|30)))$)|(\b\d{1,2}\D{0,3})?\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)\D?(\d{1,2}\D?)?\D?((19[7-9]\d|20\d{2})|\d{2})'

Will put in the feature request, but this looks like it may work.

1 Like

Using the pattern above, I am not seeing data in Kibana that makes sense. Near as I can tell, the RegEX will indeed detect the date strings, but are wildcards required to match the rest of the line? Tried that, and almost works, but seeing splits in odd places. Pulling from the log file directory, I find this message: When I check in Kibana, the section in bold is where the message begins, and it ends in 5 lines, not on the next date field, but on just before the word "Caused: " which doesnt match the filter at all.

2017-07-07 03:11:04.865 [http-apr-8080-exec-147] [Ar0zklkfR9Kjh9tb7DLpEw] DEBUG c.r.a.s.jersey.log.LoggingFilter - WebRequestLog{httpMethod=POST, url=
pi/v2/authentication/sa, protocol=http, request={ "userGuid":"cknv1piopp", "deviceIdentifiers":{"imei":"99000697260243", "modelCode":"",
at [guice-servlet-4.0-beta5.jar:na]
at [guice-servlet-4.0-beta5.jar:na]
at [guice-servlet-4.0-beta5.jar:na]
at com.rosetta.adk.hibernate.filter.MultiPersistFilter.doFilter( [adk-hibernate-1.0.26.jar:na]
at [guice-servlet-4.0-beta5.jar:na]
at com.yammer.metrics.web.WebappMetricsFilter.doFilter( [metrics-web-2.2.0.jar:na]
at [guice-servlet-4.0-beta5.jar:na]
at [guice-servlet-4.0-beta5.jar:na]
at$ [guice-servlet-4.0-beta5.jar:na]
at$ [guice-servlet-4.0-beta5.jar:na]
at$ [guice-servlet-4.0-beta5.jar:na]
at [guice-servlet-4.0-beta5.jar:na]
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter( [catalina.jar:7.0.65]
at org.apache.catalina.core.ApplicationFilterChain.doFilter( [catalina.jar:7.0.65]
at org.apache.catalina.core.StandardWrapperValve.invoke( [catalina.jar:7.0.65]
at org.apache.catalina.core.StandardContextValve.invoke( [catalina.jar:7.0.65]
at org.apache.catalina.authenticator.AuthenticatorBase.invoke( [catalina.jar:7.0.65]
at org.apache.catalina.valves.AccessLogValve.invoke( [catalina.jar:7.0.65]
at org.apache.catalina.core.StandardHostValve.invoke( [catalina.jar:7.0.65]
at org.apache.catalina.valves.ErrorReportValve.invoke( [catalina.jar:7.0.65]
at org.apache.catalina.valves.AccessLogValve.invoke( [catalina.jar:7.0.65]
at org.apache.catalina.core.StandardEngineValve.invoke( [catalina.jar:7.0.65]
at org.apache.catalina.connector.CoyoteAdapter.service( [catalina.jar:7.0.65]
at org.apache.coyote.http11.AbstractHttp11Processor.process( [tomcat-coyote.jar:7.0.65]
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process( [tomcat-coyote.jar:7.0.65]
at$SocketProcessor.doRun( [tomcat-coyote.jar:7.0.65]
at$ [tomcat-coyote.jar:7.0.65]
** at java.util.concurrent.ThreadPoolExecutor.runWorker( [na:1.8.0_20]**
** at java.util.concurrent.ThreadPoolExecutor$ [na:1.8.0_20]**
** at org.apache.tomcat.util.threads.TaskThread$ [tomcat-coyote.jar:7.0.65]**
** at [na:1.8.0_20]**
Caused by: com.rosetta.samsmk.oep.exception.application.DeviceNotFoundException: Invalid imei
at com.rosetta.samsmk.oep.registration.service.NahqDeviceValid

Your regex looks somewhat complicated to me... Well, regexes do always look complicated...
No need to be to strict in the regex. Don't try to parse the date, but capture the 'shape.

Plus I'm not sure the ^ operator used is really active for all patterns.

In my multiline example I've been using variables to split the patterns:

One can simulate variables in beats configuration files via (I didn't test this, hope it works ;)):

- type: log
  multiline.pattern: '^${patterns.timestamp}'

  timestamp: '((${patterns.timestamp1})|(${patterns.timestamp2}))'

  # capture dates of type '2017-01-01 01:02:03.456'
  timestamp1:  '\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}.\d{3}'

  # captures dates of type 'Jan 01, 2017 1:23:45 AM' or  'Jan 01, 2017 11:23:45 PM'
  timestamp2: '[JFMAMSOND][a-z]{2} \d{2}, \d{4} \d\d?:\d{2}:\d{2} [AP]m'

With things becoming more complicated and devs potentially adding more quirks, you might consider some automated testing support:

  • have corpus of logs
  • have corresponding json file with actual events

We do this for filebeat modules. Every module has a defined test corpus (e.g. apache access module) we do run as part of our system tests (would be cool if filebeat would have a script to run these kind of tests automatically). The tests use the file output in filebeat, to load and compare the events with the expected output.

Having automated tests, gives you a chance to capture breakages in case of you having to adapt/refine the regular expression.

Not sure I do understand the line splitting issue. But we've seen similar issues with events being split in multiline with log4j RollingFileAppender. Issues have been caused by the Appender buffering content without (or too long) flush timeouts (buffer is only flushed once full). The multiline state machine in filebeat has a flush timeout, sending events if it's waiting for too long. Reason is, the state machine can not tell if file/event is over when no more content is appended to the file. You can disable the timeout by setting multiline.timeout: 0. If the multiline event is the last event in the log file, the event will be send once filebeat finally closes the log file.

Actually, I like your suggestion. My regex is actually a single regex for each date format separated by an "or" character. Defining each data format as a separate regex in a variable would make it easier to read. May end up making a library of regex's that can be called that way. Thanks for your advice!

btw. you can use the -c setting to load multiple configuration files. This way you can provide a set of predefined regex patterns in a separate file...

Having a list of predefined patterns in an extra file is actually what grok/logstash is doing. Having to type ${pattern.<name>} and such is not very convenient. Not sure how this will look in beats (I'd prefer not to have to ship to many files with a beat), but maybe we can define a default namespace to look at for regexp patterns.

Then my sample (assuming default namespace being pattern) would become:

- type: log
  multiline.pattern: '^${timestamp}'

  timestamp: '(${timestamp1}|${timestamp2})'

  # capture dates of type '2017-01-01 01:02:03.456'
  timestamp1:  '(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}.\d{3})'

  # captures dates of type 'Jan 01, 2017 1:23:45 AM' or  'Jan 01, 2017 11:23:45 PM'
  timestamp2: '([JFMAMSOND][a-z]{2} \d{2}, \d{4} \d\d?:\d{2}:\d{2} [AP]m)'

Next step would be to ship a predefined (builtin) set of patterns with beats, though.

I have no idea how a exactly this kind of feature will look like in beats. But feel free to open an enhancement request for further discussions.

While convenient, the patterns generated this way are often not 'optimal' + in loads of multiline cases I've seen overkill (bad for performance).

1 Like

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.