Multiline filter fails on one record when handling a large file

Hi Logstashers,

I'm seeing some odd behaviour with Logstash when reading files with a multiline filter enabled. Occasionally a multiline log entry fails to get detected/tagged as multiline, which results in a _grokparsefailure for the 'second' part of the entry (the first part gets treated as a single-line entry, and the second doesn't match the subsequent grok pattern, hence the failure).

The odd thing is, this only occurs when I'm reading a test file of several hundred log lines. When I chop the file down so it only contains the offending entry plus a few others (let's say five log entries in all), the multiline entry gets processed without any problems at all...

This implies some sort of threading/race issue, but I'm afraid I'm pretty stumped at the moment as to what it could be.

Below is my filter, which includes the multiline statement plus the subsequent grok:

    filter {
        multiline {
            patterns_dir => "C:\Applications\Elastic\BWLogReader_Test\logstash_patterns"
            pattern => "^%{TIBCO_DATESTAMP} "
            negate => true
            what => "previous"
        }
        if [type] == "tibco_bw" {
            # Extract main fields
            grok {
                patterns_dir => "C:\Applications\Elastic\BWLogReader_Test\logstash_patterns"
                match => [ "message",
                           "%{TIBCO_DATESTAMP:log_time}(\sGMT\s%{INT})?\s%{BW_APP_NAME:application}-%{WORD:engine}\s%{WORD:log_level}\s\[%{NOTSPACE:log_type}\]\s%{GREEDYDATA:log_msg}" ]
            }
        }
    }

    output {
        stdout {
            codec => rubydebug
        }
    }
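
For completeness, the input side (not shown above) is just a plain file input reading the test log, roughly along these lines (the start_position setting is only something I use while testing):

    input {
        file {
            path => "C:/Applications/Elastic/BWLogReader_Test/test_logs/test_bw_multiline.log"
            type => "tibco_bw"
            start_position => "beginning"   # read the test file from the start rather than tailing it
        }
    }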

Below is a sample of the data I see on standard out when the multiline handling fails (as you can see, Logstash has treated the entry as two separate events):

    {
           "@version" => "1",
         "@timestamp" => "2015-08-12T08:37:47.378Z",
               "host" => "L210845",
               "path" => "C:/Applications/Elastic/BWLogReader_Test/test_logs/test_bw_multiline.log",
               "type" => "tibco_bw",
           "log_time" => "2015 Aug 12 09:37:47:378",
        "application" => "BW.EAI_BW_PeopleSoftAppServices_3",
             "engine" => "PeopleSoftAppServices_3_055",
          "log_level" => "Debug",
           "log_type" => "BW-User",
            "log_msg" => "ResponseMessage Job-50528969 [BusinessServices/Peoplesoft/RenderTSO/Interface/SourceRequest.process/GA_GP_RenderTSOGroup/LogExit]: <root><ns0:ResponseMessage xmlns:ns0=\"http://www.tibco.com/xmlns/ae2xsd/2002/05/ae/Common\"><MESSAGE>&lt;?xml version=\"1.0\" encoding=\"UTF-8\"?&gt;",
             "job_id" => "50528969",
               "tags" => [
            [0] "job_id_found"
        ]
    }
    {
           "message" => "&lt;ResultStatus&gt;\n    &lt;STATUS_CODE&gt;0&lt;/STATUS_CODE&gt;\n&lt;/ResultStatus&gt;</MESSAGE></ns0:ResponseMessage><ReplySubject>CRMDomain.STEVE_2</ReplySubject></root> ",
          "@version" => "1",
        "@timestamp" => "2015-08-14T13:13:02.619Z",
              "host" => "L210845",
              "path" => "C:/Applications/Elastic/BWLogReader_Test/test_logs/test_bw_multiline.log",
              "type" => "tibco_bw",
              "tags" => [
            [0] "multiline",
            [1] "_grokparsefailure",
            [2] "job_id_not_found"
        ],
            "job_id" => "None"
    }

...and here is the same record as processed by Logstash when I've made the input file much smaller (as you can see, it's now correctly recognised as multiline and combined into a single Logstash event):

    {
           "@version" => "1",
         "@timestamp" => "2015-08-12T08:37:47.378Z",
               "host" => "L210845",
               "path" => "C:/Applications/Elastic/BWLogReader_Test/test_logs/test_multiline_1.log",
               "type" => "tibco_bw",
               "tags" => [
            [0] "multiline",
            [1] "job_id_found"
        ],
           "log_time" => "2015 Aug 12 09:37:47:378",
        "application" => "BW.EAI_BW_PeopleSoftAppServices_3",
             "engine" => "PeopleSoftAppServices_3_055",
          "log_level" => "Debug",
           "log_type" => "BW-User",
            "log_msg" => "ResponseMessage Job-50528969 [BusinessServices/Peoplesoft/RenderTSO/Interface/SourceRequest.process/GA_GP_RenderTSOGroup/LogExit]: <root><ns0:ResponseMessage xmlns:ns0=\"http://www.tibco.com/xmlns/ae2xsd/2002/05/ae/Common\"><MESSAGE>&lt;?xml version=\"1.0\" encoding=\"UTF-8\"?&gt;\n&lt;ResultStatus&gt;\n    &lt;STATUS_CODE&gt;0&lt;/STATUS_CODE&gt;\n&lt;/ResultStatus&gt;</MESSAGE></ns0:ResponseMessage><ReplySubject>CRMDomain.STEVE_2</ReplySubject></root> ",
             "job_id" => "50528969"
    }

If anyone can shed any light on this, that would be really helpful. At the moment it's driving me crazy... :joy:

Cheers,
Steve

All,

A bit more information on this issue. After doing some further investigation I've noticed:

  1. The error always occurs for the record at a specific location within the
    input file, regardless of which record happens to be at that line.
    Reordering the data within the test file doesn't change this, so the
    problem is not with a particular record.
  2. If I disable the output to Elasticsearch and simply log to an output
    file, the error does not occur at all and every record gets processed
    correctly.

I've opened this as an issue on GitHub (see Multiline filter failure on output to ElasticSearch - seems to be location dependent within an input file), but if anyone here can shed any light on it, that would still be useful.

Cheers,
Steve

Hi.

I had the same issue with long multiline JSON input and parsing errors when using the multiline filter.
First, try the "max_age" option. If I set it higher than the default of 5 seconds, fewer errors occur.
It seems that if processing a JSON object takes longer than max_age, that object is still parsed correctly, but the following JSON object always fails - no matter how long it is. This was driving me crazy :worried:
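
In my filter it looks roughly like this (the pattern is just a placeholder - keep whatever start-of-event pattern you already use, and the value of 15 is only an example):

    filter {
        multiline {
            pattern => "^\{"      # placeholder: whatever marks the start of a new event in your log
            negate => true
            what => "previous"
            max_age => 15         # flush an unfinished multiline event after 15s instead of the default 5s
        }
    }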

But nevertheless, using Logstash to parse long multiline JSON objects is not a good idea: for me it took more than 30 seconds to parse one 15k-line object, while five lines of Perl parsed a file containing six of those JSON objects in 0.05 seconds.

I didn't even get an answer to my SO question. My guess is that the multiline filter isn't used very often for tasks like this.

@stevedearl have you tried this with the multiline codec, or just the filter?

Hi Chaps - thanks for the responses.

Thomas, I don't think the length of the entries is a factor in my case, as they're only perhaps five lines long with fewer than 200 characters per line.

Andrew - I did switch to the multiline codec and am using it as an interim solution for now. It doesn't seem to suffer from the issue I was having. However, it does have another problem (from what I can see) in that it only detects (and processes) a multiline log event when the next (non-multiline) log event arrives. This makes some sense - how else would Logstash know that the next line to be written isn't a continuation of the multiline event?
This is a problem, though, where logging is intermittent and there may be a considerable period before the next event occurs.
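
For reference, this is roughly how I've plugged the codec into the file input (same pattern and patterns directory as before; the path here is just my test file):

    input {
        file {
            path => "C:/Applications/Elastic/BWLogReader_Test/test_logs/test_bw_multiline.log"
            type => "tibco_bw"
            codec => multiline {
                patterns_dir => "C:/Applications/Elastic/BWLogReader_Test/logstash_patterns"
                pattern => "^%{TIBCO_DATESTAMP} "
                negate => true
                what => "previous"
            }
        }
    }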

Unfortunately there doesn't seem to be a 'max_age' parameter for the multiline codec as there is with the multiline filter. That parameter tells the filter to treat the multiline log event as complete after n seconds (if I remember correctly), which effectively 'flushes' the multiline event into the Logstash pipeline for processing without having to wait for the next event.

Cheers,
Steve

@stevedearl we're most likely going to deprecate or remove the multiline filter in favor of the multiline codec at some point, so this information is very useful to us.

I'm somewhat curious what the problem domain is where this 'max_age' parameter is acceptable. If you rely on it, isn't it the case that with a continuous stream you could have an event that never actually flushes?

FWIW we're planning on bringing the codec up to parity with the filter before we remove the filter.

Interesting - it's certainly not a Ruby language issue, but almost certainly an implementation issue.

If you'd like to submit a patch, it'd be welcome! TBH, though, most multiline uses are on the order of 2-50 lines, not 15k.

Hi @Andrew_Cholakian1,

Actually, we're not at the stage of using ELK in production - I was looking into it as a proof of concept, so I don't have a true use case involving 'max_age'.

I don't think there's a huge risk of a 'continuous stream' when using the multiline filter and max_age, as the 'pattern' parameter should eventually allow the filter to recognise the end of the multiline event and the start of the next one, and flush the multiline event into the pipeline. One issue may be that if max_age is too short there's a chance the flush occurs before the end of the multiline event - but again I think that's unlikely.

I appreciate the information on the switch to the codec from the filter though. I'll keep an eye out for when that happens. Let me know if I can get you any other information that might help.

Cheers,
Steve

I have also been using the multiline filter for really large logs like the ones above - at times hundreds of kilobytes. I came across OOM errors, which were resolved by increasing the Logstash heap size. A few incorrect logs, as described in the OP, were corrected in my case using max_age. But yes, I also find the multiline filter not very reliable; going by the comments above it looks like the multiline codec is the way forward. I also hit 'entity expansion too large' REXML errors, which I think I worked around by setting the limit to a larger value in rexml.rb. By the way, are you using the latest Logstash version?

Steve, what I'm getting at is: have you considered max_lines or max_bytes? My greater point is that time is an odd dimension to clip on. What if you have very high throughput over a short period?

Also, in case you aren't aware of these settings: the default max_lines setting is 500, which may be the source of some of your issues if you're relying on this being unbounded.
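
Concretely, those are just extra options on the codec block you already have - something like this (the numbers are purely illustrative):

    codec => multiline {
        # ...existing patterns_dir/pattern/negate/what settings...
        max_lines => 1000       # default is 500 lines per multiline event
        max_bytes => "10 MiB"   # upper bound on the size of a single multiline event
    }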

Yep. Use the codec. When using the multiline filter with multiple inputs you can get very weird results unless you are extremely careful. That's why we're going to deprecate it.

Hi @andrewvc, thanks for the additional comments. I suspect I'll stick with the codec for now, given it's going to be the way forward. I do think there's a use case for it correctly handling the scenario where we get a multiline event but the next log event isn't received until some time later. If that subsequent event arrives an hour after the multiline event, the multiline event wouldn't become visible in Kibana for an hour.

I agree that in most logging scenarios this would be unlikely, as the rate of logging would be higher, but we might have a few cases where the log traffic is pretty sparse.

Regards,
Steve