Multiline sends unmatched multiline beginning of file #1611

filebeat-1.2.
Windows 2007

Steps to Reproduce:

Extract sample.zip (attached)run filebeat -e -v -c filebeat.yml

----------------------------------- filebeat.yml

# This is the main configuration file for this 
filebeat:
  prospectors:

    -
      paths:
        - sample.log
      input_type: log
      multiline:
        max_lines: 7
        pattern: ".*---------- Begin Data ----------"
        negate: true
        match: after
      scan_frequency: ${SCAN_FREQUENCY:1s}
      tail_files: ${TAIL:false}
  
  spool_size: ${SPOOL_SIZE:2048}
  idle_timeout: ${IDLE_TIMEOUT:5s}
  
  registry_file: filebeat.registry
  
  config_dir: ${PROSPECTORS:./prospectors}
  
output:

  console:

shipper:

logging:

  level: ${LOG_LEVEL:warning}

  files:

    rotateeverybytes: ${LOG_SIZE:10485760} # = 10MB

----------------------------------- sample.log
04/18/2016 05:07:42.775 INFO=>Hello there world.
04/18/2016 05:07:42.775 INFO=>You should not see this
04/18/2016 05:07:42.776 INFO=>Are you seeing this and why
04/18/2016 05:07:42.776 INFO=>Yep some data here
04/18/2016 05:07:43.205 INFO=>More text
04/18/2016 05:07:50.770 INFO=>Even more stuff
04/18/2016 05:07:56.126 INFO=>This line has the
04/18/2016 05:07:56.126 INFO=>Hi there
04/18/2016 05:08:47.546
04/18/2016 05:08:47.546 ------------------ Begin Data -----------------
04/18/2016 05:08:47.546 This I want Line 2
04/18/2016 05:08:47.546 This I want Line 3
04/18/2016 05:08:47.546 This I want Line 4
04/18/2016 05:08:47.546 This I want Line 5
04/18/2016 05:08:47.546 This I want Line 6
04/18/2016 05:08:47.546 This I want Line 7
04/18/2016 05:08:47.546 --------------------------------------------------

Observed:

*********************** UNEXPECTED NON MATCHING DATA

2016/05/11 13:34:24.325552 publish.go:88: INFO Start sending events to output

{"@timestamp":"2016-05-11T13:34:24.286Z","beat":{"hostname":"ITEM-AX35133","name":"ITEM-AX35133"},"count":1,"fields":null,"input_type":"log","message":"04/18/2016
05:07:42.775 INFO=\u003eHello there world.\r\n04/18/2016 05:07:42.775
INFO=\u003eYou should not see this\r\n04/18/2016 05:07:42.776
INFO=\u003eAre you seeing this and why\r\n04/18/2016 05:07:42.776
INFO=\u003eYep some data here\r\n04/18/2016 05:07:43.205
INFO=\u003eMore text\r\n04/18/2016 05:07:50.770 INFO=\u003eEven more
stuff\r\n04/18/2016 05:07:56.126 INFO=\u003eThis line has
the","offset":0,"source":"sample.log","type":"log"}

2016/05/11 13:34:26.777797 publish.go:104: INFO Events sent: 1

2016/05/11 13:34:26.778797 registrar.go:157: INFO Registry file updated. 1 states written.

*********************** MATCHING DATA

{"@timestamp":"2016-05-11T13:34:24.286Z","beat":{"hostname":"ITEM-AX35133","name":"ITEM-AX35133"},"count":1,"fields":null,"input_type":"log","message":"04/18/2016
05:08:47.546 ------------------ Begin Data
-----------------\r\n04/18/2016 05:08:47.546 This I want Line
2\r\n04/18/2016 05:08:47.546 This I want Line 3\r\n04/18/2016
05:08:47.546 This I want Line 4\r\n04/18/2016 05:08:47.546

This I want Line 5\r\n04/18/2016 05:08:47.546 This I want Line
6\r\n04/18/2016 05:08:47.546 This I want Line
7","offset":426,"source":"sample.log","type":"log"}

2016/05/11 13:34:31.778297 publish.go:104: INFO Events sent: 1

2016/05/11 13:34:31.779297 registrar.go:157: INFO Registry file updated. 1 states written.

filebeat currently does not support start/end pattern. I see log-lines before multiline events having a log-level. Is this always the case? Does multiline event contain log-level?

@Stuart_Fraser Please do not double post issues and always use the discuss forum first as it is stated in the github issue: https://github.com/elastic/beats/issues/1611

Found issues before discussion. Saw message after opening the issue to use discussions.

I will start with discussion going forward. Thanks.

We are using max_lines to effectively get the begin-end block. This is working fine.

If possible, we'd prefer to not forward the undesired first few lines from the files. That is, the first block of lines does not match our pattern, but it is consistently forwarded.

The source log files are pretty much exactly like the sample.

Here's some sample patterns from the middle of a log file. There are lines that actually have no information (perhaps to make the file readable). And there are lines that appear to have a log level.

04/18/2016 05:15:20.054 ------------------ Begin Request -----------------
04/18/2016 05:15:20.054 Several lines of data we want here
....
04/18/2016 05:15:20.054 --------------------------------------------------
04/18/2016 05:15:20.054
04/18/2016 05:15:20.054 INFO=>Calling Service: ABC
04/18/2016 05:15:20.055 INFO=>User ID: username
04/18/2016 05:15:20.055 INFO=>App Code:
04/18/2016 05:15:20.114
04/18/2016 05:15:20.114 ------------------- End Request ------------------
04/18/2016 05:15:20.114 More lines of data here we want
04/18/2016 05:15:20.114 .. configured as another prospector
....
04/18/2016 05:15:20.114 --------------------------------------------------
04/18/2016 05:15:20.114
04/18/2016 05:25:06.874
04/18/2016 05:25:06.874 ------------------ Begin Request -----------------
04/18/2016 05:15:20.054 Several lines of data we want here
....
04/18/2016 05:25:06.874 --------------------------------------------------

I have not seen an update. Is the issue being investigated?

We will look into this as soon as we get the time. The way I would approach it would be to write a system test to check if I can reproduce it and then find a solution to get the system test green. So in case you have some python experience it would be very helpful to get a PR with system tests that reproduces the problem: https://github.com/elastic/beats/tree/master/filebeat/tests/system

In case you need urgent support, we also have paid subscriptions: https://www.elastic.co/subscriptions

How exactly does you log look like. How is data here we want formatted? Just assuming data here we want is just plain text without empty lines and everything else is either empty lines, ----- lines or starts with INFO=> one can build a pattern matching exactly these lines + negate + use exlucde_lines option to filter out these unwanted lines.

start-stop patterns are not supported yet and without knowing exact formatting of wanted data or if any of the mentioned patterns can not be safely filtered I have a hard time to define a workaround regex (if possible at all).