Catching all lines between two lines (two lines included)

Houss · October 17, 2016, 5:52am

Hello,

I have not found a way to do this with multiline.
Basically what I want to do is take all lines between two specific lines, and make a single message out of it.

Example :

Line I don't want
Line I don't want
Line I don't want
--- BEGINNING OF BLOCK I WANT
Line of the block I want
Line of the block I want
Line of the block I want
Line of the block I want
Line of the block I want
--- END OF BLOCK I WANT
Line I don't want
Line I don't want
Line I don't want
Line I don't want

I tried to use multiline to catch every line that follows a line matching the ^--- [A-Z]+ pattern, but that obviously doesn't work: it ignores the end-of-block line, and then everything is a mess.

Any tips ?
Thanks.

Houss · October 18, 2016, 7:06am

I was thinking that I could maybe take everything after --- BEGINNING OF BLOCK I WANT and then include only the lines that are in the block (they always seem to follow the same pattern), but that would be very impractical, and I'm not even sure it would work, since include_lines would apply to the whole block (multiline is applied before include_lines).

Would greatly appreciate some guidance.

tudor · October 18, 2016, 8:39am

I'm afraid we don't have a good solution for this with the current multiline implementation. For a proper solution, we'd need start and stop patterns for multiline, which we have discussed before but never implemented. Opening a ticket with an enhancement request for this would make sense.

If you post a real portion of your logs, perhaps we can think of a workaround.
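
To make the start/stop idea concrete, here is a minimal Python sketch of what such semantics could look like, using the generic example from the first post. The function `collect_blocks` and its parameters are hypothetical names chosen for this illustration; this is not a Filebeat option or API.

```python
import re

def collect_blocks(lines, start_pattern, stop_pattern):
    """Join each run of lines from a start match through the next stop match.

    Hypothetical helper: lines outside a start/stop pair are dropped, and a
    block that never reaches its stop line is discarded.
    """
    start_rx = re.compile(start_pattern)
    stop_rx = re.compile(stop_pattern)
    blocks, current = [], None
    for line in lines:
        if current is None:
            if start_rx.search(line):
                current = [line]                    # open a block at the start line
        else:
            current.append(line)
            if stop_rx.search(line):
                blocks.append("\n".join(current))   # close it, stop line included
                current = None
    return blocks

sample = [
    "Line I don't want",
    "--- BEGINNING OF BLOCK I WANT",
    "Line of the block I want",
    "Line of the block I want",
    "--- END OF BLOCK I WANT",
    "Line I don't want",
]

blocks = collect_blocks(sample, r"^--- BEGINNING", r"^--- END")
print(blocks[0])
```

The single-pattern multiline setting cannot express the "stop" half of this, which is why the end line is the hard part.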

Houss · October 18, 2016, 9:31am

Thanks for your reply, I understand.

Here's a sample (I'm sorry it's in French). What I want is the block, or at least all of its lines except the last one:

04/10/2016 09:01:48 ANS1898I ***** 5 115 000 fichiers traités *****
04/10/2016 09:01:49 ANS1898I ***** 5 119 000 fichiers traités *****
04/10/2016 09:02:29 ANS1999E Le traitement de Incrémentale pour '/' est arrêté.
04/10/2016 09:02:29 --- DEBUT ETAT JOURNAL DES OPERATIONS PLANIFIEES
04/10/2016 09:02:29 Nombre total d'objets inspectés : 5 119 261
04/10/2016 09:02:29 Nombre total d'objets sauvegardés : 163
04/10/2016 09:02:29 Nombre total d'objets mis à jour : 0
04/10/2016 09:02:29 Nombre total d'objets reliés : 0
04/10/2016 09:02:29 Nombre total d'objets supprimés : 0
04/10/2016 09:02:29 Nombre total d'objets expirés : 0
04/10/2016 09:02:29 Nombre total d'objets en échec : 0
04/10/2016 09:02:29 Nombre total d'objets chiffrés : 0
04/10/2016 09:02:29 Le nombre total d'objets a augmenté : 0
04/10/2016 09:02:29 Nombre total de tentatives : 0
04/10/2016 09:02:29 Nombre total d'octets inspectés : 7,99 TB
04/10/2016 09:02:29 Nombre total d'octets transférés : 927,09 MB
04/10/2016 09:02:29 Durée de transfert des données : 101,22 sec
04/10/2016 09:02:29 Débit de transfert de données du réseau : 9 378,40 ko/s
04/10/2016 09:02:29 Débit de transfert d'un groupe de fichiers : 960,91 ko/s
04/10/2016 09:02:29 Taux de compression des objets : 0%
04/10/2016 09:02:29 Rapport de réduction des données total : 99,99%
04/10/2016 09:02:29 Temps de traitement écoulé : 00:16:27
04/10/2016 09:02:29 --- FIN ETAT JOURNAL DES OPERATIONS PLANIFIEES
04/10/2016 09:02:29 ANS4023E Erreur lors du traitement de '/' : erreur d'entrée/sortie sur le fichier

Thanks.

tudor · October 18, 2016, 9:37am

The workaround I'm thinking of is to have a prospector that:

Filters out all the lines starting with ANS
Filters out the FIN line
Groups together using multiline everything after the DEBUT line.

But this assumes you don't need the ANS lines. If you need those, you'll need a second Filebeat instance with a separate registry file to get them (and only them), because I think using multiple prospectors on the same file won't work well.
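
The logic of these three steps can be sketched in Python as follows. This only illustrates the idea: it assumes the filtering runs before the grouping, which is an assumption of the sketch and not a statement about Filebeat's processing order. The sample lines are a shortened, assumed rendering of the log excerpt above (one statistic per line), and `workaround` is a hypothetical name.

```python
import re

# Timestamp prefix as it appears in the sample (dd/mm/yyyy hh:mm:ss plus a space).
TS = r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} "

drop_rx = re.compile(TS + r"(?:ANS|--- FIN)")   # steps 1 and 2: lines to filter out
start_rx = re.compile(TS + r"--- DEBUT")        # step 3: the line that opens an event

def workaround(lines):
    events = []
    for line in lines:
        if drop_rx.search(line):
            continue                        # ANS and FIN lines are discarded
        if start_rx.search(line) or not events:
            events.append(line)             # DEBUT (or the very first line) starts an event
        else:
            events[-1] += "\n" + line       # everything else joins the current event
    return events

sample = [
    "04/10/2016 09:02:29 ANS1999E (message shortened)",
    "04/10/2016 09:02:29 --- DEBUT ETAT JOURNAL DES OPERATIONS PLANIFIEES",
    "04/10/2016 09:02:29 Nombre total de tentatives : 0",
    "04/10/2016 09:02:29 Taux de compression des objets : 0%",
    "04/10/2016 09:02:29 --- FIN ETAT JOURNAL DES OPERATIONS PLANIFIEES",
    "04/10/2016 09:02:29 ANS4023E (message shortened)",
]

events = workaround(sample)
print(events[0])
```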

Just an idea.

Houss · October 18, 2016, 9:42am

Thanks for the suggestion. I will try something like that, though I think it may not work, since multiline is applied before exclude_lines/include_lines (as stated in the documentation).

I think I will be able to make it work somehow anyway.

Regards.

steffens · October 18, 2016, 11:02am

There is not much structural difference between lines to be composed into multiline and other lines. This makes matching a little hard.

All fields seem to be some kind of numeric stats, which points to a regex pattern like ' : \d', matching a colon followed by a digit. This can be refined a little by including some keywords to reduce the chance of false positives, like: (total|transfert|objets|Temps de).+ : \d . The | operator means "or", which introduces some backtracking. Depending on the match setting you use, you can modify the regex to also include '^--- FIN' or '^--- DEBUT'. For example:

'(^--- FIN)|((total|transfert|objets|Temps de).+ : \d)'

It can be kind of tricky to write a regex pattern like this without creating false positives, and it might require some refinements, but it might be a start.
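
One way to find the needed refinements is to run a candidate pattern against lines that should and should not be grouped. The Python sketch below does that with a variant of the pattern above, adapted for the sketch: no ^ before --- FIN (the sample lines begin with a timestamp) and the French keyword objets. The sample lines assume one statistic per line with single spaces around the colon, and `check` is a hypothetical helper. Python's `re` stands in for Filebeat's regex engine here, so treat the result as indicative only.

```python
import re

# Variant of the suggested pattern, adapted for this sketch (see the text above).
PATTERN = re.compile(r"(--- FIN)|((total|transfert|objets|Temps de).+ : \d)")

def check(pattern, wanted, unwanted):
    """Return (misses, false_positives) for a candidate continuation pattern."""
    misses = [line for line in wanted if not pattern.search(line)]
    false_positives = [line for line in unwanted if pattern.search(line)]
    return misses, false_positives

wanted = [
    "04/10/2016 09:02:29 Nombre total de tentatives : 0",
    "04/10/2016 09:02:29 Taux de compression des objets : 0%",
    "04/10/2016 09:02:29 --- FIN ETAT JOURNAL DES OPERATIONS PLANIFIEES",
]
unwanted = [
    "04/10/2016 09:02:29 --- DEBUT ETAT JOURNAL DES OPERATIONS PLANIFIEES",
    "04/10/2016 09:02:29 ANS4023E Erreur lors du traitement de '/' : erreur d'entrée/sortie sur le fichier",
]

misses, false_positives = check(PATTERN, wanted, unwanted)
print("misses:", misses)
print("false positives:", false_positives)
```

With the single-space layout assumed here, the "Taux de compression des objets" line is reported as a miss: .+ needs at least one character between the keyword and " : ", and objets sits directly in front of it. If the real log pads the labels for alignment, that line may match after all.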

Feel free to create an enhancement request, as having a start/end-pattern like matcher would solve the problem much more robustly.

Houss · October 18, 2016, 5:44pm

I can totally see this working. I will post an update if it does.

Thanks a whole lot!

Houss · October 19, 2016, 5:44am

It did work, thanks again.
The lines in the block were always the same, so I just went ahead and wrote them completely in the regex, to be sure it won't catch anything else.

Then I used an include_lines option to throw away all lines that don't start with --- DEBUT ETAT JOURNAL

Have a nice day.

steffens · October 19, 2016, 11:38am

Wow, you must have a really big regex by now. As you're filtering out anything not starting with --- DEBUT, some false positives outside of your multiline block shouldn't be bad at all.

Regex engines often require O(n^2) time for string matching (the matcher searches for a substring), and trying all the alternatives with backtracking might be relatively expensive. I'd monitor Filebeat's CPU usage and check whether another regex would reduce resource usage a little. A pattern starting with '^' can result in a one-pass regex in some cases.

e.g. '^.{20} ((--- DEBUT)|((Nombre|Le nombre|Rapport).+ (total|transfert|objets|Temps de).+ : \d))'

Here I'm using ^.{20} to ignore the timestamp (exactly 20 characters), introducing some early stop words like (Nombre|Le nombre|Rapport), and adding some more 'evidence' for filtering out false positives: (total|transfert|objets|Temps de) and .+ : \d.

With () introducing capture groups, the matcher might still capture them and then throw the results away (this depends on the actual implementation). Using (?:) introduces a non-capturing group:

'^.{20} (?:(?:--- DEBUT)|(?:(?:Nombre|Le nombre|Rapport).+(?:total|transfert|objets|Temps de).+ : \d))'
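
The difference between the two group types can be seen with a small Python sketch. The patterns are shortened versions of the ones above (an assumption made to keep the example small), and Python's `re` is used only as a stand-in: Filebeat's engine may handle captures differently, as noted.

```python
import re

# Shortened illustrative patterns, not the full ones from this post.
capturing = re.compile(r"((--- DEBUT)|((Nombre|Rapport).+ : \d))")
non_capturing = re.compile(r"(?:(?:--- DEBUT)|(?:(?:Nombre|Rapport).+ : \d))")

line = "04/10/2016 09:02:29 Nombre total de tentatives : 0"

m_cap = capturing.search(line)
m_non = non_capturing.search(line)

print(capturing.groups, non_capturing.groups)   # number of capture groups in each pattern
print(m_cap.groups())                           # substrings the capturing version records
print(m_cap.span() == m_non.span())             # both match the same part of the line
```

Both versions accept the same lines; the non-capturing one just has nothing to record while matching.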

Here is a tool to analyze the generated regex: https://github.com/urso/anareg

Usage:

 ./anareg '^.{20} (?:(?:--- DEBUT)|(?:(?:Nombre|Le nombre|Rapport).+(?:total|transfert|objets|Temps de).+ : \d))' | dot -Tsvg > tst.svg

Remember to verify resource usage when changing patterns. If CPU usage doesn't really differ between patterns, use the most maintainable one.

Houss · October 19, 2016, 2:11pm

Wow, that's some amazing work there, thanks lots!

My regexes weren't actually that long (for either multiline or include_lines).

The multiline regex was something like this :

^[0-9]{2}\/[0-9]{2}\/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}: (--- FIN ETAT JOURNAL)|(Nombre total)|(Le nombre total)|(Dur.e de transfert)
and so on (I can't copy/paste it, I'm not at work at the moment).

This worked and gave me the correct block, starting with --- DEBUT ETAT JOURNAL and ending with FIN ETAT JOURNAL (note that I must add ETAT JOURNAL because there are other blocks that start with DEBUT and end with FIN).
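
One property of a pattern shaped like this: | has the lowest precedence, so the timestamp prefix binds only to the first alternative, and the other alternatives can match anywhere in a line. The Python sketch below shows the difference with a shortened pattern and a grouped variant. The timestamp here has no trailing colon, to match the posted sample, and the second test line is hypothetical.

```python
import re

TS = r"^[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2} "

# As written: the prefix applies to '--- FIN ETAT JOURNAL' only.
ungrouped = re.compile(TS + r"(--- FIN ETAT JOURNAL)|(Nombre total)")
# Grouped variant: the prefix applies to every alternative.
grouped = re.compile(TS + r"(?:--- FIN ETAT JOURNAL|Nombre total)")

stat_line = "04/10/2016 09:02:29 Nombre total de tentatives : 0"
# Hypothetical line: the phrase appears, but not right after the timestamp.
other_line = "04/10/2016 09:02:29 some other message mentioning Nombre total"

print(bool(ungrouped.search(stat_line)), bool(grouped.search(stat_line)))
print(bool(ungrouped.search(other_line)), bool(grouped.search(other_line)))
```

For the statistics lines the two behave the same, so this only matters if unrelated lines happen to contain one of the phrases.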

I then saw a lot of lines getting caught nevertheless, and those weren't following any pattern, so I just decided to use the include_lines option to trash them, like this:
include_lines: ['^[0-9]{2}\/[0-9]{2}\/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}: --- DEBUT ETAT JOURNAL']

I haven't seen the result of that include_lines yet since I need to wait for the server to generate some logs (it does every morning, so I will see tomorrow).

In any case, it's true that I didn't think about the CPU usage at all. Your regex is really neat: it does basically the same thing with far fewer characters, so I will go with something like that. The graph is also really helpful, greatly appreciated.

Thanks again! Will update when I get the final/optimized version in case someone else needs it.

Houss · October 24, 2016, 4:47am

Posting the pattern here in case it can help someone (who knows):

multiline pattern:
pattern: '^[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2} (--- FIN ETAT JOURNAL)|(Nombre total)|(Le nombre total)|(Dur.e de transfert)|(D.bit de transfert)|(Taux de compression)|(Rapport de r.duction)|(Temps de traitement)'
negate: false
match: after

include_lines pattern:
include_lines: [".+ DEBUT ETAT JOURNAL"]
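
For anyone who wants to see what these two settings do together, here is a rough Python simulation. It assumes the documented behaviour of negate: false with match: after (lines matching the pattern are appended to the preceding line that does not match) and that include_lines is applied to the joined event. The function names are hypothetical, the sample is a shortened rendering of the log above with one statistic per line, and Python's `re` stands in for Filebeat's regex engine.

```python
import re

MULTILINE = re.compile(
    r"^[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2} (--- FIN ETAT JOURNAL)"
    r"|(Nombre total)|(Le nombre total)|(Dur.e de transfert)|(D.bit de transfert)"
    r"|(Taux de compression)|(Rapport de r.duction)|(Temps de traitement)"
)
INCLUDE = re.compile(r".+ DEBUT ETAT JOURNAL")

def group_after(lines, pattern):
    """negate: false, match: after -> matching lines join the previous event."""
    events = []
    for line in lines:
        if pattern.search(line) and events:
            events[-1] += "\n" + line
        else:
            events.append(line)
    return events

def include_lines(events, pattern):
    """Keep only the events that match the include pattern."""
    return [event for event in events if pattern.search(event)]

sample = [
    "04/10/2016 09:01:49 ANS1898I ***** 5 119 000 fichiers traités *****",
    "04/10/2016 09:02:29 --- DEBUT ETAT JOURNAL DES OPERATIONS PLANIFIEES",
    "04/10/2016 09:02:29 Nombre total d'objets inspectés : 5 119 261",
    "04/10/2016 09:02:29 Durée de transfert des données : 101,22 sec",
    "04/10/2016 09:02:29 Taux de compression des objets : 0%",
    "04/10/2016 09:02:29 --- FIN ETAT JOURNAL DES OPERATIONS PLANIFIEES",
    "04/10/2016 09:02:29 ANS4023E Erreur lors du traitement de '/' : erreur d'entrée/sortie sur le fichier",
]

events = group_after(sample, MULTILINE)
kept = include_lines(events, INCLUDE)
print(kept[0])
```

The DEBUT line itself does not match the multiline pattern, so it opens a new event; the statistics and the FIN line are appended to it, and the ANS lines end up as separate events that include_lines then drops.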

Seeing that it works so well, I'm not gonna change it for now. If the CPU usage becomes a problem, I will change it.

Thanks again.
Cheers.

system · November 7, 2016, 5:52am

This topic was automatically closed after 21 days. New replies are no longer allowed.
