Catching all lines between two lines (two lines included)


#1

Hello,

I have not found a way to do this with multiline.
Basically, what I want to do is take all the lines between two specific lines (those two lines included) and make a single message out of them.

Example :

Line I don't want
Line I don't want
Line I don't want
--- BEGINNING OF BLOCK I WANT
Line of the block I want
Line of the block I want
Line of the block I want
Line of the block I want
Line of the block I want
--- END OF BLOCK I WANT
Line I don't want
Line I don't want
Line I don't want
Line I don't want

I tried to use multiline to catch every line that follows a line matching the ^--- [A-Z]+ pattern, but that obviously doesn't work: it ignores the end-of-block line, and then everything is a mess.

Any tips?
Thanks.


#2

I was thinking that I could maybe take everything after --- BEGINNING OF BLOCK I WANT and then include only the lines that are in the block (they always seem to follow the same pattern), but that would be very impractical, and I'm not even sure it would work, since include_lines would apply to the whole block (multiline is applied before include_lines).

Would greatly appreciate some guidance :)


(Tudor Golubenco) #3

I'm afraid we don't have a good solution for this with the current multiline implementation. For a proper solution, we'd need start and stop patterns for multiline, which we have discussed before but never implemented. Opening a ticket with an enhancement request for this would make sense.
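To make the idea concrete, here is a minimal Python sketch of what start/stop grouping would do to the example from the first post. It is only an illustration of the requested behaviour, not a Filebeat feature; the function name `group_between` and both patterns are made up for this example.

```python
import re

def group_between(lines, start_pattern, stop_pattern):
    """Yield one message per block, from a start line through its stop line.

    Lines outside a block are skipped. A block that never reaches a stop
    line is dropped (a real implementation would need a timeout or flush).
    """
    start_re = re.compile(start_pattern)
    stop_re = re.compile(stop_pattern)
    block = None  # None means we are currently outside a block
    for line in lines:
        if block is None:
            if start_re.search(line):
                block = [line]
        else:
            block.append(line)
            if stop_re.search(line):
                yield "\n".join(block)
                block = None

log = [
    "Line I don't want",
    "--- BEGINNING OF BLOCK I WANT",
    "Line of the block I want",
    "Line of the block I want",
    "--- END OF BLOCK I WANT",
    "Line I don't want",
]

messages = list(group_between(log, r"^--- BEGINNING", r"^--- END"))
print(messages[0])
```

The single message produced contains the start line, the two inner lines and the stop line, which is the behaviour a start/stop matcher would provide.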

If you post a real portion of your logs, perhaps we can think of a workaround.


#4

Thanks for your reply, I understand.

Here's a sample (I'm sorry it's in French). What I want is the whole block, or at least every line of it except the last one:

04/10/2016 09:01:48 ANS1898I ***** 5 115 000 fichiers traités *****
04/10/2016 09:01:49 ANS1898I ***** 5 119 000 fichiers traités *****
04/10/2016 09:02:29 ANS1999E Le traitement de Incrémentale pour '/' est arrêté.

04/10/2016 09:02:29 --- DEBUT ETAT JOURNAL DES OPERATIONS PLANIFIEES
04/10/2016 09:02:29 Nombre total d'objets inspectés : 5 119 261
04/10/2016 09:02:29 Nombre total d'objets sauvegardés : 163
04/10/2016 09:02:29 Nombre total d'objets mis à jour : 0
04/10/2016 09:02:29 Nombre total d'objets reliés : 0
04/10/2016 09:02:29 Nombre total d'objets supprimés : 0
04/10/2016 09:02:29 Nombre total d'objets expirés : 0
04/10/2016 09:02:29 Nombre total d'objets en échec : 0
04/10/2016 09:02:29 Nombre total d'objets chiffrés : 0
04/10/2016 09:02:29 Le nombre total d'objets a augmenté : 0
04/10/2016 09:02:29 Nombre total de tentatives : 0
04/10/2016 09:02:29 Nombre total d'octets inspectés : 7,99 TB
04/10/2016 09:02:29 Nombre total d'octets transférés : 927,09 MB
04/10/2016 09:02:29 Durée de transfert des données : 101,22 sec
04/10/2016 09:02:29 Débit de transfert de données du réseau : 9 378,40 ko/s
04/10/2016 09:02:29 Débit de transfert d'un groupe de fichiers : 960,91 ko/s
04/10/2016 09:02:29 Taux de compression des objets : 0%
04/10/2016 09:02:29 Rapport de réduction des données total : 99,99%
04/10/2016 09:02:29 Temps de traitement écoulé : 00:16:27
04/10/2016 09:02:29 --- FIN ETAT JOURNAL DES OPERATIONS PLANIFIEES
04/10/2016 09:02:29 ANS4023E Erreur lors du traitement de '/' : erreur d'entrée/sortie sur le fichier

Thanks.


(Tudor Golubenco) #5

The workaround I'm thinking of is to have a prospector that:

  • Filters out all the lines starting with ANS
  • Filters out the FIN line
  • Groups together using multiline everything after the DEBUT line.

But this assumes you don't need the ANS lines. If you need those, you'll need a second Filebeat instance with a separate registry file to get them (and only them), because I think using multiple prospectors on the same file won't work well.

Just an idea.


#6

Thanks for the suggestion. I will try something like that, though I think it may not work, since multiline is applied before exclude_lines/include_lines (as stated in the documentation).

I think I will be able to make it work somehow anyway.

Regards.


(Steffen Siering) #7

There is not much structural difference between lines to be composed into multiline and other lines. This makes matching a little hard.

All fields seem to be numeric stats, which points to a regex pattern like ' : \d', matching a colon followed by a single digit. This can be refined a little by including some keywords to reduce the chance of false positives, like: (total|transfert|objets|Temps de).+ : \d . The | operator means "or", which introduces some kind of backtracking. Depending on the match setting you use, you can modify the regex to also include '^--- FIN' or '^--- DEBUT'. For example:

'(^--- FIN)|((total|transfert|objets|Temps de).+ : \d)'

It can be kind of tricky to write a regex pattern like this without creating false-positives and might require some refinements, but it might be a start.
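A quick way to see which refinements are needed is to run the pattern over a few lines from the sample. The Python sketch below is only a rough check: Python's `re` is not the engine Filebeat uses, the dictionary keys are labels invented for this example, and the keyword is spelled "objets" as in the log.

```python
import re

# Keyword pattern from the post, with "objets" spelled as in the sample log.
pattern = re.compile(r"(^--- FIN)|((total|transfert|objets|Temps de).+ : \d)")

samples = {
    "stat": "04/10/2016 09:02:29 Nombre total d'objets inspectés : 5 119 261",
    "rate": "04/10/2016 09:02:29 Durée de transfert des données : 101,22 sec",
    "ratio": "04/10/2016 09:02:29 Taux de compression des objets : 0%",
    "end": "04/10/2016 09:02:29 --- FIN ETAT JOURNAL DES OPERATIONS PLANIFIEES",
    "noise": "04/10/2016 09:01:48 ANS1898I ***** 5 115 000 fichiers traités *****",
}

# search() looks for the pattern anywhere in the line (unanchored matching).
matched = {name: bool(pattern.search(line)) for name, line in samples.items()}

for name, hit in matched.items():
    print(f"{name:6} {'match' if hit else 'no match'}")
```

On these lines, "stat" and "rate" match and "noise" does not. Two lines from inside the block are missed: "ratio", because .+ needs at least one character between the keyword and " : ", and "end", because ^ anchors at the timestamp rather than at "--- FIN". Those are the kind of refinements the pattern still needs.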

Feel free to create an enhancement request, as having a start/end-pattern like matcher would solve the problem much more robustly.


#8

I can totally see this working. I will update if it does :)

Thanks a whole lot!


#9

It did work, thanks again :)
The lines in the block were always the same, so I just went ahead and wrote them completely in the regex, to be sure it won't catch anything else.

Then I used an include_lines option to throw away all lines that don't start with --- DEBUT ETAT JOURNAL

Have a nice day.


(Steffen Siering) #10

Wow, you must have a really big regex by now. Since you're filtering out anything not starting with --- DEBUT, some false positives outside of your multiline block shouldn't be bad at all.

Regex engines often require O(n^2) time for string matching (the matcher searches for a substring), and backtracking through all the patterns might be relatively expensive. I'd monitor Filebeat CPU usage and check whether another regex would reduce resource usage a little. Having a pattern start with '^' can create a one-pass regex in some cases.

e.g. '^.{19} ((--- DEBUT)|((Nombre|Le nombre|Rapport).+(total|transfert|objets|Temps de).+ : \d))'

Here I'm using ^.{19} to skip the timestamp (exactly 19 characters in the sample above), introducing some early stop words like (Nombre|Le nombre|Rapport), and adding some more 'evidence' for filtering out false positives: (total|transfert|objets|Temps de) and .+ : \d.

With () introducing capture groups, the matcher might still capture them and throw the results away (this depends on the actual implementation). Using (?:) introduces a non-capturing group:

'^.{19} (?:(?:--- DEBUT)|(?:(?:Nombre|Le nombre|Rapport).+(?:total|transfert|objets|Temps de).+ : \d))'

Here is a tool to analyze the generated regex: https://github.com/urso/anareg

Usage:

 ./anareg '^.{19} (?:(?:--- DEBUT)|(?:(?:Nombre|Le nombre|Rapport).+(?:total|transfert|objets|Temps de).+ : \d))' | dot -Tsvg > tst.svg

Remember to verify resource usage when changing patterns. If CPU usage doesn't really differ between patterns, use the most maintainable one.
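Before comparing CPU usage, it is worth confirming that the capturing and non-capturing variants accept the same lines. The Python sketch below does that on a few sample lines. It makes some assumptions: Python's `re` stands in for Filebeat's regex engine, the timestamp in the sample is 19 characters long so the prefix is written as ^.{19}, and the keyword is spelled "objets" as in the log.

```python
import re

# Same pattern twice: once with capture groups, once with (?:...) groups.
capturing = re.compile(
    r"^.{19} ((--- DEBUT)|((Nombre|Le nombre|Rapport).+"
    r"(total|transfert|objets|Temps de).+ : \d))"
)
non_capturing = re.compile(
    r"^.{19} (?:(?:--- DEBUT)|(?:(?:Nombre|Le nombre|Rapport).+"
    r"(?:total|transfert|objets|Temps de).+ : \d))"
)

lines = [
    "04/10/2016 09:02:29 --- DEBUT ETAT JOURNAL DES OPERATIONS PLANIFIEES",
    "04/10/2016 09:02:29 Nombre total d'objets inspectés : 5 119 261",
    "04/10/2016 09:02:29 Le nombre total d'objets a augmenté : 0",
    "04/10/2016 09:01:48 ANS1898I ***** 5 115 000 fichiers traités *****",
]

# One (capturing, non_capturing) pair of booleans per line.
results = [
    (bool(capturing.search(line)), bool(non_capturing.search(line)))
    for line in lines
]

for line, (a, b) in zip(lines, results):
    print(a, b, line[20:50])

# Number of capture groups each compiled pattern has to track.
print(capturing.groups, non_capturing.groups)
```

Both variants agree on every line here (the first three match, the ANS line does not). The only difference is bookkeeping: the first pattern tracks five capture groups, the second none.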


#11

Wow, that's some amazing work there, thanks a lot!

My regexes weren't actually very long (for either the multiline or the include_lines).

The multiline regex was something like this :

^[0-9]{2}\/[0-9]{2}\/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}: (--- FIN ETAT JOURNAL)|(Nombre total)|(Le nombre total)|(Dur.e de transfert) etc. (I can't copy/paste it, as I'm not at work at the moment).

This worked and gave me the correct block, starting with --- DEBUT ETAT JOURNAL and ending with FIN ETAT JOURNAL (note that I must add ETAT JOURNAL because there are other blocks that start with DEBUT and end with FIN).
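One thing worth knowing when adapting a pattern of this shape: | has the lowest precedence in a regex, so the ^ and the timestamp prefix apply only to the first alternative, and the other alternatives can match anywhere in a line. The Python sketch below shows the difference. It assumes the timestamp format of the sample log (no trailing colon), uses a shortened list of alternatives, and the "stray" line is invented for the example.

```python
import re

# Timestamp prefix as it appears in the sample log (assumption: no colon).
TS = r"^[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2} "

# Without an outer group, the prefix belongs to the first alternative only.
ungrouped = re.compile(
    TS + r"(--- FIN ETAT JOURNAL)|(Nombre total)|(Le nombre total)"
)
# With an outer non-capturing group, the prefix applies to every alternative.
grouped = re.compile(
    TS + r"(?:--- FIN ETAT JOURNAL|Nombre total|Le nombre total)"
)

stat = "04/10/2016 09:02:29 Nombre total de tentatives : 0"
stray = "some other line that mentions Nombre total in the middle"

print(bool(ungrouped.search(stat)), bool(grouped.search(stat)))
print(bool(ungrouped.search(stray)), bool(grouped.search(stray)))
```

Both patterns accept the real stat line, but only the ungrouped one also accepts the stray line. If every line of the block starts with the timestamp, the grouped form is the stricter choice.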

I then saw a lot of lines getting caught nevertheless, and those weren't following any pattern, so I just decided to use the include_lines option to trash them, like this:
include_lines: ['^[0-9]{2}\/[0-9]{2}\/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}: --- DEBUT ETAT JOURNAL']

I haven't seen the result of that include_lines yet since I need to wait for the server to generate some logs (it does every morning, so I will see tomorrow).

In any case, it's true that I didn't think about the CPU usage at all. Your regex is really neat: it does basically the same thing with far fewer characters, so I will go with something like that. The graph is also really helpful, greatly appreciated.

Thanks again! Will update when I get the final/optimized version in case someone else needs it.


#12

Posting the pattern here in case it can help someone (who knows):

multiline pattern:
pattern: '^[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2} (--- FIN ETAT JOURNAL)|(Nombre total)|(Le nombre total)|(Dur.e de transfert)|(D.bit de transfert)|(Taux de compression)|(Rapport de r.duction)|(Temps de traitement)'
negate: false
match: after

include_lines pattern:
include_lines: [".+ DEBUT ETAT JOURNAL"]
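For anyone who wants to try these two settings without waiting for the next log run, here is a rough Python model of them. It is a simplification based on the documented behaviour (with negate: false and match: after, lines matching the pattern are appended to the preceding line that does not match, and include_lines is applied to the combined event). It ignores Filebeat's timeout and max_lines settings, uses Python's `re` rather than Filebeat's regex engine, and the function names are made up for this example.

```python
import re

def multiline_after(lines, pattern):
    """Model of negate: false, match: after.

    A line that matches the pattern is appended to the previous event;
    any other line starts a new event.
    """
    regex = re.compile(pattern)
    events = []
    for line in lines:
        if events and regex.search(line):
            events[-1] += "\n" + line
        else:
            events.append(line)
    return events

def include_lines(events, patterns):
    """Keep only the events that match at least one pattern."""
    regexes = [re.compile(p) for p in patterns]
    return [e for e in events if any(r.search(e) for r in regexes)]

PATTERN = (
    r"^[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2} (--- FIN ETAT JOURNAL)"
    r"|(Nombre total)|(Le nombre total)|(Dur.e de transfert)|(D.bit de transfert)"
    r"|(Taux de compression)|(Rapport de r.duction)|(Temps de traitement)"
)

log = [
    "04/10/2016 09:02:29 ANS1999E Le traitement de Incrémentale pour '/' est arrêté.",
    "04/10/2016 09:02:29 --- DEBUT ETAT JOURNAL DES OPERATIONS PLANIFIEES",
    "04/10/2016 09:02:29 Nombre total d'objets inspectés : 5 119 261",
    "04/10/2016 09:02:29 Durée de transfert des données : 101,22 sec",
    "04/10/2016 09:02:29 Temps de traitement écoulé : 00:16:27",
    "04/10/2016 09:02:29 --- FIN ETAT JOURNAL DES OPERATIONS PLANIFIEES",
    "04/10/2016 09:02:29 ANS4023E Erreur lors du traitement de '/' : erreur d'entrée/sortie sur le fichier",
]

events = multiline_after(log, PATTERN)
kept = include_lines(events, [r".+ DEBUT ETAT JOURNAL"])
print(kept[0])
```

On this shortened sample the model produces three events (the two ANS lines and the block), and the include_lines filter keeps only the block, from the DEBUT line through the FIN line.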

Seeing that it works so well, I'm not going to change it for now. If the CPU usage becomes a problem, I will change it.

Thanks again.
Cheers.

