I have not found a way to do this with multiline.
Basically what I want to do is take all lines between two specific lines, and make a single message out of it.
Example:
Line I don't want
Line I don't want
Line I don't want
--- BEGINNING OF BLOCK I WANT
Line of the block I want
Line of the block I want
Line of the block I want
Line of the block I want
Line of the block I want
--- END OF BLOCK I WANT
Line I don't want
Line I don't want
Line I don't want
Line I don't want
I tried to use multiline to catch any line that follows a line matching the ^--- [A-Z]+ pattern, but that obviously doesn't work, since it ignores the end-of-block line, and then everything is a mess.
I was thinking that I could maybe take everything after --- BEGINNING OF BLOCK I WANT and then include only the lines that are in the block (they always seem to follow the same pattern), but that would be very impractical, and I'm not even sure it would work, since include_lines would apply to a whole block (multiline is applied before include_lines).
I'm afraid we don't have a good solution for this with the current multiline implementation. For a proper solution, we'd need start and stop patterns for multiline, which we discussed before but never implemented. Posting a ticket with an enhancement request for this would make sense.
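To make the idea concrete, here is a minimal sketch in plain Python of what a start/stop matcher would do with the example above. It is not a Filebeat feature or API; the function name `extract_blocks` and both patterns are hypothetical and only illustrate the behavior being asked for.

```python
import re


def extract_blocks(lines, start_pattern, stop_pattern):
    """Join each run of lines from a start line to the next stop line
    (both inclusive) into a single message. Lines outside a block are
    dropped, and so is a block that never reaches a stop line."""
    start = re.compile(start_pattern)
    stop = re.compile(stop_pattern)
    blocks = []
    current = None  # None means "not inside a block"
    for line in lines:
        if current is None:
            if start.search(line):
                current = [line]
        else:
            current.append(line)
            if stop.search(line):
                blocks.append("\n".join(current))
                current = None
    return blocks


lines = [
    "Line I don't want",
    "--- BEGINNING OF BLOCK I WANT",
    "Line of the block I want",
    "Line of the block I want",
    "--- END OF BLOCK I WANT",
    "Line I don't want",
]
blocks = extract_blocks(lines, r"^--- BEGINNING", r"^--- END")
print(blocks[0])
```

The single state variable is the whole point: a pattern that only says "this line continues the previous one" has no way to know where a block ends, which is why the current multiline options need the workaround discussed in this thread.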
If you post a real portion of your logs, perhaps we can think of a workaround.
Here's a sample (I'm sorry it's in French). What I want is the block, or at least all of its lines but the last one:
04/10/2016 09:01:48 ANS1898I ***** 5 115 000 fichiers traités *****
04/10/2016 09:01:49 ANS1898I ***** 5 119 000 fichiers traités *****
04/10/2016 09:02:29 ANS1999E Le traitement de Incrémentale pour '/' est arrêté.

04/10/2016 09:02:29 --- DEBUT ETAT JOURNAL DES OPERATIONS PLANIFIEES
04/10/2016 09:02:29 Nombre total d'objets inspectés : 5 119 261
04/10/2016 09:02:29 Nombre total d'objets sauvegardés : 163
04/10/2016 09:02:29 Nombre total d'objets mis à jour : 0
04/10/2016 09:02:29 Nombre total d'objets reliés : 0
04/10/2016 09:02:29 Nombre total d'objets supprimés : 0
04/10/2016 09:02:29 Nombre total d'objets expirés : 0
04/10/2016 09:02:29 Nombre total d'objets en échec : 0
04/10/2016 09:02:29 Nombre total d'objets chiffrés : 0
04/10/2016 09:02:29 Le nombre total d'objets a augmenté : 0
04/10/2016 09:02:29 Nombre total de tentatives : 0
04/10/2016 09:02:29 Nombre total d'octets inspectés : 7,99 TB
04/10/2016 09:02:29 Nombre total d'octets transférés : 927,09 MB
04/10/2016 09:02:29 Durée de transfert des données : 101,22 sec
04/10/2016 09:02:29 Débit de transfert de données du réseau : 9 378,40 ko/s
04/10/2016 09:02:29 Débit de transfert d'un groupe de fichiers : 960,91 ko/s
04/10/2016 09:02:29 Taux de compression des objets : 0%
04/10/2016 09:02:29 Rapport de réduction des données total : 99,99%
04/10/2016 09:02:29 Temps de traitement écoulé : 00:16:27
04/10/2016 09:02:29 --- FIN ETAT JOURNAL DES OPERATIONS PLANIFIEES
04/10/2016 09:02:29 ANS4023E Erreur lors du traitement de '/' : erreur d'entrée/sortie sur le fichier
The workaround I'm thinking of is to have a prospector that:
Filters out all the lines starting with ANS
Filters out the FIN line
Groups together using multiline everything after the DEBUT line.
But this assumes you don't need the ANS lines. If you need those, you'll need a second Filebeat instance with a separate registry file to get them (and only them), because I think using multiple prospectors on the same file won't work well.
Thanks for the suggestion. I will try something like that, though I think it may not work, since multiline is applied before exclude_lines/include_lines (as stated in the documentation).
I think I will be able to make it work somehow anyway.
There is not much structural difference between lines to be composed into multiline and other lines. This makes matching a little hard.
All fields seem to be some kind of numeric stats, which points to a regex pattern like ' : \d', matching a colon followed by a single digit. This can be refined a little by including some keywords to reduce the chance of false positives, like: (total|transfert|objets|Temps de).+ : \d . The | operator means 'or', which introduces some backtracking. Depending on the match setting you use, you can modify the regex to also include '^--- FIN' or '^--- DEBUT'.
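A quick way to try a keyword pattern like this before putting it in a config is to run it over a few sample lines. The sketch below uses Python's `re` purely for illustration (Filebeat uses Go's regexp package, so behavior should be re-checked there), writes the keyword with the French spelling `objets` as it appears in the log, and takes its sample lines from the log excerpt posted earlier in the thread.

```python
import re

# Keyword pattern from the post: one of the keywords, then at least one
# character, then " : " followed by a digit.
STAT_LINE = re.compile(r"(total|transfert|objets|Temps de).+ : \d")

sample = [
    "04/10/2016 09:02:29 ANS1999E Le traitement de Incrémentale pour '/' est arrêté.",
    "04/10/2016 09:02:29 Nombre total d'objets inspectés : 5 119 261",
    "04/10/2016 09:02:29 Durée de transfert des données : 101,22 sec",
    "04/10/2016 09:02:29 Temps de traitement écoulé : 00:16:27",
    "04/10/2016 09:02:29 Taux de compression des objets : 0%",
    "04/10/2016 09:02:29 ANS4023E Erreur lors du traitement de '/' : erreur d'entrée/sortie sur le fichier",
]

matched = [line for line in sample if STAT_LINE.search(line)]
missed = [line for line in sample if not STAT_LINE.search(line)]

for line in missed:
    print("not matched:", line)
```

Both ANS lines are rejected, as intended, but so is the "Taux de compression des objets" line: the keyword is immediately followed by " : ", and `.+` requires at least one character in between. Replacing `.+` with `.*` is one possible refinement for that case.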
It can be kind of tricky to write a regex pattern like this without creating false-positives and might require some refinements, but it might be a start.
Feel free to create an enhancement request, as having a start/end-pattern like matcher would solve the problem much more robustly.
It did work, thanks again
The lines in the block were always the same, so I just went ahead and wrote them completely in the regex, to be sure it won't catch anything else.
Then I used an include_lines option to throw away all lines that don't start with --- DEBUT ETAT JOURNAL
Wow, you must have a really big regex by now. As you're filtering out anything that doesn't start with --- DEBUT, some false positives outside of your multiline block shouldn't be a problem at all.
Regex engines often need O(n^2) time for string matching (the matcher searches for a substring), and backtracking through all the alternatives can be relatively expensive. I'd monitor Filebeat's CPU usage and check whether another regex would reduce resource usage a little. Having a pattern that starts with '^' can produce a one-pass regex in some cases.
E.g. '^.{20} ((--- DEBUT)|((Nombre|Le nombre|Rapport).+ (total|transfert|objets|Temps de).+ : \d))'
Here I'm using ^.{20} to ignore the timestamp (exactly 20 characters), introducing some early stop words like (Nombre|Le nombre|Rapport), and adding some more 'evidence' for filtering out false positives: (total|transfert|objets|Temps de) and .+ : \d.
With () introducing capture groups, the matcher might still capture them and throw the results away (this depends on the actual implementation). Using (?:) introduces a non-capturing group instead.
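As a sketch of the difference, the snippet below compares a capturing and a non-capturing spelling of such a pattern in Python (for illustration only; Filebeat uses Go's regexp package). The patterns are adapted, not the exact ones quoted in this thread, under two assumptions: in the sample as pasted, the 19-character timestamp plus one space make up the 20 skipped characters, so no extra space follows `^.{20}`; and `.*` replaces `.+` so that a keyword may directly follow the stop word.

```python
import re

# Adapted patterns (assumptions: 20 skipped characters include the separator
# space; `.*` is used instead of `.+`).
CAPTURING = re.compile(
    r"^.{20}((--- DEBUT)|((Nombre|Le nombre|Rapport).* "
    r"(total|transfert|objets|Temps de).* : \d))"
)
NON_CAPTURING = re.compile(
    r"^.{20}(?:--- DEBUT|(?:Nombre|Le nombre|Rapport).* "
    r"(?:total|transfert|objets|Temps de).* : \d)"
)

sample = [
    "04/10/2016 09:02:29 ANS1999E Le traitement de Incrémentale pour '/' est arrêté.",
    "04/10/2016 09:02:29 --- DEBUT ETAT JOURNAL DES OPERATIONS PLANIFIEES",
    "04/10/2016 09:02:29 Nombre total d'objets inspectés : 5 119 261",
    "04/10/2016 09:02:29 Le nombre total d'objets a augmenté : 0",
    "04/10/2016 09:02:29 Rapport de réduction des données total : 99,99%",
    "04/10/2016 09:02:29 Durée de transfert des données : 101,22 sec",
]

with_groups = [bool(CAPTURING.search(line)) for line in sample]
without_groups = [bool(NON_CAPTURING.search(line)) for line in sample]

# Both spellings accept the same lines; only the number of groups differs.
print(CAPTURING.groups, NON_CAPTURING.groups)
```

Note that the "Durée de transfert" line is not matched by either variant, because "Durée" is not among the stop words; a pattern for the whole block would need those added (for example Dur.e, D.bit, Taux, Temps).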
Remember to verify resource usage when changing patterns. If CPU usage doesn't really differ between patterns, use the most maintainable one.
My regexes weren't actually that long (for either the multiline or the include_lines).
The multiline regex was something like this:
^[0-9]{2}\/[0-9]{2}\/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}: (--- FIN ETAT JOURNAL)|(Nombre total)|(Le nombre total)|(Dur.e de transfert) etc. (I can't copy/paste it, as I'm not at work at the moment.)
This worked and gave me the correct block, starting with --- DEBUT ETAT JOURNAL and ending with FIN ETAT JOURNAL (note that I must include ETAT JOURNAL because there are other blocks that start with DEBUT and end with FIN).
I then saw a lot of lines getting caught nevertheless, and those weren't following any pattern, so I just decided to use the include_lines option to discard them, like this:
include_lines: ['^[0-9]{2}\/[0-9]{2}\/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}: --- DEBUT ETAT JOURNAL']
I haven't seen the result of that include_lines yet since I need to wait for the server to generate some logs (it does every morning, so I will see tomorrow).
In any case, it's true that I didn't think about the CPU usage at all. Your regex is really neat: it does basically the same thing with far fewer characters, so I will go with something like that. The graph is also really helpful, greatly appreciated.
Thanks again! Will update when I get the final/optimized version in case someone else needs it.
Posting the pattern here in case it can help someone (who knows):
multiline pattern:
pattern: '^[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2} (--- FIN ETAT JOURNAL)|(Nombre total)|(Le nombre total)|(Dur.e de transfert)|(D.bit de transfert)|(Taux de compression)|(Rapport de r.duction)|(Temps de traitement)'
negate: false
match: after
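For anyone who wants to check a pattern like this before deploying it, here is a rough Python simulation of how negate: false with match: after groups lines, followed by an include_lines-style filter. It only approximates Filebeat's behavior: Filebeat uses Go's regexp package, `group_events` is a hypothetical helper, and the include pattern here is an assumed variant without the colon after the seconds, so that it matches the sample as pasted.

```python
import re

# Multiline pattern exactly as posted above.
MULTILINE_PATTERN = re.compile(
    r"^[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2} (--- FIN ETAT JOURNAL)"
    r"|(Nombre total)|(Le nombre total)|(Dur.e de transfert)|(D.bit de transfert)"
    r"|(Taux de compression)|(Rapport de r.duction)|(Temps de traitement)"
)
# Assumed include_lines pattern (no colon after the seconds).
INCLUDE_PATTERN = re.compile(
    r"^[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2} --- DEBUT ETAT JOURNAL"
)


def group_events(lines):
    """Mimic negate: false / match: after: a line that matches the pattern
    is appended to the event before it; any other line starts a new event."""
    events = []
    for line in lines:
        if events and MULTILINE_PATTERN.search(line):
            events[-1] += "\n" + line
        else:
            events.append(line)
    return events


log = [
    "04/10/2016 09:02:29 ANS1999E Le traitement de Incrémentale pour '/' est arrêté.",
    "04/10/2016 09:02:29 --- DEBUT ETAT JOURNAL DES OPERATIONS PLANIFIEES",
    "04/10/2016 09:02:29 Nombre total d'objets inspectés : 5 119 261",
    "04/10/2016 09:02:29 Durée de transfert des données : 101,22 sec",
    "04/10/2016 09:02:29 Temps de traitement écoulé : 00:16:27",
    "04/10/2016 09:02:29 --- FIN ETAT JOURNAL DES OPERATIONS PLANIFIEES",
    "04/10/2016 09:02:29 ANS4023E Erreur lors du traitement de '/' : erreur d'entrée/sortie sur le fichier",
]

events = group_events(log)
# The filter runs on whole events, i.e. after the multiline grouping.
kept = [event for event in events if INCLUDE_PATTERN.search(event)]
print(kept[0])
```

One thing to be aware of: because | has the lowest precedence, the leading ^ and the timestamp apply only to the --- FIN ETAT JOURNAL alternative, and the other alternatives can match anywhere in a line. Wrapping all alternatives in a single group after the timestamp, e.g. (?:--- FIN ETAT JOURNAL|Nombre total|...), would anchor every one of them.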