Filter and count an indeterminate number of unknown keywords


Although I saw some posts with similar questions, the answers (at least in the posts I saw) did not apply. So, first of all, sorry if I'm duplicating an issue/post.

I'm trying to filter and work on some logs. This log will use lines like the one below:

Apr 6 20:46:07 amavis[59327]: (59327-01) spam-tag, <> -> <>, No, score=2.998 required=6.6 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, HTML_MESSAGE=3, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-0.001, SPF_PASS=-0.001] autolearn=no autolearn_force=no

The part I need help with is this:

    tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, HTML_MESSAGE=3, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-0.001, SPF_PASS=-0.001]

The difficulties for me here are:

  1. The number of spamassassin filters (HTML_MESSAGE, RCVD_IN_DNSWL, etc.) changes from log line to log line.
  2. There is no way to know beforehand what the names of these spamassassin filters will be.
  3. There is no way to know beforehand what the score for each of these spamassassin filters will be ("3", "0.0001", etc.).

My final objective is to have two visualizations in Kibana: one showing how many times each filter has been used, and one showing the overall score each filter has contributed.

I don't know how to filter something like this, or what the best way to store it in Elasticsearch would be, either.

Could you please help me to sort this out?

Does this help?

    dissect { mapping => { "message" => "%{}tests=[%{[@metadata][tests]}]%{}" } }
    kv { source => "[@metadata][tests]" field_split => ", " trim_key => " " target => "[@metadata][filters]" }
    ruby {
        code => '
            event.get("[@metadata][filters]").each { |k, v|
                event.set(k, v.to_f)
            }
        '
    }
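To see what those two filters do, here is the same transformation sketched in plain Ruby, outside of Logstash (a loose sketch: the real kv filter treats `field_split` as a set of characters rather than a literal two-character string, but the result for this input is the same):

```ruby
# kv, sketched in plain Ruby: split the captured tests string into a hash.
tests = "DKIM_SIGNED=0.1, DKIM_VALID=-0.1, HTML_MESSAGE=3"

# Break the string into "NAME=score" pairs, then split each pair on "=".
filters = tests.split(", ").map { |pair| pair.split("=", 2) }.to_h

# filters now maps each name to its score string,
# e.g. filters["HTML_MESSAGE"] == "3"
```

The ruby filter then only has to walk that hash and convert each string score to a number.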

Hi Badger!

This is my first contact with the dissect, kv and ruby filters. I just read the documentation about them after I saw your comment.

It seems to me (please correct me if I'm wrong), that:
dissect: You are using it to extract the relevant part of the line and save it in this "[@metadata][tests]". The syntax you are using to do it is:
%{} -> Whatever is there, kind of a wildcard? (.*)
tests=[ -> A literal
[@metadata][tests] -> No idea what this is :smiley: Although it looks like the syntax used for logstash outputs. It seems to operate as a placeholder for processing purposes. With the metadata part you are telling it "no need to save this as a field", and with the [tests] you are just giving the metadata a name for later processing?
] -> A literal
%{} -> Same as before, a wildcard for whatever is there from the previous literal to the end.

I can't find this kind of "@metadata" in the documentation of the filter, I'm sorry. It would be great for me to learn to use this, so confirmation of my interpretation would be appreciated. Better still, could you point me to some documentation that helps me understand it?

kv: You are using it to process the previously obtained information, which you saved in this [@metadata][tests] thing :slight_smile:
source => "[@metadata][tests]" -> According to the documentation, it refers to the "field" to be used, so my guess is that I could use a field previously extracted by a grok filter, maybe? I ask because I already have a grok filter in place which extracts the "tests" part along with other things. So maybe I could use a "metadata" target with my grok too, and use that instead of the dissect? Does this make any sense? (This is the grok filter I'm using, btw:)

^amavis\[[\d]*\]: \([\d-]*\) [\w-]*, <%{EMAILADDRESS:sender}> -> <%{EMAILADDRESS:rcpt}>, (?<isspam>(Yes|No)), score=(?<spamscore>[\d\.]*) required=[\d\.]* tests=\[(?<tests>[^]]*)\] autolearn=(yes|no|disabled) autolearn_force=(yes|no|disabled)$
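If my understanding is right, maybe the grok could capture straight into metadata and feed the kv filter, something like this (just a guess on my side, untested):

    grok {
        match => { "message" => "tests=\[%{DATA:[@metadata][tests]}\]" }
    }
    kv { source => "[@metadata][tests]" field_split => ", " trim_key => " " target => "[@metadata][filters]" }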

field_split => ", " -> You are establishing comma + space as the delimiter.
trim_key -> Used here to remove whitespace... though I don't fully understand why it is necessary, as whitespace is part of the delimiter.

target => "[@metadata][filters]" -> The result of the previous operation will end up in another "metadata" thing named filters. But I don't see exactly what it would look like. I mean, will it take the form of several fields, each named after a filter (e.g. DKIM_SIGNED) with its score as the value (e.g. 0.1)?

ruby: You have me at hello with this one... I can't see what it does, sorry. It looks like you are joining the name of each filter with its value, but I don't know.

While reading the kv documentation, I saw a "recursive" option. Have you ever used it? It seems like something that could help here.

Thank you!

Yeah, that's a way of doing it using grok rather than dissect. Your reading of the dissect filter is correct.

Using a target which is a sub-field of [@metadata] allows you to attach data to the event without indexing that data with the event. It's useful for storing intermediate results. I'm not sure where that is documented.

If you use metadata => true on a rubydebug codec you will see

        "@metadata" => {
          "tests" => "DKIM_SIGNED=0.1, DKIM_VALID=-0.1, HTML_MESSAGE=3, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-0.001, SPF_PASS=-0.001",
        "filters" => {
            "RCVD_IN_DNSWL_NONE" => "-0.0001",
                      "SPF_PASS" => "-0.001",
                    "DKIM_VALID" => "-0.1",
                   "DKIM_SIGNED" => "0.1",
                  "HTML_MESSAGE" => "3",
             "RCVD_IN_MSPIKE_H2" => "-0.001"
        }
    }

I don't think the trim_key is needed if you use space in the field_split.

The ruby filter fetches [@metadata][filters] from the event. That is a hash, and the .each says to iterate over each key/value pair. A key/value pair would be something like "RCVD_IN_DNSWL_NONE" and "-0.0001". v.to_f converts the string "-0.0001" to a float, and event.set then adds it to the event, so that you end up with the following fields on the event:
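In plain Ruby, that iteration looks like this (using an ordinary hash to stand in for the Logstash Event API):

```ruby
# Stand-in for the event: a plain hash instead of the Logstash Event API.
event = {}
filters = { "RCVD_IN_DNSWL_NONE" => "-0.0001", "SPF_PASS" => "-0.001" }

# Same shape as the filter's code block: iterate the key/value pairs,
# converting each string score to a float before setting it on the event.
filters.each { |k, v| event[k] = v.to_f }

# event now holds floats, e.g. event["SPF_PASS"] == -0.001
```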

          "SPF_PASS" => -0.001,
        "DKIM_VALID" => -0.1,
       "DKIM_SIGNED" => 0.1,
      "HTML_MESSAGE" => 3.0,
"RCVD_IN_DNSWL_NONE" => -0.0001,
 "RCVD_IN_MSPIKE_H2" => -0.001

Thank you so much for your help!

I'll try to put the pieces together now :slight_smile:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.