Grok Regex - Captured and Non-Captured Groups


(Brandon Hatch) #1

I have some data coming into Logstash that has an email address. What I want to do is create a new column with just the domain. This will allow us to better look for trends and issues by domain.

None of the existing Grok patterns seem to be of use, so I created our own. The problem is that I can't get a clean domain name. I can get "@gmail.com", but I can't get just "gmail.com".
A really simple regex pattern is: @(...). In other languages I can specify which captured group to return, so I can ignore the @ character and return only what is inside the parentheses. However, in searching I haven't found a way to do that in Grok; it wants to return everything. I have also tried using a non-capturing group, (?:@)(...), but it returns the same results.

Has anyone encountered this before, or does anyone know of a better way to write the regex so that it isn't needed?

Here is the conf file.
input {
stdin{
codec => json_lines
}
}
filter {
grok{
match => ["Email","(?<domain>(?:@)(...))"]
}
}
output {
file {
path => "/etc/logstash/results.txt"
}
}

I pass the following into Logstash:
{"Email":"brandon@gmail.com"}

The output file shows:
{"Email":"brandon@gmail.com","@version":"1","@timestamp":"2016-02-10T18:09:04.043Z","host":"brandon-VirtualBox","domain":"@gmail.com"}

So what is the best way to have it exclude the @ character?


(Magnus Bäck) #2

You didn't format your configuration as code so it's impossible to tell exactly what your configuration looks like, but

@%{GREEDYDATA:domain}

will work if the email address is the only content of the field.
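If you'd rather keep a plain regex than use GREEDYDATA, an inline named capture should do the same job. Grok only stores what is inside a named group, so the trick is to leave the literal @ outside it. This is a rough sketch rather than something tested against your setup: it assumes the field is called Email as in your post, and the character class [^@\s]+ is my own choice so the match stops at whitespace if the field ever holds more than the address.

```
filter {
  grok {
    # The "@" is matched but sits outside the named group, so it is not stored.
    # Only the text captured by (?<domain>...) ends up in the "domain" field.
    match => { "Email" => "@(?<domain>[^@\s]+)" }
  }
}
```

With your sample event I would expect the domain field to come out without the leading @, but do verify it on your own data.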


(Brandon Hatch) #3

The formatting seems to have been lost when I copied everything over. I just tried your pattern and I think it will work for us. The email is the only data in the column, so we don't have to worry about where it ends. Thank you for your help.

Here is what I currently have it set to:

input {
        stdin{
                codec => json_lines
        }
}
filter {
        grok{
                match => {"Email" => "@%{GREEDYDATA:domain}"}
        }
}
output {
        file {
                path => "/etc/logstash/results.txt"
        }
}

The results were:
{"Email":"brandon@gmail.com","@version":"1","@timestamp":"2016-02-10T20:04:00.911Z","host":"brandon-VirtualBox","domain":"gmail.com"}

