I would not expect a split of a field in [@metadata] to work due to this issue.
Instead of /%{DATA:dropFirstValue}/%{DATA:dropSecondValue}/ followed by removing the fields, you can just use /%{DATA}/%{DATA}/, which creates no fields under the default grok settings. Personally I would use /[^/]+/[^/]+
Thanks for the info. What other options do we have to extract the fields from a message such as 20190815175019|9599|REQUEST|14.56.55.120|anonymous|POST|/api/test/Lighter-test-group|HTTP/1.1|200|452
and check whether the 7th field starts with /api?
Actually the split works just fine. Your grok in the case where [resource] starts with /api assumes 4 parts to the path, but in that message it only has three. Perhaps
grok { match => { "resource" => "(/[^\/]+)?/[^\/]+/%{DATA:repo}/%{GREEDYDATA:resource_path}" } }
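To see how that pattern treats three-part and four-part resources, here is a minimal sketch in plain Ruby rather than Logstash. It assumes that DATA is the lazy `.*?` and GREEDYDATA is the greedy `.*`, as in the stock grok patterns, and that the pattern is unanchored. `SUGGESTED_RE` and `match_suggested` are names made up for this illustration.

```ruby
# Plain-Ruby stand-in for the grok pattern
#   (/[^\/]+)?/[^\/]+/%{DATA:repo}/%{GREEDYDATA:resource_path}
# Assumption: DATA is ".*?" and GREEDYDATA is ".*".
SUGGESTED_RE = %r{(?:/[^/]+)?/[^/]+/(?<repo>.*?)/(?<resource_path>.*)}

# Hypothetical helper: returns the named captures, or nil when nothing matches.
def match_suggested(resource)
  m = SUGGESTED_RE.match(resource)
  m && { "repo" => m[:repo], "resource_path" => m[:resource_path] }
end

# Four parts: the optional group consumes "/api", so repo is the third part.
p match_suggested("/api/test/Lighter-test-group/2.0")

# Three parts: the optional group has to be skipped for the match to succeed,
# so repo becomes the second part ("test").
p match_suggested("/api/test/Lighter-test-group")
```

The point of the sketch is that the optional leading group lets the pattern match a three-part resource, but only by shifting every capture one part to the left.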
Actually, the message format is variable, and the resource part can have anywhere from 0 to 10 parts.
My understanding of the filter was that it drops the first two parts no matter how many there are, so if the resource has 3 parts, the 3rd part would be parsed as repo and the 4th part would be an empty string.
Is it possible to add a string length check here, so that an empty resource_path is stored as an empty string rather than causing a grok parse failure? Or could we give the field a default value, or remove it?
You haven't really explained what the issue is. You gave an example, for which I gave you a possible solution, then said there are other possible inputs. That is not enough information to offer a solution to.
filter {
  # Create a copy of original message
  mutate {
    add_field => {
      "[@metadata][copyOfMessage]" => "%{[message]}"
    }
  }
  # split message
  mutate {
    split => {
      "[@metadata][copyOfMessage]" => "|"
    }
  }
  if [@metadata][copyOfMessage][6] =~ /^\/api\// {
    grok {
      break_on_match => false
      match => { "message" => "%{DATA:timestamp_local}\|%{NUMBER:duration}\|%{WORD:requesttype}\|%{IP:clientip}\|%{DATA:username}\|%{WORD:method}\|%{DATA:resource}\|%{DATA:protocol}\|%{NUMBER:statuscode}\|%{NUMBER:bytes}" }
    }
    grok {
      break_on_match => false
      match => { "resource" => "(/[^\/]+)?/[^\/]+/%{DATA:repo}/%{GREEDYDATA:resource_path}" }
    }
  }
  else if [@metadata][copyOfMessage][6] =~ /^\/list\// or [@metadata][copyOfMessage][6] =~ /^\/simple\// {
    grok {
      break_on_match => false
      match => { "message" => "%{DATA:timestamp_local}\|%{NUMBER:duration}\|%{WORD:requesttype}\|%{IP:clientip}\|%{DATA:username}\|%{WORD:method}\|%{DATA:resource}\|%{DATA:protocol}\|%{NUMBER:statuscode}\|%{NUMBER:bytes}" }
    }
    grok {
      break_on_match => false
      match => { "resource" => "/%{DATA:dropFirstValue}/%{DATA:repo}/%{GREEDYDATA:resource_path}" }
    }
    # Drop the placeholder capture used when the resource starts with "/list/" or "/simple/"
    mutate {
      remove_field => ["dropFirstValue"]
    }
  }
  else {
    grok {
      # Enable multiple matchers
      break_on_match => false
      match => { "message" => "%{DATA:timestamp_local}\|%{NUMBER:duration}\|%{WORD:requesttype}\|%{IP:clientip}\|%{DATA:username}\|%{WORD:method}\|%{DATA:resource}\|%{DATA:protocol}\|%{NUMBER:statuscode}\|%{NUMBER:bytes}" }
      # Extract repo and path
      match => { "resource" => "/%{DATA:repo}/%{GREEDYDATA:resource_path}"}
    }
  }
  # Extract resource name
  grok {
    break_on_match => false
    match => { "resource_path" => "(?<resource_name>[^/]+$)" }
  }
  # Extract file extension
  grok {
    break_on_match => false
    match => { "resource_path" => "(?<resource_type>[^.]+$)" }
  }
  # Parse date field
  date {
    timezone => "UTC"
    match => [ "timestamp_local" , "yyyyMMddHHmmss" ]
    target => "timestamp_object"
  }
  mutate {
    add_field => { "time" => "%{time}"}
  }
  ruby {
    code => "event.set('timestamp', event.get('timestamp_object').to_i * 1000);event.set('time',event.get('timestamp_object').to_i*1000000000 + rand(100000000))"
  }
}
From what I see, the value of repo is parsed differently for different messages.
For messages starting with /api/, the 3rd part has to be the repo, but
20190815175019|9599|REQUEST|14.56.55.120|anonymous|POST|/api/test/Lighter-test-group|HTTP/1.1|200|452 gets "test" as repo,
while 20190815175019|9599|REQUEST|14.56.55.120|anonymous|POST|/api/test/Lighter-test-group/2.0|HTTP/1.1|200|452 gets the right repo value, which is "Lighter-test-group".
If I don't use the regex provided and the resource has three parts, we get a grok parse failure, but anything with more than 3 parts works fine.
There is also no grok parse failure if the resource has a trailing "/",
i.e. after changing
/api/test/Lighter-test-group to /api/test/Lighter-test-group/
and
/list/Lighter-test-group to /list/Lighter-test-group/
From this I understand that an empty string works fine with DATA/GREEDYDATA, but is there a workaround for resources without the trailing "/"?
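The trailing-slash behaviour described above can be reproduced outside Logstash. The following is a rough sketch in plain Ruby, assuming DATA is `.*?` and GREEDYDATA is `.*` as in the stock grok patterns. The names `REQUIRED_TAIL`, `OPTIONAL_TAIL` and `captures` are invented for the illustration. It also shows a pitfall: simply making the tail optional in an unanchored pattern lets the lazy repo capture match an empty string.

```ruby
# Stand-in for "/%{DATA:dropFirstValue}/%{DATA:repo}/%{GREEDYDATA:resource_path}"
# (assumption: DATA is ".*?" and GREEDYDATA is ".*").
REQUIRED_TAIL = %r{/(?<drop>.*?)/(?<repo>.*?)/(?<resource_path>.*)}

# Same pattern with the final "/..." made optional, still unanchored.
OPTIONAL_TAIL = %r{/(?<drop>.*?)/(?<repo>.*?)(?:/(?<resource_path>.*))?}

# Hypothetical helper: hash of named captures, or nil when there is no match.
def captures(regex, resource)
  m = regex.match(resource)
  m && m.named_captures
end

# No third "/" in the input, so the pattern cannot match at all.
p captures(REQUIRED_TAIL, "/list/Lighter-test-group")

# With a trailing "/", the greedy tail matches the empty string.
p captures(REQUIRED_TAIL, "/list/Lighter-test-group/")

# Optional tail without anchoring: the lazy repo capture stops at "".
p captures(OPTIONAL_TAIL, "/list/Lighter-test-group")
```

So the empty tail is accepted, but the literal "/" in front of it is still required, and an optional tail only helps if the repo capture is forced to consume a whole path segment.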
For /api/test/Lighter-test-group/2.0, repo = Lighter-test-group
For /api/test/Lighter-test-group, repo = test
For /list/Lighter-test-group, repo = list
For /api/test/Lighter-test-group/2.0/1.2, repo = Lighter-test-group
For /list/Lighter-test-group/xyz/123, repo = Lighter-test-group
There are no grokparsefailures, but the messages are parsed inconsistently: for resources starting with /list or /api, the repo value depends on the number of parts. All of these messages need to be parsed as repo = Lighter-test-group.
Since we made resource_path optional, is it expected to throw an error if it does not exist? Is there any way to parse the repo and get rid of the error message?
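One possible direction, offered only as a sketch: anchor the pattern, match the repo as exactly one path segment with [^/]+, and make the remainder optional. The Ruby below is a stand-in for such a grok pattern and has not been run against Logstash. `RESOURCE_RE` and `parse_resource` are hypothetical names, and the prefixes /api/<one part>, /list and /simple are taken from the conditionals in the config above.

```ruby
# Roughly corresponding grok pattern (assumption, untested in Logstash):
#   ^/(?:api/[^/]+|list|simple)/(?<repo>[^/]+)(?:/%{GREEDYDATA:resource_path})?$
RESOURCE_RE = %r{
  \A/(?:api/[^/]+|list|simple)   # prefix to skip
  /(?<repo>[^/]+)                # repo is exactly one path segment
  (?:/(?<resource_path>.*))?     # optional remainder after the repo
  \z
}x

# Hypothetical helper: resource_path is nil when the resource ends at the repo.
def parse_resource(resource)
  m = RESOURCE_RE.match(resource)
  return nil unless m
  { "repo" => m[:repo], "resource_path" => m[:resource_path] }
end

[
  "/api/test/Lighter-test-group/2.0",
  "/api/test/Lighter-test-group",
  "/list/Lighter-test-group",
  "/api/test/Lighter-test-group/2.0/1.2",
  "/list/Lighter-test-group/xyz/123",
].each { |resource| p [resource, parse_resource(resource)] }
```

In this sketch every one of the five resources yields repo = Lighter-test-group. When the optional part does not take part in the match, resource_path stays empty, so the later grok filters that read resource_path would probably need a guard such as if [resource_path] { ... } to avoid tagging those events as failures.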