I'm parsing AWS VPC flow logs using the csv filter with the space character as the separator, because I thought this would be efficient. I want to convert certain numeric fields to integers and, ideally, the IP fields to the IP type.
This is not straightforward because the log format specifies that in some messages some numeric fields are null, and a null field contains a "-" character, as in these examples:
What I'm trying to work out is a conditional mutate that only evaluates true if the fields are numeric or do not contain the "-" character. I've had no luck following a conditional match with a mutate, as my Logstash 6.2.3 engine throws exceptions on that.
Next I tried making new fields with a conditional, thinking I would convert the derivative fields, like this:
With this statement the fields are created, but they contain the "-" value, as if the if statement is not evaluating as intended. (I wanted to stop the "-" fields being copied into the new field because they can never be converted to integers.)
Maybe I need to use a number regex instead. What is the best way to do this? The business logic would be:
If the field contains a number, copy it to a new field and convert it to an integer.
If the field contains a "-" character, do nothing.
If the field contains an IP address... I am at a loss on how to approach this one, as IPs need to be typed in the template, don't they? I am also unsure how to handle the minority of these fields that contain the "-" character in a template.
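For reference, the first two rules above can be sketched in plain Ruby, with a hash standing in for the Logstash event. This is only an illustration: the field names and the `_int` suffix are hypothetical (the real names depend on the `columns` configured in the csv filter), and it is not taken from any working pipeline.

```ruby
# Sketch of "copy numeric strings to a new integer field, skip '-'".
# The field names and the "_int" suffix are assumptions for illustration.
def add_integer_copies(record, numeric_fields = %w[srcport dstport protocol packets bytes])
  result = record.dup
  numeric_fields.each do |name|
    value = record[name]
    # Convert only strings made up entirely of digits; "-" never matches.
    next unless value.is_a?(String) && value =~ /\A\d+\z/
    result["#{name}_int"] = value.to_i
  end
  result
end

sample = { "srcport" => "443", "dstport" => "-", "bytes" => "1024" }
converted = add_integer_copies(sample)
```

Here `converted` gains `srcport_int` and `bytes_int` as integers, while `dstport` keeps its "-" string and gets no integer copy.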
Is copying to another field part of your actual processing logic or just something you need to handle the "-" values?
If it's the latter, and the only possible values of these fields (both numeric and IP) are either valid or null, meaning you don't expect to get, say, a string value in the numeric fields, the cleanest approach would be to:
Define their type on the Elasticsearch template side and let it handle the conversion
Delete any field that has a "-" so it ends up being null in Elasticsearch, like so:
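A minimal sketch of that cleanup, written as plain Ruby against a hash so it can be tried outside Logstash. It is not necessarily the exact snippet from this reply, and the field names are made up for the example.

```ruby
# Drop every field whose value is the "-" null marker.
# A plain Hash stands in for the Logstash event here (an assumption of this sketch).
def drop_null_fields(record)
  record.reject { |_name, value| value == "-" }
end

# Inside a Logstash ruby filter, the same idea would look roughly like this
# (untested sketch using the event API's to_hash and remove):
#   event.to_hash.each { |name, value| event.remove(name) if value == "-" }

cleaned = drop_null_fields("srcaddr" => "-", "srcport" => "443", "bytes" => "-")
```

After this, `cleaned` only holds `srcport`, so the type conversion can be left to the Elasticsearch template.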
Copying to another field was an attempt to filter out the "-" characters before mutating the number strings in these fields to integers. I don't have to create derivative fields. As for the contents, fields like port, protocol, byte count and packet count should always contain numeric strings when not null. The IP address fields should only contain IP address strings when not null.
I made somewhat of a typo in my earlier post. What I meant was that if you don't need field duplication, you can use the above code to clean up any null values.
And since you will receive either null or legitimate values, that along with a proper Elasticsearch template should cover you.
This approach seems to work well using the ruby filter example you gave with my test data; the populated fields get typed as integers and the hyphen fields are dropped. I was a little reluctant to nuke fields out of concern about how this would display records, but from what I can see nothing is being lost with respect to aggregations. I also just realized Kibana displays null fields with the hyphen character, so having fields with an actual hyphen character could get confusing. I'm not a big fan of space-delimited fields like this in general. Wouldn't it be simpler to use csv and make null fields null?
Elasticsearch handles missing fields gracefully, so it shouldn't be any concern. Any field present in the template but not present in the actual document is simply treated as missing (null).
And that approach is practically mandatory if you want to perform aggregations on numeric fields, since you cannot have multiple types defined for a specific field.
So all values should be either strings or integers or whatever, and since "-" cannot be mapped to an integer you would have to convert everything to strings, missing out on all numeric-specific aggregations in the process.
You could also convert null values to 0, but depending on the field semantics that would make little sense.
I'm not sure I follow you on this one. Can you tweak the way initial logs are generated or do you mean something else entirely?
The last comment is mostly a philosophical reflection on space-delimited log formats. This one is pretty simple though. A couple of years ago I had to parse a security product log that used space-delimited messages where some of the message fields themselves contained spaces.