Removing special characters from field name

if "Holding Dealer Organic ID"
{
        mutate { add_tag => ["holding_dealer"] }
        mutate { lowercase =>  ["Holding Dealer Organic ID"] }
}

lowercase is not working, nor is the rename function, but the tag gets added, so the 'if' matches. I believe it is because of special characters that are only visible via cat -e filename. This is what the field name looks like in the raw file:

M-oM-;M-?Holding Dealer Organic

I've tried using 'sed' to remove these characters, but it makes a mess of other fields. My question is: how can I remove these characters in Logstash so that I can actually make use of mutate functionality on this field?

Thanks

That's a byte order mark. I would do this in ruby

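ruby {
    code => '
        # Copy each top-level field to a cleaned-up name, then drop the original.
        event.to_hash.each { |k, v|
            # Straight removal of the BOM; assumes the name was decoded as UTF-8
            newk = k.gsub("\uFEFF", "")
            next if newk == k
            event.set(newk, v)
            event.remove(k)
        }
    '
}

In your case a straight gsub of the BOM to "", as above, might be enough. Alternatively, you might want to remove all control characters and characters above 128 (code available here), or you might want to be much stricter and go with something like gsub(/[^-_a-zA-Z0-9]/, ""), which keeps only letters, digits, hyphens and underscores (note that it also strips spaces).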
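For comparison, here is a rough plain-Ruby sketch of the three cleanup options mentioned above, applied to a sample field name. The method names are hypothetical and only for illustration, and the sketch assumes the name has already been decoded as UTF-8, so the BOM shows up as the single code point U+FEFF.

```ruby
# Hypothetical helpers, from most targeted to strictest.
BOM = "\uFEFF"

# 1. Remove only the byte order mark.
def strip_bom(name)
  name.gsub(BOM, "")
end

# 2. Remove control characters and everything outside printable ASCII.
def strip_non_printable_ascii(name)
  name.gsub(/[^\x20-\x7E]/, "")
end

# 3. Keep only letters, digits, hyphens and underscores (spaces go too).
def keep_safe_chars(name)
  name.gsub(/[^-_a-zA-Z0-9]/, "")
end

raw = "#{BOM}Holding Dealer Organic ID"
puts strip_bom(raw)
puts strip_non_printable_ascii(raw)
puts keep_safe_chars(raw)
```

Any of these could be dropped into the gsub position of the ruby filter; which one fits depends on how much of the original field name you want to preserve.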

Edited to add: if your events contain a byte order mark, then your input is probably not set to consume UTF-8. I would expect (but have not tested) that changing the encoding on the input would not only remove the BOM, but also ensure you get the right representation for any other obscure characters in the events (field values as well as names). Just in case someone sends some Simplified Chinese your way.
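To illustrate why the charset matters, here is a small plain-Ruby sketch (not Logstash code, and not a claim about what any particular input or codec does with the BOM). It only shows how the same three bytes come out depending on the charset they are decoded with; ISO-8859-1 is an assumed example of a wrong charset.

```ruby
# The three raw bytes of a UTF-8 byte order mark, with no encoding attached.
bom_bytes = "\xEF\xBB\xBF".b

# Decoded as UTF-8 they form a single code point, U+FEFF.
as_utf8 = bom_bytes.dup.force_encoding(Encoding::UTF_8)

# Decoded as ISO-8859-1 (assumed wrong charset) they become three
# separate visible characters.
as_latin1 = bom_bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

puts as_utf8.length    # number of characters when read as UTF-8
puts as_latin1.length  # number of characters when read as ISO-8859-1
puts as_latin1.inspect
```

Note that even with the right charset the U+FEFF code point is still part of the string, so a field name may still need an explicit cleanup step.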


I decided to just fix the source text. I appreciate the ruby code though.

sed -i '1 s/^.//' $input

Removes the BOM and life is grand once again. Found this buried on a Stack Overflow page.
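One caveat with the sed approach is that it removes the first character of line 1 whether or not it is a BOM. A minimal Ruby sketch of a more defensive variant follows; strip_bom_in_place is a hypothetical helper name, and the sketch assumes the file is small enough to read into memory.

```ruby
require "tempfile"

# Raw bytes of the UTF-8 byte order mark.
BOM_BYTES = "\xEF\xBB\xBF".b

# Rewrites the file without its leading BOM.
# Returns true if a BOM was removed, false if the file was left untouched.
def strip_bom_in_place(path)
  data = File.binread(path)
  return false unless data.start_with?(BOM_BYTES)
  File.binwrite(path, data.byteslice(BOM_BYTES.bytesize..-1))
  true
end

# Small demonstration on a throwaway file.
Tempfile.create("bom_demo") do |f|
  f.binmode
  f.write(BOM_BYTES + "Holding Dealer Organic ID,Other\n".b)
  f.flush
  puts strip_bom_in_place(f.path)   # BOM present on the first call
  puts strip_bom_in_place(f.path)   # nothing left to strip on the second
end
```

Because the check is done on raw bytes, running it twice on the same file is harmless.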

Just to add more possibilities: sometimes the regex \xEF\xBB\xBF (more info) can be useful for working with the BOM within Logstash or Filebeat configurations.


I had a simpler use case (no need to mess with field names) where the first line of a file had to be discarded. The UTF-8 codec didn't help to hide the BOM, which may be an issue of moving files between different systems (Windows / Linux). The line starts with the BOM followed by some static content.

That can be achieved in Logstash with a regex. I chose to enclose the pattern \xEF\xBB\xBF in a non-capturing group (?: ... ) whose presence is optional ? and which can be found just after the beginning ^ of the line:

# logstash: drop messages that start with BOM
if [message] =~ /^(?:\xEF\xBB\xBF)?contents_of_1st_line_that_must_be_excluded.*/ {
  drop { }
}

or filebeat configuration:

filebeat.prospectors:
- input_type: log
  paths:
    - path/to/files/*.csv
  exclude_lines: ['^(?:\xEF\xBB\xBF)?contents_of_1st_line_that_must_be_excluded.*', 'other patterns']
  encoding: utf-8
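
To check a pattern like this before putting it into a config, it can be tried out in plain Ruby first. This is only a sketch: the header text id,name is a made-up stand-in for contents_of_1st_line_that_must_be_excluded, drop_line? is a hypothetical helper, and Filebeat uses a different regex engine, so this mirrors the Logstash conditional rather than the exclude_lines setting.

```ruby
# Optional BOM right after the start of the line, then the static header text.
# "id,name" is a hypothetical header used only for this example.
HEADER_PATTERN = /^(?:\xEF\xBB\xBF)?id,name.*/

# True when the line would be dropped by the pattern.
def drop_line?(line)
  !!(line =~ HEADER_PATTERN)
end

puts drop_line?("\uFEFFid,name,value")  # header with a BOM in front
puts drop_line?("id,name,value")        # header without a BOM
puts drop_line?("1,foo,42")             # ordinary data line
```

Making the BOM group optional means the same rule keeps working if the file is later produced without a BOM.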