Filtering fields from Elasticsearch output

#1

I'm working on a project which involves processing a large number of files which follow almost the same format (several billion records).

I have a Grok filter working to extract the details I'm interested in. The tricky part is that I want to write the lines the Grok filter failed to parse out to a file in a separate location, named failed-[original file name].

I have all of that working fine, but the problem I've run into is that the Elasticsearch output plugin is emitting the "filename" field I use to name the target file, and I don't want or need that data (or the overhead it presents) in my index.

Is there a way to stop a field I'm generating with a Grok pattern from being emitted by the Elasticsearch output, whilst still leaving it available to the file output plugin?

I did try setting my ES mapping to "dynamic": "strict", but rather than just indexing the fields I have in the mapping, I get an exception thrown because the filename field is not part of the mapping.

The pipeline config is as follows:

input {
    file {
<snip>
    }
}

filter {
	# ignore empty lines
    if [message] =~ /^\s*$/ {
        drop{}
    }
	# match against custom pattern
    grok {
        patterns_dir => ["/etc/logstash/patterns"]
        patterns_files_glob => "*"
        match => { "message" => "%{CUSTOMPATTERN}" }
    }
	# get original filename
    grok {
        match => { "path" => "%{GREEDYDATA}/%{GREEDYDATA:filename}" }
    }
	# get batch name
    grok {
        match => { "path" => "%{GREEDYDATA}/%{GREEDYDATA:batch_name}/%{GREEDYDATA}\.%{GREEDYDATA}" }
    }
	# generate fingerprint for id
    fingerprint {
        key => "XXXXXXX"
        method => "SHA256"
        source => ["fielda", "fieldb"]
        target => "[@metadata][generated_id]"
    }
	# strip extraneous fields
    mutate {
        remove_field => ["@version", "@timestamp", "path", "message", "host"]
    }
	# if we're in the watch directory we don't have a batch name
    if [batch_name] == "watch" {
        mutate {
            remove_field => ["batch_name"]
        }
    }
}

output {
    if [fielda] and [fieldb] {
        elasticsearch {
            index => "indexa"
            hosts => ["XXXX:9200"]
            document_id => "%{[@metadata][generated_id]}"
            # deprecated but LS complains if we don't have it
            document_type => "customtype"
        }
    } else {
        if [batch_name] {
            file {
                path => "/opt/ingestion/failed/%{batch_name}/failed-%{filename}"
                dir_mode => 0775
                file_mode => 0664
                codec => line { format => "%{message}" }
            }
        } else {
            file {
                path => "/opt/ingestion/failed/failed-%{filename}"
                dir_mode => 0775
                file_mode => 0664
                codec => line { format => "%{message}" }
            }
        }
    }
}

#2

And of course, after having posted the question I immediately came upon the solution.

The @metadata field

In Logstash 1.5 and later, there is a special field called @metadata. The contents of @metadata will not be part of any of your events at output time, which makes it great to use for conditionals, or extending and building event fields with field reference and sprintf formatting.

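So if we capture into [@metadata][filename] rather than filename, we get the desired result: the file output can still reference the value in its path, but it is never sent to Elasticsearch.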
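For anyone finding this later, here is a minimal sketch of what that change could look like. It is not my exact final config: it only shows the parts that touch the file name, reuses the placeholder host and field names from the first post, and assumes the message field is still on the event when it reaches the file output.

```
filter {
    # capture the original filename into @metadata so it is never indexed
    grok {
        match => { "path" => "%{GREEDYDATA}/%{GREEDYDATA:[@metadata][filename]}" }
    }
}

output {
    if [fielda] and [fieldb] {
        # @metadata is not serialised, so the filename does not reach the index
        elasticsearch {
            index => "indexa"
            hosts => ["XXXX:9200"]
            document_id => "%{[@metadata][generated_id]}"
        }
    } else {
        # sprintf references can still read @metadata when building the path
        file {
            path => "/opt/ingestion/failed/failed-%{[@metadata][filename]}"
            codec => line { format => "%{message}" }
        }
    }
}
```

To check what ends up in @metadata while testing, a temporary stdout { codec => rubydebug { metadata => true } } output should print it alongside the rest of the event.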
