How to filter data with Logstash before storing parsed data in Elasticsearch

I understand that Logstash is for aggregating and processing logs. I have NGINX logs and have my Logstash config set up as:

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}+%{GREEDYDATA:extra_fields}" }
    overwrite => [ "message" ]
  }
  mutate {
    convert => {
      "response"     => "integer"
      "bytes"        => "integer"
      "responsetime" => "float"
    }
  }
  geoip {
    source => "clientip"
    target => "geoip"
    add_tag => [ "nginx-geoip" ]
  }
  date {
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    remove_field => [ "timestamp" ]
  }
  useragent {
    source => "agent"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "weblogs-%{+YYYY.MM}"
    document_type => "nginx_logs"
  }
  stdout { codec => rubydebug }
}

This parses the unstructured logs into structured data and stores it in monthly indices.

What I discovered is that the majority of the logs come from robots/web crawlers. In Python I would filter them out with:

browser_names = browser_names[~browser_names.str.match(
    r'^[\w\W]*(google|bot|spider|crawl|headless)[\w\W]*', na=False)]

However, I would like to filter them out with Logstash so I can save a lot of disk space on the Elasticsearch server. Is there a way to do that? Thanks in advance!

If there is a browser_names field on the event, then something like this should work:

if [browser_names] =~ /^[\w\W]*(google|bot|spider|crawl|headless)[\w\W]*/ {
  drop {}
}
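
For completeness, here is a sketch of where that drop could sit in the existing filter block. It assumes the useragent filter writes the parsed browser name to a top-level name field (its default when no target is set); matching on the raw user-agent string that %{COMBINEDAPACHELOG} captures as agent would also work. Adjust the field reference to whatever your pipeline actually produces.

filter {
  # ... grok / mutate / geoip / date filters as above ...

  useragent {
    source => "agent"
  }

  # Drop robot/crawler traffic before it reaches the elasticsearch output.
  # "name" is assumed to hold the parsed browser name from the useragent
  # filter; change the field reference if yours writes it elsewhere.
  if [name] =~ /google|bot|spider|crawl|headless/ {
    drop {}
  }
}

Note that filters run in the order they appear, so the conditional has to come after the useragent filter; you may also want a case-insensitive match (e.g. /(?i)google|bot|.../), since crawler user-agent strings vary in capitalization.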

Thanks so much! It works like a charm.
