Logstash : Multi Pipelines Performances

Oxyds · May 17, 2019, 11:36am

Hi, I am using Logstash to parse IIS logs and to ingest system metrics from my servers: Filebeat sends the logs and Metricbeat sends the metrics. However, to allow the analysis of new data sources in the future, I decided to create multiple pipelines with pipeline-to-pipeline communication, following the distributor pattern, with this configuration (pipelines.yml):

- pipeline.id: main
  path.config: "/etc/logstash/main_pipeline.conf"
  pipeline.workers: 1

- pipeline.id: filebeat
  path.config: "/etc/logstash/filebeat_pipeline.conf"
  pipeline.workers: 6

- pipeline.id: metricbeat
  path.config: "/etc/logstash/metricbeat_pipeline.conf"
  pipeline.workers: 1

My problem is that throughput dropped from 10 000 doc/sec with one big pipeline to about 500 doc/sec with multiple pipelines, even when I change the number of workers allocated to each pipeline.

I don't know what I'm doing wrong, and I can't find any explanation or leads to investigate.

Can someone please help me?

The configuration of the single big pipeline is the following:

- pipeline.id: All
  path.config: "/etc/logstash/AllInOne_pipeline.conf"
  pipeline.workers: 8

For both configurations, I'm running Logstash with a 2GB heap size.

Here is the configuration of each pipeline:

Main :

input {
    beats {
        port => 5044
    }
}
filter {
    # Save some fields, remove the other ones, then restore the saved ones (no need to change the mapping)
    mutate {
        add_field => {
            "[aux][id]"       => "%{[agent][id]}"
            "[aux][type]"     => "%{[agent][type]}"
            "[aux][hostname]" => "%{[agent][hostname]}"
        }
    }
    prune {
        blacklist_names => ["^ecs", "^host", "^metricset", "^agent","^event"]
    }
    mutate { 
        add_field => {
            "[agent][id]"       => "%{[aux][id]}"
            "[agent][type]"     => "%{[aux][type]}"
            "[agent][hostname]" => "%{[aux][hostname]}"
        }
    }
    prune {
        blacklist_names => ["^aux"]
    }
}
output {
    if [agent][type] == "metricbeat" {
        pipeline {
            send_to => metricbeat
        }
    }
    else if [agent][type] == "filebeat" {
        pipeline {
            send_to => filebeat
        }
    }
}

Filebeat :

input {
    pipeline {
        address => filebeat
    }
}
filter {
    if [message] =~ "^#" {
        drop { }
    }
    grok {
        match => { "message" => *grok expression working fine* }
        remove_field => ["message"]
    }
    date {
        match => [ "Timestamp", "yyyy-MM-dd HH:mm:ss"]
        timezone => "UTC"
    }
}
output {
    elasticsearch {
            hosts => ["xx.xx.xx.xx:9200"]
            index => "iis-%{+yyyy.MM.dd}"
            template_name => "iis"
    }
}

Metricbeat :

input {
        pipeline {
                address => metricbeat
        }
}
filter {
        mutate {
                add_field => {
                        "[aux1]" => "%{[event][dataset]}"
                        "[aux2]" => "%{[event][module]}"
                }
        }
        prune {
                blacklist_names => ["^event"]
        }
        mutate {
                add_field => {
                        "[event][dataset]" => "%{[aux1]}"
                        "[event][module]"  => "%{[aux2]}"
                }
        }
        prune {
                blacklist_names => ["^aux"]
        }
}
output {
        elasticsearch {
                hosts => ["xx.xx.xx.xx:9200"]
                index => "metricbeat-%{+xxxx.ww}"
                template_name => "metricbeat-7.0.0"
        }
}

All :

input {
    beats {
        port => 5044
    }
}
filter {
    # Save some fields, remove the other ones, then restore the saved ones (same mapping)
    mutate {
        add_field => {
            "[aux][id]"       => "%{[agent][id]}"
            "[aux][type]"     => "%{[agent][type]}"
            "[aux][hostname]" => "%{[agent][hostname]}"
        }
    }
    prune {
        blacklist_names => ["^ecs", "^host", "^metricset", "^agent","^event"]
    }
    mutate { 
        add_field => {
            "[agent][id]"       => "%{[aux][id]}"
            "[agent][type]"     => "%{[aux][type]}"
            "[agent][hostname]" => "%{[aux][hostname]}"
        }
    }
    prune {
        blacklist_names => ["^aux"]
    }
    if [agent][type] == "metricbeat" {
        mutate {
            add_field => {
                "[aux1]" => "%{[event][dataset]}"
                "[aux2]" => "%{[event][module]}"
            }
        }
        prune {
            blacklist_names => ["^event"]
        }
        mutate {
            add_field => {
                "[event][dataset]" => "%{[aux1]}"
                "[event][module]"  => "%{[aux2]}"
            }
        }
        prune {
            blacklist_names => ["^aux"]
        }
    }
    else if [agent][type] == "filebeat" {
        if [message] =~ "^#" {
            drop { }
        }
        grok {
            match => { "message" => *Same grok expression working fine* }
            remove_field => ["message"]
        }
        date {
            match => [ "Timestamp", "yyyy-MM-dd HH:mm:ss"]
            timezone => "UTC"
        }
    }
}
output {
    if [agent][type] == "metricbeat" {
        elasticsearch {
            hosts => ["xx.xx.xx.xx:9200"]
            index => "metricbeat-%{+xxxx.ww}"
            template_name => "metricbeat-7.0.0"
        }
    }
    else if [agent][type] == "filebeat" {
        elasticsearch {
            hosts => ["xx.xx.xx.xx:9200"]
            index => "iis-%{+yyyy.MM.dd}"
            template_name => "iis"
        }
    }
}

Finally, the logstash.yml :

path.data: /var/lib/logstash
path.logs: /var/log/logstash

I'm working with Debian 9.8, 4GB RAM and 8 cores.

(Sorry for my bad English)

Badger · May 17, 2019, 1:08pm

I cannot speak to the performance impact of using multiple pipelines, but there are some things in your filters that can probably be improved.

    prune {
            blacklist_names => ["^event"]
    }

This causes prune to iterate over every field in the event and remove from the hash any fields whose names start with event. It appears from the preceding filter that the [event] field is an object. So

mutate { remove_field => [ "event" ] }

will remove the [event] field and all its sub-fields without iterating over every other field in the event, so I would expect it to be cheaper.

Also, you are using [auxN] to save fields, then restoring them and pruning the [auxN] fields. Instead of this, I would use mutate+add_field to copy fields from [event] to subfields of [@metadata][event], then mutate+remove_field to remove [event], and mutate+copy to copy the entire [@metadata][event] field back into the main event. Like this:

    mutate { add_field => { "[event][a]" => 1 "[event][b]" => 2 "[event][c]" => 3 } }
    mutate { add_field => { "[@metadata][event][a]" => "%{[event][a]}" "[@metadata][event][c]" => "%{[event][c]}" } }
    mutate { remove_field => [ "[event]" ] }
    mutate { copy => { "[@metadata][event]" => "[event]" } }

There is no need to remove [@metadata][event]; it gets discarded by the output. Again, this should be faster because it avoids iterating over all the fields. Do not try to combine all the mutate filters into one: mutate applies its operations in a fixed order, and it is unlikely to be the one you want.
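
Applied to the [agent] fields saved in the main pipeline of the original post, the same idea might look like the sketch below. It is untested and rests on one assumption: that ecs, host, metricset, agent and event are the only top-level fields the prune patterns were meant to match. Unlike prune with "^host", remove_field only removes the exact names listed, so a hypothetical field such as hostname would survive.

```
filter {
    # Stash the three agent sub-fields under [@metadata], which outputs do not ship.
    mutate {
        add_field => {
            "[@metadata][agent][id]"       => "%{[agent][id]}"
            "[@metadata][agent][type]"     => "%{[agent][type]}"
            "[@metadata][agent][hostname]" => "%{[agent][hostname]}"
        }
    }
    # Remove whole top-level objects by name (assumed list, see above)
    # instead of regex-matching every field name with prune.
    mutate {
        remove_field => [ "ecs", "host", "metricset", "agent", "event" ]
    }
    # Restore [agent] with only the saved sub-fields.
    mutate {
        copy => { "[@metadata][agent]" => "[agent]" }
    }
}
```

The three mutate blocks are kept separate on purpose, for the ordering reason given above.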

pastechecker · May 17, 2019, 2:02pm

Did you try the output isolator pattern with persistent queues on your machine?
I wonder how that would compare to the distributor pattern in your case.
If you let Logstash assume the defaults by commenting out pipeline.workers: 1, what performance do you get?
Do you have the default JVM heap settings?
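
To make the pipeline.workers suggestion concrete, here is a sketch of the pipelines.yml from the original post with every pipeline.workers line commented out. When the setting is absent, Logstash defaults it to the number of CPU cores on the host, so on the 8-core machine described in the original post each pipeline would get 8 workers. The ids and paths are the ones from the original post; nothing else is changed.

```yaml
# pipelines.yml: same three pipelines, worker counts left to the default
- pipeline.id: main
  path.config: "/etc/logstash/main_pipeline.conf"
  # pipeline.workers: 1

- pipeline.id: filebeat
  path.config: "/etc/logstash/filebeat_pipeline.conf"
  # pipeline.workers: 6

- pipeline.id: metricbeat
  path.config: "/etc/logstash/metricbeat_pipeline.conf"
  # pipeline.workers: 1
```

One thing this makes easy to check is whether the single worker on the main pipeline, through which every event has to pass, was the limiting factor. That is a hypothesis to test, not a confirmed diagnosis.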

Oxyds · May 20, 2019, 7:37am

Thanks for your reply

It does improve performance, but only by a tiny bit (550 doc/sec now).

Oxyds · May 20, 2019, 7:49am

I did not try the output isolator pattern since I'm sending the documents to the same Elasticsearch instance. I just want to parse the documents depending on the source.

However, if I let Logstash assume the default settings, performance skyrockets to 30 000 doc/sec.
I made a mistake by not trying that first. Sorry...
But, well, I don't understand this difference in performance at all...

Finally, apart from the heap size, I do use the default jvm.heap settings.

Thanks for your help

system · June 17, 2019, 7:49am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
