Starting Point to Tune Logstash

Hi there

I have a dedicated Logstash machine with 16 cores and 32 GB RAM and multiple pipelines inside it. There are 3 heavy pipelines (complex filters and a high event rate).

What is a good starting point for a global config in logstash.yml for each pipeline, plus a custom config for those 3 heavy pipelines in pipelines.yml?

The total number of pipelines on my machine is 26 (including those 3).

Should I keep the default values (16 and 125) for pipeline.workers and pipeline.batch.size in logstash.yml?

Thanks

If you have a pipeline worker thread with work to do, then it needs a CPU to execute on. If you have 26 pipelines, each with multiple worker threads, then as the system runs out of CPU it will spend more and more time context switching between threads. As a result it will not scale linearly as load increases.
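To put rough numbers on that: with the defaults on this machine, 26 pipelines × 16 workers would be 416 worker threads competing for 16 cores, before counting input, output, and JVM housekeeping threads.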

My advice is to minimize the number of worker threads (possibly as low as 1 per pipeline). If you feel some pipelines need additional workers, then add them and measure the system at high load to see whether the additional workers cause an increase or decrease in throughput.
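A minimal sketch of that starting point in pipelines.yml (the ids and paths below are placeholders, not your real pipelines):

- pipeline.id: some-light-pipeline          # placeholder id
  path.config: "/logstash/pipelines/light.conf"
  pipeline.workers: 1                       # start minimal
- pipeline.id: some-heavy-pipeline          # placeholder id
  path.config: "/logstash/pipelines/heavy.conf"
  pipeline.workers: 2                       # raise only if measured throughput improves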

Noted, thank you

But do you think I should also adjust pipeline.batch.size, or keep the default as a starting point?

I would start with the default value.

OK, thank you so much. I will start tuning with that config.

Also, you should monitor the LS statistics. That way you can see how events are processed when you change a parameter.
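For example, the per-pipeline flow metrics can be read from the Logstash node stats API (assuming the default API port, 9600):

curl -s 'http://localhost:9600/_node/stats/pipelines?pretty'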

Hi @Rios, I followed your suggestion and found that Logstash takes a long time to publish events to Elastic. As you can see here, I separate the output config by whether the index is monthly, weekly, or daily:

        "outputs" : [ {
          "id" : "ocp-daily",
          "documents" : {
            "dlq_routed" : 1216,
            "successes" : 674948
          },
          "bulk_requests" : {
            "with_errors" : 1163,
            "successes" : 75063,
            "responses" : {
              "200" : 76226
            }
          },
          "events" : {
            "duration_in_millis" : 3444186,
            "in" : 676164,
            "out" : 676164
          },
          "name" : "elasticsearch",
          "flow" : {
            "worker_millis_per_event" : {
              "current" : 6.683,
              "last_1_minute" : 5.304,
              "last_5_minutes" : 5.224,
              "last_15_minutes" : 5.587,
              "last_1_hour" : 5.334,
              "lifetime" : 5.094
            },
            "worker_utilization" : {
              "current" : 7.488,
              "last_1_minute" : 7.563,
              "last_5_minutes" : 6.668,
              "last_15_minutes" : 6.895,
              "last_1_hour" : 7.046,
              "lifetime" : 6.96
            }
          }
        }, {
          "id" : "ocp-monthly",
          "documents" : {
            "dlq_routed" : 357,
            "successes" : 223974
          },
          "bulk_requests" : {
            "with_errors" : 268,
            "successes" : 62649,
            "responses" : {
              "200" : 62917
            }
          },
          "events" : {
            "duration_in_millis" : 2218478,
            "in" : 224331,
            "out" : 224331
          },
          "name" : "elasticsearch",
          "flow" : {
            "worker_millis_per_event" : {
              "current" : 10.32,
              "last_1_minute" : 10.4,
              "last_5_minutes" : 9.649,
              "last_15_minutes" : 9.987,
              "last_1_hour" : 10.21,
              "lifetime" : 9.889
            },
            "worker_utilization" : {
              "current" : 4.293,
              "last_1_minute" : 4.5,
              "last_5_minutes" : 4.232,
              "last_15_minutes" : 4.446,
              "last_1_hour" : 4.509,
              "lifetime" : 4.483
            }
          }
        } ]

I don't understand why Logstash takes so long to publish an event. You can see that Logstash needs about 10 ms per event on the monthly output. My configuration is simple. What else can I tune from this?

 if [merge] == "month" {
  elasticsearch {
    id => "ocp-monthly"
    hosts => ["https://host1:9215", "https://host2:9215", and so on]
    ssl_certificate_authorities => '/logstash/logstash-8.18.2/config/certs/ca.crt'
    ssl_verification_mode => 'none'
    index => "ocp-sby-%{[kubernetes][namespace]}-%{+YYYY.MM}"
    document_id => "%{[custom_id]}"
    user => '${ES_USER}'
    password => '${ES_PWD}'
   }
 }
 else if [merge] == "weekly" {
  elasticsearch {
    hosts => ["https://host1:9215", "https://host2:9215", and so on"]
    ssl_certificate_authorities => '/logstash/logstash-8.18.2/config/certs/ca.crt'
    ssl_verification_mode => 'none'
    index => "ocp-sby-%{[kubernetes][namespace]}-%{+xxxx.ww}"
    document_id => "%{[custom_id]}"
    user => '${ES_USER}'
    password => '${ES_PWD}'
   }
 }
 else {
  elasticsearch {
    id => "ocp-daily"
    hosts => ["https://host1:9215", "https://host2:9215", and so on"]
    ssl_certificate_authorities => '/logstash/logstash-8.18.2/config/certs/ca.crt'
    ssl_verification_mode => 'none'
    index => "ocp-sby-%{[kubernetes][namespace]}-%{+YYYY.MM.dd}"
    document_id => "%{[custom_id]}"
    user => '${ES_USER}'
    password => '${ES_PWD}'
   }
}

I took a quick look at the monitoring cluster, and it turns out that almost all pipelines are taking a long time to publish events to Elastic.

My Elastic cluster consists of 7 data nodes with 16 GB of heap each.

Thanks

It's not easy to get details from your posts. To summarize:

  • 23 pipelines + 3 under heavy load on 16 cores & 32 GB memory

  • ~10 ms is the event processing time for the heavy pipelines

  • everything is on 1 LS host

  • pipeline.workers and pipeline.batch.size are default?

pipeline.workers: 16 # by default LS uses one worker per core, so max 16 on this machine
pipeline.batch.size: 125
pipeline.batch.delay: 50
pipeline.ordered: auto
  • What are your Xms and Xmx values in jvm.options?
  • Are all settings the same for all pipelines? What is specific to those 3 pipelines?
  • Are you using the memory or persistent queue?
  • How many ES data nodes are you using?
  • Have you checked the ES logs? Especially because of the slow inserts and DLQ routing.
  • What is the avg/max message size in those heavily loaded pipelines?

Since you said "my configuration was simple", meaning there is not much code in the filters, you can try:
Edit: make a backup before any changes.

pipeline.batch.size: 250 # only for the heavy pipelines
compression_level => 5 # or higher, to reduce network load at the cost of CPU
ssl_enabled => true # yes, should be on by default
pool_max => 2000 # increase to reduce connection reopening
pool_max_per_route => 200
ssl_supported_protocols => "TLSv1.3" # use only 1.3, should be faster to establish a secure channel
resurrect_delay => 2
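Note that pipeline.batch.size belongs in pipelines.yml (or logstash.yml), while the rest are options of the elasticsearch output plugin. A sketch of how they could sit in your existing ocp-monthly output (the values are only starting points to measure against, not a definitive tuning):

 if [merge] == "month" {
  elasticsearch {
    id => "ocp-monthly"
    hosts => ["https://host1:9215", "https://host2:9215", and so on]
    # ... keep your existing ssl, index, document_id, user and password settings ...
    compression_level => 5                  # compress bulk request bodies; more CPU, less network
    pool_max => 2000                        # allow more pooled connections (default 1000)
    pool_max_per_route => 200               # per-host connection limit (default 100)
    ssl_supported_protocols => ["TLSv1.3"]  # negotiate only TLS 1.3
    resurrect_delay => 2                    # re-check dead nodes sooner (default 5s)
   }
 }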
  • Exclude dedicated master nodes from the hosts list
  • Check the ES logs (all nodes) to see why you are getting dlq_routed; you can do it manually or with Metricbeat or Agent
  • Use sniffing mode; check this thread.
  • Investigate the LS statistics for all pipelines
  • Check the value of tcp_keepalive_time; only check it, do not touch it at the OS level.
  • Allocate 2-3 nodes only to the heavily loaded pipelines; the other pipelines should use the remaining nodes.

This is not a simple optimization activity since it's on live data & load, where the Jedi council doesn't have full information or access. I truly hope other Jedi will give their own opinions.

Have you used live pipeline monitoring in Kibana? If it's not already enabled, you should set it up:

PUT _cluster/settings
{
  "persistent": {
    "xpack.monitoring.collection.enabled": true
  }
}

I don't know if it is relevant, but I had to change the open file limit on Linux for a heavily loaded pipeline.

cat /usr/lib/systemd/system/logstash.service |grep NOFILE
LimitNOFILE=66384

The default is 16k (16384).
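One way to raise it without editing the packaged unit file is a systemd drop-in (a sketch; 66384 is just the value shown above):

sudo systemctl edit logstash      # opens an override.conf drop-in for the service
# add these two lines, then save:
#   [Service]
#   LimitNOFILE=66384
sudo systemctl daemon-reload
sudo systemctl restart logstash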


I have 2 Logstash servers.

This configuration is in logstash.yml:
pipeline.workers: 8
pipeline.batch.size: 125
# pipeline.batch.delay left at the default

In pipelines.yml, I made a custom config for those heavy pipelines:

- pipeline.id: api
  path.config: /logstash/logstash-8.18.2/pipelines/api.conf
  pipeline.workers: 10
  pipeline.batch.size: 256
  queue.type: persisted
  queue.max_bytes: 8gb
- pipeline.id: ocp-sby
  path.config: /logstash/logstash-8.18.2/pipelines/ocp-sby.conf
  pipeline.workers: 10
  pipeline.batch.size: 256
- pipeline.id: ocp-jkt
  path.config: /logstash/logstash-8.18.2/pipelines/ocp-jkt.conf
  pipeline.workers: 10
  pipeline.batch.size: 256

16 GB (for both Xms and Xmx).

I put all 7 data nodes in all the pipelines. Of course, 3 of them are master-eligible nodes too; I don't have dedicated master nodes.

The logs seem fine. On one of the master-eligible nodes, I don't see anything that would cause Elastic to slow down, such as frequent GC or anything similar. Most of the entries were mapping error logs, and I don't have the DLQ activated either.

From the monitoring cluster, I can see:
For the API pipeline, around 7k events per second
For OCP pipelines, around 3k events per second

Unfortunately, I don't have it in my cluster, and I may try the pool configuration first to see if it has an effect. Thanks.

Don't worry, my monitoring cluster already has it