Speed up the pipeline-2

I have a setup of 3 machines:

  1. A single ES process with a 20 GB heap and 16 processors
  2. A Logstash process with a 20 GB heap, connecting to the ES on machine 1
  3. Filebeat pumping data to the Logstash process on machine 2

Increasing the ES heap size from the 2 GB default to 20 GB did not speed up the pipeline. Splitting the log file into multiple smaller parts to increase Filebeat parallelism didn't help. Neither did increasing bulk_max_size to 8192. Any suggestions on how I can speed up the pipeline?

On the elasticsearch box, I got the following stats:

"cpu" : {
"percent" : 5,
"load_average" : {
"1m" : 2.93,
"5m" : 2.16,
"15m" : 2.04
}
}

"mem" : {
"free_percent" : 89,
"used_percent" : 11
}

"threads" : {
"count" : 308,
"peak_count" : 315
}

On the logstash box, the hottest thread was the following:
64.53 % of cpu usage, state: waiting, thread name: 'nioEventLoopGroup-4-1', thread id: 155
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)

The same thread id remained the hottest throughout; it started at 11% and peaked at 65%.

The stats obtained are as follows:
"jvm" : {
"threads" : {
"count" : 136,
"peak_count" : 138
}

"process" : {
"open_file_descriptors" : 770,
"peak_open_file_descriptors" : 771,
"max_file_descriptors" : 4096,
"cpu" : {
"total_in_millis" : 9160760,
"percent" : 0,
"load_average" : {
"1m" : 0.18,
"5m" : 0.82,
"15m" : 1.12
}
}
}

The JVM heap usage [i.e. "heap_used_percent"] kept growing from 1-2% up to 13%, which was probably due to GC cycles.

As seen from the stats, neither CPU nor RAM is being fully utilized. So, all available resources are underutilized, yet the pipeline is not getting any faster. What am I missing?

Establish a baseline.

In the Logstash config, remove all filters and outputs; just have a beats input and a stdout output with the dots codec.

input {
  beats {
    ...
  }
}

output {
  stdout { codec => "dots" }
}

Use the LS monitoring API to collect pipeline stats.

Wait for at least 3 minutes (JVM warmup) before making an API call, and make API calls infrequently; they do affect performance slightly.
Use "out" / "duration_in_millis" to get the kEPS (thousands of events per second). For more accuracy, make one API call at 3 minutes and another ~1 minute later, subtract the first measurement from the second for both fields, and then divide the deltas.

Add your first filter to the config and measure performance again (a sketch follows). Continue this systematic exercise until you find which plugin is significantly slowing the throughput.
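
A sketch of the first iteration, assuming grok is the first filter you add back (the pattern below is only a placeholder; reuse the one from your real config):

input {
  beats {
    ...
  }
}

filter {
  grok {
    # placeholder pattern - substitute your own
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  stdout { codec => "dots" }
}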

You might like to post your results here.

@guyboertje thanks for the prompt response. I was just reading your response in the topic "_node/stats - duration_in_millis vs queue_push_duration_in_millis".

I have the pipeline stats from my last run, so I am posting those for now.

"pipeline" : {
    "events" : {
      "duration_in_millis" : 142914507,
      "in" : 57297971,
      "out" : 57297971,
      "filtered" : 57297971,
      "queue_push_duration_in_millis" : 272075
    },
    "plugins" : {
      "inputs" : [ {
        "id" : "fab0b744ca08b9cc7117f287f8d87e4dddfcadeb-2",
        "events" : {
          "out" : 57297971,
          "queue_push_duration_in_millis" : 272075
        },
        "current_connections" : 0,
        "name" : "beats",
        "peak_connections" : 2
      } ],
      "filters" : [ {
        "id" : "fab0b744ca08b9cc7117f287f8d87e4dddfcadeb-4",
        "events" : {
          "duration_in_millis" : 179249,
          "in" : 57297971,
          "out" : 57297971
        },
        "name" : "geoip"
      }, {
        "id" : "fab0b744ca08b9cc7117f287f8d87e4dddfcadeb-3",
        "events" : {
          "duration_in_millis" : 1827832,
          "in" : 57297971,
          "out" : 57297971
        },
        "matches" : 57295891,
        "failures" : 2080,
        "patterns_per_field" : {
          "message" : 1
        },
        "name" : "grok"
      }, {
        "id" : "fab0b744ca08b9cc7117f287f8d87e4dddfcadeb-5",
        "events" : {
          "duration_in_millis" : 499858,
          "in" : 57297971,
          "out" : 57297971
        },
        "matches" : 57295891,
        "name" : "date"
      } ],
      "outputs" : [ {
        "id" : "fab0b744ca08b9cc7117f287f8d87e4dddfcadeb-6",
        "events" : {
          "duration_in_millis" : 4794278,
          "in" : 57297971,
          "out" : 57297971
        },
        "name" : "elasticsearch"
      } ]
    }
}

It seems grok is the plugin that is taking time [Correct me if I am wrong].

I will continue the process of elimination that you suggested. Meanwhile, I would like your suggestion on how this pipeline can be sped up. Since RAM and CPU usage are low, I could add more Logstash processes on the same box. Will that work?

You are correct, grok seems to be the slowest filter, but the elasticsearch output is slower still.

geoip   -> 57297971 /  179249 => 319.6 kEPS
grok    -> 57297971 / 1827832 =>  31.3 kEPS
date    -> 57297971 /  499858 => 114.6 kEPS

es out  -> 57297971 / 4794278 =>  11.9 kEPS

Put in terms of microseconds per event:

geoip   ->   179249 / 57297971 =>  3.1 us per event
grok    ->  1827832 / 57297971 => 31.9 us per event
date    ->   499858 / 57297971 =>  8.7 us per event
es out  ->  4794278 / 57297971 => 83.6 us per event

Grok performance can be improved with one simple change: use anchors.
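
A minimal sketch of what anchoring means (the pattern here is a made-up placeholder, not your actual one; apply the same idea to the pattern you use on "message"):

filter {
  grok {
    # Without ^ the regex engine may retry the match at many positions in the
    # line, so rejecting a non-matching line is expensive. Anchoring to the
    # start (and, where possible, the end with $) makes failures cheap.
    match => { "message" => "^%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}$" }
  }
}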

But you will really see an improvement if you speed up the ES output side.

Do you need grok? Can you use a Dissect + Grok combo?
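
A sketch of that combo, under the assumption that your lines start with a few space-delimited fields (the layout and pattern below are placeholders; adapt them to your real format). Dissect does cheap, delimiter-based splitting of the fixed leading fields, and an anchored grok is applied only to the smaller remainder, if it is needed at all:

filter {
  dissect {
    # positional split on single spaces; no regular expressions involved
    mapping => { "message" => "%{ts} %{level} %{clientip} %{rest}" }
  }
  grok {
    # anchored pattern, applied only to the leftover field
    match => { "rest" => "^%{WORD:verb} %{NUMBER:status}" }
  }
}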


I modified the index refresh interval to 30s and set indices.recovery.max_bytes_per_sec to 100mb, as I am using SSDs. I also dropped the geoip filter. The numbers I got are not much of an improvement; in particular, the critical component, the ES output, did not improve much.

component    out         millis     us per event
grok         57298382    1681016    29.3
date         57298382     431193     7.5
ES out       57298382    4730925    82.6

Also, why does the hot threads API show a CPU usage that keeps increasing for the same thread id? What does it indicate?
