Logstash ingestion performance for log management

Hi guys,

I am having performance issues with the ingestion of data into Elasticsearch through Logstash.

My ingestion setup is the following:

  • Log lines are read from files by Filebeat
  • Filebeat sends the log lines to Logstash for parsing and enrichment (in particular, a kv filter plugin extracts key-value pairs from the log message body; a simplified sketch of this filter follows the list)
  • Logstash sends the enriched events to Elasticsearch
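
For reference, the kv filter is along these lines (a simplified sketch: the source field, separators and target shown here are placeholders, not my exact configuration):

kv {
	source      => "message"   # the raw log line shipped by Filebeat
	field_split => " "         # key-value pairs in the body are separated by spaces
	value_split => "="         # keys and values are separated by '='
	target      => "parsed"    # put the extracted pairs under a single object
}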

My project requires near real-time availability of the enriched log messages in Elasticsearch.
By near real-time I mean delays of no more than 10 seconds. Instead, I usually get delays of 30 seconds to a few minutes, and sometimes even delays on the order of 10 minutes, which is not acceptable.

Architecture is the following:

  • multiple Filebeat instances installed on the machines where the logs are written
  • 1 dedicated host where I put the Elasticsearch node (my cluster is composed of just 1 data node) and 1 instance of Logstash

I am not allowed to install more Elasticsearch or Logstash nodes. For the moment I am stuck with 1 ES node and 1 Logstash node.

I have set the following properties for the pipeline:

   pipeline.workers: 4       # filter/output worker threads (the default is one per CPU core)
   pipeline.batch.size: 100  # maximum events a worker collects from the inputs per batch
   pipeline.batch.delay: 1   # milliseconds to wait before flushing an undersized batch to the workers
   queue.type: memory        # in-memory queue between inputs and filters, i.e. no persistent queue

To monitor the performance of the entire ingestion process, I activated X-Pack Monitoring for Elasticsearch, Logstash and Filebeat, so I am able to see metrics like:

  • Indexing Rate and Indexing Latency for Elasticsearch
  • Event Emitted Rate and Event Latency for Logstash
  • Event Emitted Rate for Filebeat.

I tried to get some information about the typical processing time inside the Logstash filter stage by adding "start" and "end" timestamps at the beginning and end of the filter section, like in the following:

input {
	beats {
		port => 5044
	}
}
filter {
	# timestamp taken as soon as the event enters the filter section
	ruby {
		init => "require 'time'"
		code => 'event.set("logstash_start_timestamp", Time.now())'
	}

	### START USUAL ELABORATION

	kv {
		...
	}
		
	### END USUAL ELABORATION

	# timestamp taken at the end of the filter section; the difference with the
	# start timestamp (in seconds) is stored as logstash_elaboration_time
	ruby {
		init => "require 'time'"
		code => 'event.set("logstash_end_timestamp", Time.now())
				tend = event.get("logstash_end_timestamp").to_f
				tstart = event.get("logstash_start_timestamp").to_f
				diff = tend - tstart
				event.set("logstash_elaboration_time", diff)'
	}
}
output {	
	elasticsearch {
		hosts => ["localhost:9200"]
		index => "my_index"
		pipeline => "add_ingest_timestamp"
	}
}

I also wanted to measure the time between the end of the Logstash filter processing and the moment when the event actually reaches Elasticsearch for indexing.
So I used an ingest pipeline (the add_ingest_timestamp pipeline referenced in the output above) that computes the difference between the value of logstash_end_timestamp and the value of the meta-field _ingest.timestamp. This difference is then added to the event as a numeric field.

Over a time window of ~15 minutes during which the input stream from Filebeat was fairly high, I got the following stats from the monitoring UI:

  • Logstash:

    • CPU Usage: 26%
    • 5 min average load: 2
    • Heap used: 17%
    • Events Ingested: 800
    • peak Event Emitted Rate: 25 events/s
    • peak Event Latency: 3200 ms
  • Elasticsearch:

    • CPU Usage: 10%
    • Load average: 2.2
    • Heap used: 25%
    • minimum Indexing Rate: 30 events/s
    • peak Indexing Latency: 0.25 ms

I did not include Filebeat monitoring metrics since I think it is the fastest component of the three.

For the Logstash processing time and the ingest delay (obtained as explained above), over the same time window I got approximately the following averages per event:

  • Logstash processing time (the logstash_elaboration_time field): 30000 ms
  • ingest delay: 1250 ms

So it seems that most of the delay is due to what happens inside the Logstash filter stage.
What could be the cause of this slowdown?

I tried tuning the pipeline.workers, pipeline.batch.size and pipeline.batch.delay properties, but nothing changed substantially.
I also increased the committed heap memory for Elasticsearch and Logstash, but again I did not see any improvement.

Does anyone have suggestions on how to speed-up the ingestion?

Logstash can only process as much as Elasticsearch is able to consume, so the first step is generally to check that Elasticsearch is not limited by CPU or disk I/O or suffering frequent or long GC.

What resources have been allocated to the components? What type of hardware are they deployed on? What type of storage is used?

Hi Christian,

thanks for the reply.

Here are some details about the host where I installed ELK.

The processor model: Intel Xeon CPU E5-2683 v4 @ 2.10GHz
Number of cores: 4
OS: Red Hat Enterprise Linux Server 7.5
Total Memory: 32GB

On this same host I run 1 Elasticsearch node, 1 Logstash instance and 1 Kibana instance, with committed heap of 4GB, 2GB and 2GB, respectively.
I cannot give them more memory because other processes run on this host besides the ELK software.

The file system where I store the data is XFS, with a total capacity of 100GB. Right now the Elasticsearch data node occupies ~8GB.

Over a period of 24 hours, with a steady flow of input coming from the Filebeat instances running on the monitored hosts, I registered the following resource consumption (data from X-Pack Monitoring):

Elasticsearch:

  • CPU usage:
    • average: 8%
    • max: 30%
  • JVM Heap usage:
    • average: 49% of committed
    • max: 76% of committed
  • System Load (10 min average):
    • average: 4.45
    • max: 9.25

Logstash:

  • CPU usage:
    • average: 26%
    • max: 30%
  • JVM Heap usage:
    • average: 17% of committed
    • max: 23% of committed
  • System Load (10 min average):
    • average: 2
    • max: 9

Where can I get metrics about garbage collection and disk I/O for the Elasticsearch JVM?

However, as I wrote in my original post, the bottleneck of the ingestion process seems to be the processing performed by the Logstash pipeline filter plugins, in particular a kv filter.

Is there some way to improve the performance of the Logstash pipeline, other than tuning the pipeline.* properties?

If Elasticsearch is running at 8% CPU but seeing a load average of over 4 on a 4-CPU server, I would lean towards it being I/O limited. Hard to be sure without a lot more data.


Hi @Badger,

as you suggested, disk I/O limitations could be the cause of the poor Elasticsearch indexing rate and of the high indexing latency.
How can I monitor the duration of the disk I/O operations performed by ES?

However, how does disk I/O impact Logstash performance? I thought Logstash made minimal use of the disk unless persistent queues are used.

What I observe (as I wrote in the original post) is that most of the delay in the overall ingestion process is due to what happens inside the Logstash pipeline, as evidenced by the high Event Latency (collected by X-Pack Monitoring) and by the custom metric logstash_elaboration_time, which I compute by inserting start and end timestamps in the filter { } section of the pipeline.
I must admit, though, that I am not quite sure the interval I am measuring is exactly the time it takes for an event to be processed (i.e. from the moment it is read from the input queue to the moment it is handed to the output stage of the pipeline).

Moreover, the Pipeline Viewer UI in Kibana does not help much.
Most of the time the CPU usage and event latency values for the individual filter plugins are not highlighted as anomalous.

Logstash cannot write to Elasticsearch faster than Elasticsearch can index the data. It might be informative to swap out the elasticsearch output and replace it with a file output writing to /dev/null, and see what the Logstash performance looks like then.
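
For example, something along these lines (just a sketch; the codec choice is arbitrary since we only care about pipeline throughput):

output {
	file {
		path  => "/dev/null"    # discard events after they have gone through the pipeline
		codec => json_lines     # cheap serialization; we are only measuring throughput
	}
}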

Hi @Badger,

I tried replacing the elasticsearch output with a file output writing to /dev/null, as you suggested.
Indeed, I observed a decrease in the total ingestion time, but still not enough. I was still getting typical times on the order of 5000 ms (average Logstash Latency from X-Pack Monitoring), with peaks of ~10 seconds. So in practice I only eliminated the time spent by Elasticsearch indexing documents, which was the smaller contribution to the total time.

Finally, I was allowed to install additional Logstash instances on other machines, so I could redistribute the load. This greatly decreased the typical ingestion time compared to the original setup.
I got approximately:

  • Logstash latency: 2000ms
  • Logstash processing time (logstash_elaboration_time, measured via the pipeline itself as explained in the OP): ~500 ms

At the end of the day, I figured out that, with the available infrastructure, the best way to speed up the ingestion process is to run multiple Logstash instances on different hosts, while still keeping just 1 Elasticsearch node.
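
For completeness: on the Filebeat side, redistributing the load basically amounts to listing the additional Logstash hosts in the Beats output and enabling load balancing, roughly like this in filebeat.yml (the hostnames here are placeholders):

output.logstash:
  hosts: ["logstash-host-1:5044", "logstash-host-2:5044", "logstash-host-3:5044"]
  loadbalance: true   # spread event batches across all the listed Logstash hosts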
