Optimizing Logstash Performance: Troubleshooting Instability at 10,000 EPS During Stress Tests

I am currently using an architecture with Elastic-Agent for log collection and Logstash for log forwarding. I am conducting stress testing to evaluate the hardware requirements and costs for my collector setup (Elastic-Agent + Logstash) using Apache JMeter to simulate EPS rates of 2000, 5000, and 10000.

The setup is as follows:
Apache JMeter (192.168.3.170) -> Elastic-Agent-[Custom UDP Logs] (192.168.3.172:515) -> Logstash (192.168.3.172:5044) -> Elasticsearch (192.168.3.173:9200).

My current collector setup includes:

  • 4 CPU cores
  • JVM heap: 4 GB
  • Batch size: 125

When simulating 2000 and 5000 EPS, the Logstash Monitoring curves are relatively stable, and the event drop rate does not fluctuate much. However, at 10000 EPS, the Logstash Monitoring curve becomes unstable, and there is significant event drop variation.

(Logstash monitoring screenshots: (1) 2000 EPS, (2) 5000 EPS, (3) 10000 EPS)

I am aware that the event reception rate of Logstash could be affected by factors such as Elasticsearch’s disk performance, shard configuration, and network conditions.

I am looking for a more straightforward and effective way to quickly identify the cause of instability at 10000 EPS and evaluate the hardware requirements and costs for the Elastic-Agent + Logstash collector setup more accurately.
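
A quick way to narrow this down during each run (a sketch, assuming the Logstash monitoring API is enabled on its default port 9600 on the Logstash host) is to sample the per-pipeline node stats and compare where the time is being spent:

# Pull per-pipeline event and plugin statistics from the Logstash monitoring API.
curl -s 'http://localhost:9600/_node/stats/pipelines?pretty'

# events.queue_push_duration_in_millis growing much faster than events.duration_in_millis
# means the inputs are blocked waiting for free workers, i.e. the pipeline cannot keep up.
# A large duration_in_millis on the elasticsearch output plugin points at indexing-side
# backpressure (disk, shards, network to Elasticsearch) rather than at Logstash itself.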

logstash-udp.conf

input {
  elastic_agent {
    port => 5044
    ssl_enabled => true
    ssl_certificate_authorities => ["/etc/logstash/certs/elasticsearch-ca.pem"]
    ssl_certificate => "/etc/logstash/certs/logstash.crt"
    ssl_key => "/etc/logstash/certs/logstash.pkcs8.key"
    ssl_client_authentication => "required"
  }
}

output {
  elasticsearch {
    hosts => ["https://192.168.3.171:9200"]
    data_stream => "true"
    user => "elastic"
    password => "password"
    cacert => "/etc/logstash/certs/elasticsearch-ca.pem"
  }
}

Can you clarify what in the many images you posted shows that? I am not seeing any instability.

I used Apache JMeter to send 2000, 5000, and 10000 events per second and observed the event volume in Kibana along with the Logstash monitoring curves.

The difference between the number of events sent and the number actually received was small at 2000 and 5000 EPS, but much larger at 10,000 EPS:

  1. EPS 2000 sustained for 30 minutes = 3,600,000 events; actual received volume: 3,596,929 hits (~0.1% loss)

  2. EPS 5000 sustained for 30 minutes = 9,000,000 events; actual received volume: 8,974,598 hits (~0.3% loss)

  3. EPS 10,000 sustained for 30 minutes = 18,000,000 events; actual received volume: 16,312,575 hits (~9.4% loss)

Additionally, the curve at EPS 10,000 shows significant instability.

I don't really agree with that, but I think you are seeing delays caused by GC cycles.
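
A quick way to confirm or rule that out (a sketch, assuming shell access to the Logstash host and that the bundled JDK's jstat is on the PATH) is to sample GC activity while the 10,000 EPS test is running:

# Sample JVM heap occupancy and GC counters every second during the test.
# <logstash-pid> is a placeholder for the Logstash Java process ID (e.g. from pgrep -f logstash).
# Old-gen utilisation (O) pinned near 100% and a climbing full-GC count (FGC) would
# support the GC-pause explanation.
jstat -gcutil <logstash-pid> 1000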

Besides what Badger said, you also need to consider that with more events, Elasticsearch performance will have more influence.

You will need to start changing the batch size; 125 is pretty low once you start getting more events, as it results in more requests to Elasticsearch with smaller batches.
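
As an illustration only (500 and 4 are starting points to experiment with, not tuned recommendations), the batch size and worker count can be overridden on the command line when running Logstash in the foreground; the equivalent persistent settings are pipeline.batch.size and pipeline.workers in logstash.yml:

# Run the existing pipeline with a larger batch size and an explicit worker count.
# -b / --pipeline.batch.size: events each worker collects before flushing to the outputs.
# -w / --pipeline.workers: number of parallel pipeline workers (matched to the 4 CPU cores here).
# The config path is assumed; adjust it to wherever logstash-udp.conf actually lives.
bin/logstash -f /etc/logstash/conf.d/logstash-udp.conf -b 500 -w 4

Larger batches mean fewer, bigger bulk requests to Elasticsearch, which is usually cheaper than many small ones; heap usage per worker grows accordingly, so it is worth watching the 4 GB heap while increasing it.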