Optimizing Elastic-Agent: How to Reliably Handle 20,000 EPS and Beyond

First Test Scenario

Setup:
Apache JMeter (192.168.4.171) generates 20,000 EPS -> Elastic-Agent (192.168.4.1:514) -> Logstash (192.168.4.1:5044).

Observation:
Logs are not making it from Logstash to Elasticsearch. Since the same Logstash-to-Elasticsearch path handles this load without issue in the second test below, I believe the problem is not with the Elasticsearch cluster.

Additionally, using netstat -su, I noticed a continuous increase in packet receive errors and receive buffer errors.
I suspect the problem lies with Elastic-Agent, as it might be struggling to handle the EPS load.

Logstash pipeline input receiving from Elastic Agent (Elastic-Agent.conf):

input {
  elastic_agent {
    port => 5044
  }
}

Elastic-Agent.yml:

outputs:
  6117a5a8-bc80-4e8a-9a8c-d8467fc1f481:
    type: logstash
    bulk_max_size: 5000
    worker: 16
    queue.mem.events: 100000
    queue.mem.flush.min_events: 5000
    queue.mem.flush.timeout: 0.1
    compression_level: 1
    idle_connection_timeout: 30
    hosts:
      - '192.168.4.1:5044'

Logstash.yml:

pipeline.workers: 16
pipeline.batch.size: 5000
pipeline.batch.delay: 1

Question:
How can I optimize Elastic-Agent to reliably handle 20,000 EPS, or even higher EPS rates?

Second Test Scenario

Setup:
Apache JMeter (192.168.4.171) generates 20,000 EPS -> Logstash (192.168.4.1:5044) -> Elasticsearch cluster.

Observation:
Logstash successfully processes and forwards 20,000 EPS to the Elasticsearch cluster without dropping any packets.
Using netstat -su, I observed no increase in packet receive errors or receive buffer errors. This suggests that the issue is not related to Linux system configuration.

Logstash.yml:

pipeline.workers: 16
pipeline.batch.size: 5000
pipeline.batch.delay: 1

Logstash.conf:

input {
  udp {
    port => 5044
  }
}

output {
  elasticsearch {
    hosts => ["https://10.1.1.3:9200"]
    data_stream => "true"
    user => "elastic"
    password => "password"
    ssl => true
    ssl_certificate_verification => false
  }
}

Server Specs: 192.168.4.1 (16 vCPUs, 16 GB RAM).

I would appreciate insights into optimizing Elastic-Agent for this high EPS scenario. Thanks!

Reply

A couple of specific notes on the settings you have configured at the moment:

The queue.mem.flush.timeout of 0.1 is an extremely low value for this setting. You should increase it to at least 1s, but 5s would be more appropriate for best performance.

The ideal value for queue.mem.events is queue.mem.flush.min_events * worker * 2, which in your case is 5000 * 16 * 2 = 160,000.
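Putting those two recommendations together, the output section of your standalone policy might look roughly like this. This is just a sketch based on the values you already have; the output ID and hosts are copied from your config, and 5s is one reasonable choice for the flush timeout:

outputs:
  6117a5a8-bc80-4e8a-9a8c-d8467fc1f481:
    type: logstash
    hosts:
      - '192.168.4.1:5044'
    bulk_max_size: 5000
    worker: 16
    # flush.min_events * worker * 2 = 5000 * 16 * 2
    queue.mem.events: 160000
    queue.mem.flush.min_events: 5000
    # raised from 0.1 so batches can fill before being flushed
    queue.mem.flush.timeout: 5s
    compression_level: 1
    idle_connection_timeout: 30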

Finally, you may want to set the "Read Buffer Size" setting for UDP to something like 1 or 5 megabytes.
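If you are running the agent standalone, that corresponds to the read_buffer option of the underlying UDP input (Fleet exposes it as an advanced setting on the UDP integration). A rough sketch, with hypothetical input and dataset names, assuming the agent listens on UDP 514 as in your first scenario:

inputs:
  - id: udp-jmeter-syslog                # hypothetical input ID
    type: udp
    use_output: 6117a5a8-bc80-4e8a-9a8c-d8467fc1f481
    streams:
      - data_stream:
          dataset: udp.generic           # hypothetical dataset name
        host: '0.0.0.0:514'
        read_buffer: 5MiB                # the "Read Buffer Size" mentioned above

Keep in mind that the kernel caps socket receive buffers at net.core.rmem_max, so a buffer of several megabytes may also require raising that sysctl.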

The easiest way to ensure that Logstash is not the problem would be to have Elastic Agent write directly to Elasticsearch -- have you attempted this yet?
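For reference, a standalone elastic-agent.yml output pointing straight at Elasticsearch could look something like the sketch below. It reuses the host and credentials from your Logstash output; disabling certificate verification mirrors your current Logstash config and should only be used for testing:

outputs:
  default:
    type: elasticsearch
    hosts:
      - 'https://10.1.1.3:9200'
    username: 'elastic'
    password: 'password'
    # testing only; verify certificates properly in production
    ssl.verification_mode: none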

For additional troubleshooting, it would be useful if you could share a copy of the Elastic Agent log from the benchmark run, along with system metrics (CPU usage, for example) from the Elastic Agent node during that run. You cannot upload logs directly on this forum, so I would recommend putting them in a GitHub gist or a similar sharing solution.

It is very important to note that log delivery will always be unreliable over UDP. When the receiver is overwhelmed, it will drop messages regardless of whether you are using Elastic Agent or Logstash.

If log delivery guarantees are important to you, then you should use TCP or, preferably, write the logs to a file that Elastic Agent can monitor, for example with a filestream input as sketched below.
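A minimal sketch of the file-based approach, assuming the log producer (or a local syslog daemon) writes events to /var/log/jmeter/events.log (a hypothetical path) and the agent runs standalone:

inputs:
  - id: jmeter-file-input                # hypothetical input ID
    type: filestream
    use_output: 6117a5a8-bc80-4e8a-9a8c-d8467fc1f481
    streams:
      - id: jmeter-events                # hypothetical stream ID
        data_stream:
          dataset: jmeter.events         # hypothetical dataset name
        paths:
          - /var/log/jmeter/events.log

Because the events are persisted on disk and the filestream input tracks its read offset, a temporarily overwhelmed agent can catch up later instead of dropping data.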