APM server tuning for heavy workload

After deploying an Elastic cluster for tracing, we are seeing the following errors in the APM Server/agents due to heavy workload:
co.elastic.apm.agent.report.ApmServerReporter - dropped events because of full queue: 305

[elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Error sending data to APM server: Read timed out, response code is -1

Could you provide info on how to monitor the APM Server/Java agent queue and ways to tune them to resolve the above errors? We are on version 8.4 and planning to upgrade to 8.6.1.

The error messages you are seeing indicate that the APM server is overloaded, as you said.
You should start by monitoring the APM server (Monitor APM Server | APM User Guide [8.6] | Elastic) and either scale your deployment vertically or horizontally, or reduce the amount of data generated on the agent side via sampling.

Adding to what @Jonas_Kunz wrote, reducing the amount of data produced by the agents may be crucial, so in addition to sampling, check your configuration and make sure you did not set any options that may cause this. For example, the trace_methods config may create large amounts of spans if it is set to capture too many methods, or methods that are executed very frequently.
In addition, span_min_duration and exit_span_min_duration can reduce the number of captured and sent spans by discarding the very fast ones.
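For illustration, here is a minimal sketch of how these agent options could be set in an elasticapm.properties file (the values are placeholders, not recommendations; the same options can also be passed as system properties or environment variables):

    # Sample only a fraction of transactions; 0.2 keeps roughly 20% of traces.
    transaction_sample_rate=0.2
    # Discard spans that complete faster than these thresholds instead of sending them.
    span_min_duration=5ms
    exit_span_min_duration=10ms
    # If trace_methods is used, keep the match list narrow so frequently executed
    # methods do not generate large numbers of spans (the pattern below is hypothetical).
    # trace_methods=com.example.myapp.controller.*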
Lastly, take a look at our short tuning guide, where there is a bit more info.

Thanks. Is there a document with parameters to tune APM Server on 8.x versions?
I see the below parameters in the legacy APM Server documentation:
  • apm-server.max_event_size
  • queue.mem.events
  • apm-server.read_timeout | apm-server.write_timeout
  • output.elasticsearch.worker
  • output.elasticsearch.bulk_max_size
  • output.elasticsearch.timeout

Could someone send information on how I can set different values for some of the properties defined in the apm-server.yml file?
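For reference, a rough sketch of how those settings are laid out in a standalone (legacy) apm-server.yml; the values and units below are placeholders rather than recommendations (check the version-specific reference for exact syntax), and they do not apply to a Fleet-managed APM integration:

    apm-server:
      max_event_size: 307200      # maximum size of a single intake event, in bytes
      read_timeout: 3600s         # timeouts for the intake HTTP server
      write_timeout: 30s
    queue.mem.events: 4096        # size of the internal event queue
    output.elasticsearch:
      worker: 2                   # concurrent bulk indexing workers
      bulk_max_size: 5120         # events per bulk request
      timeout: 90s                # Elasticsearch request timeout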

APM Server doesn't need any output tuning, and versions >= 8.6.0 ship with major performance improvements. These are most noticeable given a powerful enough machine (>= 6 cores/threads).

APM Server needs to be scaled in conjunction with Elasticsearch: make sure your Elasticsearch cluster is scaled out and up enough to handle the load created by APM Server. In the majority of cases, Elasticsearch isn't scaled up or out enough to process APM Server's throughput, so keep an eye on both Elasticsearch's and APM Server's CPU usage. Very long response times may be an indicator that Elasticsearch is overloaded and needs more resources, more machines, or indices tuned with a higher number of shards (generally one shard per Elasticsearch node). By default the APM indices use a single shard.
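If you do need more primary shards for the APM data streams, one possible approach on 8.x is to override the setting via the integration's @custom component template and then roll the data stream over. This is only a sketch: it assumes the default "default" namespace, and the exact template and data stream names can differ between versions.

    PUT _component_template/traces-apm@custom
    {
      "template": {
        "settings": {
          "index.number_of_shards": 3
        }
      }
    }

    POST traces-apm-default/_rollover

The new shard count only applies to indices created after the rollover.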

My colleagues have pointed you to the agent specific configuration that can be tuned to reduce the amount of data sent to the APM Server.

Thank you for the detailed explanation. Is there any prebuilt dashboard to proactively monitor the health of APM Server/Elasticsearch and find issues with bulk processing/scaling?

There isn't one as far as I know. However, you can set up Stack Monitoring and use it to monitor both Elasticsearch and APM Server metrics: Stack Monitoring | Kibana Guide [8.6] | Elastic.
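If your APM Server runs in standalone (legacy) mode, a minimal sketch of enabling its self-monitoring in apm-server.yml looks like this (Metricbeat-based collection is the alternative, and a Fleet-managed APM Server is monitored differently):

    # apm-server.yml -- sketch, not a complete configuration
    monitoring.enabled: true
    # By default the monitoring data is shipped to the same Elasticsearch cluster
    # configured under output.elasticsearch; a dedicated monitoring cluster can be
    # pointed to via monitoring.elasticsearch.hosts.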

Thanks, Marc. After adding the APM agent tuning config changes and increasing the Elasticsearch data/ingest instances and APM Server instances, we are still seeing the following error (we are on version 8.6.1):

2023-02-23 16:18:38,959 [elastic-apm-server-reporter] INFO co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Backing off for 4 seconds (+/-10%)
2023-02-23 16:18:38,960 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Error sending data to APM server: Read timed out, response code is -1
2023-02-23 16:18:38,960 [elastic-apm-server-reporter] WARN co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - null

It is really hard and nearly impossible for me to give any further guidance or suggestions without knowing any of the specifics of your setup. Could you please provide detailed metrics for:

  • APM & Elasticsearch CPU utilization.
  • APM & Elasticsearch sizing and topology (CPU, RAM).

You can obtain these metrics by enabling stack monitoring and collecting screenshots of at least a 12-24h time window.

We have the following topology:

  • APM Server: 5 instances in Kubernetes, 2 CPU request and no limit; CPU usage is below 60% during the read timeout errors
  • Elasticsearch data nodes: 6 instances in Kubernetes, 2 CPU request and no limit; CPU usage is below 40% during the read timeout errors
  • Elasticsearch ingest nodes: 3 instances in Kubernetes, 2 CPU request and no limit; CPU usage is below 80% during the read timeout errors
  • Elasticsearch master nodes: 3 instances in Kubernetes, 1 CPU request and no limit; CPU usage is below 20% during the read timeout errors

Do I need to update any of the tuning params for APM Server, since I see read timeout and queue-full errors in the APM agent logs?

Thanks for sharing more details. From what I can glean, all the pods are running without a CPU limit; can you share details on how you are calculating the CPU usage? Also, I would suggest looking at the Kubernetes nodes' CPU utilization to make sure the nodes are not overwhelmed.
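For a quick check of node-level pressure, something like the following works (assuming the cluster has metrics-server installed; <namespace> is a placeholder):

    # Requires the Kubernetes metrics-server add-on.
    kubectl top nodes
    kubectl top pods --containers -n <namespace>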

It shouldn't be needed. If I understand correctly, the APM Server version is still 8.4(?). If so, can you upgrade it to the latest 8.6.x version? As Marc mentioned earlier, APM Server versions >= 8.6.0 ship with major performance improvements and autoscaling of internal indexers.

This topic was automatically closed 20 days after the last reply. New replies are no longer allowed.