After deploying an Elastic cluster for tracing, we see the following errors under heavy workload in the APM server/agents:
```
co.elastic.apm.agent.report.ApmServerReporter - dropped events because of full queue: 305
[elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Error sending data to APM server: Read timed out, response code is -1
```
Could you provide info on how to monitor the APM server/Java agent queue, and ways to tune them to resolve the above errors? We are on version 8.4 and planning to upgrade to 8.6.1.
The error messages you are seeing indicate that the APM server is overloaded, as you said.
You should start by monitoring the APM server (see "Monitor APM Server" in the APM User Guide [8.6]) and either scale your deployment vertically or horizontally, or reduce the amount of data generated on the agent side via sampling.
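For example, on the Java agent the sampling rate is controlled by the `transaction_sample_rate` option. A minimal sketch; the 0.2 value is purely illustrative and should be tuned to your traffic:

```properties
# elasticapm.properties (or pass as -Delastic.apm.transaction_sample_rate=0.2)
# Keep roughly 20% of traces; unsampled transactions are still counted,
# but their spans are not captured, which reduces load significantly.
transaction_sample_rate=0.2
```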
Adding to what @Jonas_Kunz wrote, reducing the amount of data produced by the agents may be crucial. In addition to sampling, check your configuration and make sure you did not set any options that could cause this. For example, the trace_methods config can create large numbers of spans if it is set to capture too many methods, or methods that are executed very frequently.
In addition, span_min_duration and exit_span_min_duration can reduce the number of captured and sent spans by discarding the very fast ones.
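As a sketch of what that tuning could look like in the Java agent's config file; the duration values below are illustrative, not recommendations:

```properties
# elasticapm.properties -- illustrative values, tune per workload
# Drop spans that complete faster than 5 ms
span_min_duration=5ms
# Drop very fast exit spans (e.g. quick cache lookups); discarded
# exit spans are still reflected in the parent's span count
exit_span_min_duration=10ms
# If trace_methods was set to something broad (hypothetical example below),
# narrow it or remove it entirely:
# trace_methods=com.example.service.OrderService#processOrder
```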
Lastly, take a look at our short tuning guide, which has a bit more info.
Thanks. Is there a document with parameters to tune the APM server on 8.x versions?
I see the below parameters in the legacy APM Server documentation:
`apm-server.read_timeout` | `apm-server.write_timeout`
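In legacy mode those would be set in `apm-server.yml`; a sketch, with illustrative values rather than recommended ones (as noted below, this tuning is generally not needed on 8.x):

```yaml
# apm-server.yml (legacy mode only) -- values here are illustrative
apm-server:
  read_timeout: 60s
  write_timeout: 60s
```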
APM Server doesn't need any output tuning, and versions >= 8.6.0 ship with major performance improvements. These are most noticeable on a powerful enough machine (>= 6 cores/threads).
APM Server needs to be scaled in conjunction with Elasticsearch: make sure Elasticsearch is scaled out and up enough to handle the load created by APM Server. In the majority of cases Elasticsearch isn't, so keep an eye on both Elasticsearch's and APM Server's CPU usage. Very long response times may indicate that Elasticsearch is overloaded and needs more resources, more machines, or indices tuned with a higher number of shards (generally one shard per Elasticsearch node). By default the APM indices use a single shard.
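If shard count turns out to be the bottleneck, one way to raise it for new APM backing indices is via the `@custom` component template mechanism for data streams. A sketch for the Kibana Dev Tools console; the template name and shard count are assumptions to verify against your deployment's actual template names:

```
PUT _component_template/traces-apm@custom
{
  "template": {
    "settings": {
      "index": {
        "number_of_shards": 3
      }
    }
  }
}
```

Changes take effect when the data stream rolls over to a new backing index.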
My colleagues have pointed you to the agent specific configuration that can be tuned to reduce the amount of data sent to the APM Server.
- APM Server: 5 instances in Kubernetes, 2 CPU request and no limit; CPU usage below 60% during the read timeout errors
- Elasticsearch data nodes: 6 instances in Kubernetes, 2 CPU request and no limit; CPU usage below 40% during the read timeout errors
- Elasticsearch ingest nodes: 3 instances in Kubernetes, 2 CPU request and no limit; CPU usage below 80% during the read timeout errors
- Elasticsearch master nodes: 3 instances in Kubernetes, 1 CPU request and no limit; CPU usage below 20% during the read timeout errors
Do I need to update any of the tuning params for the APM server, as I see the read timeout and queue full errors in the APM agent logs?
Thanks for sharing more details. From what I can glean, all the pods are running without a CPU limit; can you share details on how you are calculating the CPU usage? I would also suggest looking at the Kubernetes nodes' CPU utilization to make sure the nodes themselves are not overwhelmed.
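For a quick check, assuming metrics-server is installed in the cluster (the namespace placeholder is yours to fill in):

```
# Node-level CPU/memory utilization
kubectl top nodes

# Per-container usage for the APM Server pods
kubectl top pods -n <apm-namespace> --containers
```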
It shouldn't be needed. If I understand correctly, the APM Server version is still 8.4(?). If so, can you upgrade it to the latest 8.6.x version? As Marc mentioned earlier, APM Server versions >= 8.6.0 ship with major performance improvements and autoscaling of internal indexers.