After deploying an Elastic cluster for tracing, we see the following errors under heavy workload in the APM server/agents:
```
co.elastic.apm.agent.report.ApmServerReporter - dropped events because of full queue: 305
[elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Error sending data to APM server: Read timed out, response code is -1
```
Could you provide info on how to monitor the APM server/Java agent queue, and ways to tune them to resolve the above errors? We are on version 8.4 and planning to upgrade to 8.6.1.
The error messages you are seeing indicate that the APM server is overloaded, as you said.
You should start by monitoring the APM server (see "Monitor APM Server" in the APM User Guide [8.6]) and either scale your deployment vertically or horizontally, or reduce the amount of data generated on the agent side via sampling.
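For example, on the Java agent the sampling rate is controlled by the `transaction_sample_rate` option. A minimal sketch; the 0.2 value is purely illustrative and should be tuned to your traffic:

```properties
# elasticapm.properties (or pass as -Delastic.apm.transaction_sample_rate=0.2)
# Keep roughly 20% of traces; unsampled transactions are still counted,
# but their spans are not captured, which reduces load significantly.
transaction_sample_rate=0.2
```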
Adding to what @Jonas_Kunz wrote, reducing the amount of data produced by the agents may be crucial. In addition to sampling, check your configuration and make sure you did not set any options that could cause this. For example, the trace_methods config can create large numbers of spans if it is set to capture too many methods, or methods that are executed very frequently.
In addition, span_min_duration and exit_span_min_duration can reduce the number of captured and sent spans by discarding the very fast ones.
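As a sketch of what that tuning could look like in the Java agent's config file; the duration values below are illustrative, not recommendations:

```properties
# elasticapm.properties -- illustrative values, tune per workload
# Drop spans that complete faster than 5 ms
span_min_duration=5ms
# Drop very fast exit spans (e.g. quick cache lookups); discarded
# exit spans are still reflected in the parent's span count
exit_span_min_duration=10ms
# If trace_methods was set to something broad (hypothetical example below),
# narrow it or remove it entirely:
# trace_methods=com.example.service.OrderService#processOrder
```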
Lastly, take a look at our short tuning guide, which has a bit more info.
Thanks. Is there a document with parameters to tune the APM server on 8.x versions?
I see the below parameters in the legacy APM Server documentation:
`apm-server.read_timeout` | `apm-server.write_timeout`
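In legacy mode those would be set in `apm-server.yml`; a sketch, with illustrative values rather than recommended ones (as noted below, this tuning is generally not needed on 8.x):

```yaml
# apm-server.yml (legacy mode only) -- values here are illustrative
apm-server:
  read_timeout: 60s
  write_timeout: 60s
```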
APM Server doesn't need any output tuning, and versions >= 8.6.0 ship with major performance improvements. These are most noticeable on a powerful enough machine (>= 6 cores/threads).
APM Server needs to be scaled in conjunction with Elasticsearch: make sure Elasticsearch is scaled out and up enough to handle the load created by APM Server. In the majority of cases Elasticsearch isn't, so keep an eye on both Elasticsearch's and APM Server's CPU usage. Very long response times may indicate that Elasticsearch is overloaded and needs more resources, more machines, or indices tuned with a higher number of shards (generally one shard per Elasticsearch node). By default the APM indices use a single shard.
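If shard count turns out to be the bottleneck, one way to raise it for new APM backing indices is via the `@custom` component template mechanism for data streams. A sketch for the Kibana Dev Tools console; the template name and shard count are assumptions to verify against your deployment's actual template names:

```
PUT _component_template/traces-apm@custom
{
  "template": {
    "settings": {
      "index": {
        "number_of_shards": 3
      }
    }
  }
}
```

Changes take effect when the data stream rolls over to a new backing index.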
My colleagues have pointed you to the agent specific configuration that can be tuned to reduce the amount of data sent to the APM Server.
- APM Server: 5 instances in Kubernetes, 2 CPU request and no limit; CPU usage below 60% during the read timeout errors
- Elasticsearch data nodes: 6 instances in Kubernetes, 2 CPU request and no limit; CPU usage below 40% during the read timeout errors
- Elasticsearch ingest nodes: 3 instances in Kubernetes, 2 CPU request and no limit; CPU usage below 80% during the read timeout errors
- Elasticsearch master nodes: 3 instances in Kubernetes, 1 CPU request and no limit; CPU usage below 20% during the read timeout errors
Do I need to update any of the tuning params for the APM server, as I see the read timeout and queue full errors in the APM agent logs?
Thanks for sharing more details. From what I can glean, all the pods are running without a CPU limit; can you share details on how you are calculating the CPU usage? I would also suggest looking at the Kubernetes nodes' CPU utilization to make sure the nodes themselves are not overwhelmed.
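For a quick check, assuming metrics-server is installed in the cluster (the namespace placeholder is yours to fill in):

```
# Node-level CPU/memory utilization
kubectl top nodes

# Per-container usage for the APM Server pods
kubectl top pods -n <apm-namespace> --containers
```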
It shouldn't be needed. If I understand correctly, the APM Server version is still 8.4(?). If so, can you upgrade it to the latest 8.6.x version? As Marc mentioned earlier, APM Server versions >= 8.6.0 ship with major performance improvements and autoscaling of internal indexers.