Elastic APM High Response Errors Rate

Kibana version:
Elastic Cloud - 7.2.0

Elasticsearch version:
Elastic Cloud - 7.2.0

APM Server version:
Elastic Cloud - 7.2.0

APM Agent language and version:
Ruby / Rails - elastic-apm (= 2.7.0)

Browser version:

Original install method (e.g. download page, yum, deb, from source, etc.) and version:

Fresh install or upgraded from other version?
Upgraded

Is there anything special in your setup? For example, are you using the Logstash or Kafka outputs? Are you using a load balancer in front of the APM Servers? Have you changed index pattern, generated custom templates, changed agent configuration etc.

Currently trying this APM Server configuration:

output.elasticsearch.bulk_max_size: 5000
output.elasticsearch.worker: 8
queue.mem.events: 40000

Description of the problem including expected versus actual behavior. Please include screenshots (if relevant):
Our APM Server is struggling to keep up with ingestion, but I can't pinpoint exactly why, as the APM Server itself and the Elasticsearch nodes seem to have enough resources. Here is the problematic behavior we are seeing:

When we turned on APM collection on the instances, everything was fine for the first hour or so and all samples were processed correctly. After a certain point, the APM Server starts rejecting samples with "HTTP requests rejected due to internal queue filling up". I've been experimenting with the configuration above but haven't had any success yet.
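To put rough numbers on this, here is a back-of-envelope sketch in Python of how long a bounded queue takes to fill when intake only slightly exceeds what the output drains. The function names and the intake, drain, and latency figures are made up for illustration, not measurements from our cluster. The drain-rate bound assumes each output worker sends one bulk request at a time, which is a simplification and not a description of how APM Server actually schedules work.

```python
def seconds_until_queue_full(queue_events, intake_per_sec, drain_per_sec):
    """Seconds until a bounded queue fills, or None if it never fills.

    queue_events corresponds to queue.mem.events; the two rates are
    hypothetical events-per-second figures.
    """
    net_growth = intake_per_sec - drain_per_sec
    if net_growth <= 0:
        return None  # the output keeps up, so the queue never fills
    return queue_events / net_growth


def max_drain_rate(workers, bulk_max_size, bulk_latency_sec):
    """Upper bound on events/sec the output can send.

    Assumes (simplification) that every worker always sends a full batch
    and waits bulk_latency_sec for each bulk request to complete.
    """
    return workers * bulk_max_size / bulk_latency_sec


if __name__ == "__main__":
    # Values from the configuration above: 40000 queued events,
    # 8 workers, 5000 events per bulk request.
    # 2000 and 1990 events/sec are illustrative rates only.
    print(seconds_until_queue_full(40000, 2000, 1990))
    # 2.0 seconds per bulk request is an illustrative latency only.
    print(max_drain_rate(8, 5000, 2.0))
```

With these illustrative rates the queue only grows by 10 events per second, so it takes 4000 seconds to fill. A small, sustained gap between intake and drain would therefore look healthy for a long time before rejections start, which is at least consistent with what we observe.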

Since the queue is filling up, I've been investigating whether the Elasticsearch cluster can keep up with the indexing rate requested by APM, but so far I haven't found anything that demonstrates a bottleneck there, as the nodes still seem to have enough resources. I can certainly see the Elasticsearch indexing rate decreasing, which correlates with the APM Server error events, but rather than Elasticsearch being unable to keep up, it looks more like the indexing rate decreased because the APM Server stopped sending samples to Elasticsearch:

From the Ruby APM agent side of things I've been getting a lot of error messages saying either:

  • [ElasticAPM]: APM Server not responding in time, terminating request

    OR

  • [ElasticAPM]: APM Server not responding in time, terminating request

The latter I can understand, as I can see from the server side that the queues are full, but the former makes no sense to me because the APM Server seems to be doing just fine, as per the picture below:

Small CPU footprint, more than enough available memory (2GB x 2 Availability Zones), etc...

Judging by the Ruby APM agent logs, the APM Server is not doing well, but I can't understand or corroborate that with any of the graphs provided by X-Pack monitoring.

Does anyone have any tips on understanding what's going on here?

Errors in browser console (if relevant):

Provide logs and/or server output (if relevant):

  • [ElasticAPM]: APM Server not responding in time, terminating request
  • [ElasticAPM]: APM Server not responding in time, terminating request
