APM-Server 503 request time out

elasticsearch cluster: 11 nodes (8 CPU/62GB RAM, i3.2xlarge instances in AWS) version 8.13.2
kibana: 1 node (4CPU, 8GB RAM) c5.xlarge instance in AWS, version 8.13.2
apm-server: 1 node (8CPU, 16GB RAM), c5.2xlarge instance in AWS, version 8.13.2
Agent:
APM jar version 1.50.0
java version 21

/etc/apm-server/apm-server.yml config

######################### APM Server Configuration #########################

################################ APM Server ################################

apm-server:
  # Defines the host and port the server is listening on. Use "unix:/path/to.sock" to listen on a unix domain socket.
  host: "IP_APM_SERVER_INSTANCE:8200"

  # Maximum permitted size in bytes of a request's header accepted by the server to be processed.
  max_header_size: 1048576

  # Maximum amount of time to wait for the next incoming request before underlying connection is closed.
  idle_timeout: 90s

  # Maximum permitted duration for reading an entire request.
  read_timeout: 300s

  # Maximum permitted duration for writing a response.
  write_timeout: 300s

  # Maximum permitted size in bytes of an event accepted by the server to be processed.
  max_event_size: 1457600
  
  #---------------------------- APM Server - RUM Real User Monitoring ----------------------------

  rum:
    enabled: true

    event_rate:
      limit: 300
      lru_size: 1000

    allow_origins: ['*']


  #---------------------------- APM Server - Agent Configuration ----------------------------

  kibana:
    # For APM Agent configuration in Kibana, enabled must be true.
    enabled: true
    host: "IP_KIBANA_INSTANCE:5601"

#================================= General =================================


queue:
  # Queue type by name (default 'mem').
  mem:
    # Max number of events the queue can buffer.
    events: 100000


#-------------------------- Elasticsearch output --------------------------
output.elasticsearch:
  # Array of hosts to connect to.
  # Scheme and port can be left out and will be set to the default (`http` and `9200`).
  # In case you specify and additional path, the scheme is required: `http://localhost:9200/path`.
  # IPv6 addresses should always be defined as: `https://[2001:db8::1]:9200`.
  hosts: ["DNS_OF_THE_ELASTICSEARCH_ALB:9200"]

  # Boolean flag to enable or disable the output module.
  enabled: true

  # Number of workers per Elasticsearch host.
  worker: 8

  # The maximum number of events to bulk in a single Elasticsearch bulk API index request.
  # The default is 50.
  bulk_max_size: 80000

  # Configure http request timeout before failing an request to Elasticsearch.
  timeout: 300

  
#================================= Paths ==================================

# The home path for the apm-server installation. This is the default base path
# for all other path settings and for miscellaneous files that come with the
# distribution.
# If not set by a CLI flag or in the configuration file, the default for the
# home path is the location of the binary.
path.home: /usr/share/apm-server

# The configuration path for the apm-server installation. This is the default
# base path for configuration files, including the main YAML configuration file
# and the Elasticsearch template file. If not set by a CLI flag or in the
# configuration file, the default for the configuration path is the home path.
path.config: /etc/apm-server

# The data path for the apm-server installation. This is the default base path
# for all the files in which apm-server needs to store its data. If not set by a
# CLI flag or in the configuration file, the default for the data path is a data
# subdirectory inside the home path.
path.data: /var/lib/apm-server

# The logs path for an apm-server installation. If not set by a CLI flag or in the
# configuration file, the default is a logs subdirectory inside the home path.
path.logs: /var/log/apm-server

#================================= Logging =================================

# There are three options for the log output: syslog, file, and stderr.
# Windows systems default to file output. All other systems default to syslog.

# Sets the minimum log level. The default log level is info.
# Available log levels are: error, warning, info, or debug.
logging.level: info

# Enable debug output for selected components. To enable all selectors use ["*"].
# Other available selectors are "beat", "publish", or "service".
# Multiple selectors can be chained.
#logging.selectors: [ ]

# Send all logging output to syslog. The default is false.
#logging.to_syslog: true

# If enabled, apm-server periodically logs its internal metrics that have changed
# in the last period. For each metric that changed, the delta from the value at
# the beginning of the period is logged. Also, the total values for
# all non-zero internal metrics are logged on shutdown. The default is false.
#logging.metrics.enabled: false

# The period after which to log the internal metrics. The default is 30s.
#logging.metrics.period: 30s

# Logging to rotating files. When true, writes all logging output to files.
# The log files are automatically rotated when the log file size limit is reached.
logging.to_files: true
logging.files:
  # Configure the path where the logs are written. The default is the logs directory
  # under the home path (the binary location).
  path: /var/log/apm-server

  # The name of the files where the logs are written to.
  name: apm-server

  # Configure log file size limit. If limit is reached, log file will be
  # automatically rotated.
  rotateeverybytes: 62914560 # = 60MB

  # Number of rotated log files to keep. Oldest files will be deleted first.
  keepfiles: 7

  # The permissions mask to apply when rotating log files. The default value is 0600.
  # Must be a valid Unix-style file permissions mask expressed in octal notation.
  permissions: 0600

  # Enable log file rotation on time intervals in addition to size-based rotation.
  # Intervals must be at least 1s. Values of 1m, 1h, 24h, 7*24h, 30*24h, and 365*24h
  # are boundary-aligned with minutes, hours, days, weeks, months, and years as
  # reported by the local system clock. All other intervals are calculated from the
  # Unix epoch. Defaults to disabled.
  #interval: 0

# Set to true to log messages in json format.
logging.json: true

# Set to true, to log messages with minimal required Elastic Common Schema (ECS)
# information. Recommended to use in combination with `logging.json=true`.
#logging.ecs: true

With these settings when a customer sends events he sees in the logs of the agent:

024-07-04 11:36:34,056 [elastic-apm-server-reporter] INFO  co.elastic.apm.agent.report.AbstractIntakeApiHandler - Backing off for 9 seconds (+/-10%)2024-07-04 11:36:43,592 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.AbstractIntakeApiHandler - Error sending data to APM server: Read timed out, response code is -12024-07-04 11:36:43,712 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.AbstractIntakeApiHandler - Error sending data to APM server: Read timed out, response code is -1

On the apm-server side, I see the following in the logs:

{"log.level":"error","@timestamp":"2024-07-08T15:12:35.230Z","log.logger":"request","log.origin":{"function":"github.com/elastic/apm-server/internal/beater/api.apmMiddleware.LogMiddleware.func1.1","file.name":"middleware/log_middleware.go","file.line":59},"message":"request timed out","service.name":"apm-server","url.original":"/intake/v2/events","http.request.method":"POST","user_agent.original":"apm-agent-java/1.50.0 (identity 1.0.0)","source.address":"source_address_ip","http.request.id":"43w4242-c434-42re-bree-b9c16b5719a7432f","event.duration":15005476018,"http.request.body.bytes":2320,"http.response.status_code":503,"error.message":"request timed out","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-07-08T15:12:35.252Z","log.logger":"request","log.origin":{"function":"github.com/elastic/apm-server/internal/beater/api.apmMiddleware.LogMiddleware.func1.1","file.name":"middleware/log_middleware.go","file.line":59},"message":"request timed out","service.name":"apm-server","url.original":"/intake/v2/events","http.request.method":"POST","user_agent.original":"apm-agent-java/1.50.0 (identity 1.0.0)","source.address":"source_address_ip","http.request.id":"request_id","event.duration":15007539970,"http.request.body.bytes":2310,"http.response.status_code":503,"error.message":"request timed out","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-07-08T15:12:36.359Z","log.logger":"request","log.origin":{"function":"github.com/elastic/apm-server/internal/beater/api.apmMiddleware.LogMiddleware.func1.1","file.name":"middleware/log_middleware.go","file.line":59},"message":"request timed out","service.name":"apm-server","url.original":"/intake/v2/events","http.request.method":"POST","user_agent.original":"apm-agent-java/1.50.0 (identity 1.0.0)","source.address":"source_address_ip","http.request.id":"request_id","event.duration":15012934832,"http.request.body.bytes":3129,"http.response.status_cod

Can you please help me fix this? Thank you!

Hi @Anca_Linca,

Thanks for reaching out and sorry to hear that you are experiencing an issue with the APM Server.

We've recently found an issue where the APM server issues too many small bulk requests due to misconfiguration, and In return, more bulk requests are made to maintain the original throughput. This was introduced in the version 8.13.0 which matches the version you use(8.13.2).

Here's the link to the known issue entry, you can find more details about this. Here is the suggested workaround to mitigate the issue that I've copied from the linked ticket:

  • In apm-server.yml, set output.elasticsearch.flush_bytes: 1mib.

Let us know if this helps, otherwise happy to assist further.

Thanks,
Kyungeun

Hi Kim,

Thank you for your reply.
I've added the setting you suggested.
I am still seeing the following errors in the apm-server logs:

{"log.level":"error","@timestamp":"2024-07-09T08:07:33.653Z","log.logger":"request","log.origin":{"function":"github.com/elastic/apm-server/internal/beater/api.apmMiddleware.LogMiddleware.func1.1","file.name":"middleware/log_middleware.go","file.line":59},"message":"request timed out","service.name":"apm-server","url.original":"/intake/v2/events","http.request.method":"POST","user_agent.original":"apm-agent-java/1.50.0 (identity-prod 1.0.0)","source.address":"source_ip_address","http.request.id":"a76005a2-4852-4699-85a0-932fccdeaa6e","event.duration":15008926658,"http.request.body.bytes":2084,"http.response.status_code":503,"error.message":"request timed out","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-07-09T08:07:34.460Z","log.logger":"request","log.origin":{"function":"github.com/elastic/apm-server/internal/beater/api.apmMiddleware.LogMiddleware.func1.1","file.name":"middleware/log_middleware.go","file.line":59},"message":"request timed out","service.name":"apm-server","url.original":"/intake/v2/events","http.request.method":"POST","user_agent.original":"apm-agent-java/1.50.0 (entitlement 1.0.0)","source.address":"source_ip_address","http.request.id":"deb311fb-df3c-49e9-a0a6-a0f5e56d5009","event.duration":15010621235,"http.request.body.bytes":4411,"http.response.status_code":503,"error.message":"request timed out","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2024-07-09T08:19:16.366Z","log.logger":"request","log.origin":{"function":"github.com/elastic/apm-server/internal/beater/api.apmMiddleware.LogMiddleware.func1.1","file.name":"middleware/log_middleware.go","file.line":62},"message":"not modified","service.name":"apm-server","url.original":"/config/v1/agents","http.request.method":"POST","user_agent.original":"apm-agent-java/1.50.0 (catalog 1.0.0)","source.address":"source_ip_address","http.request.id":"5dd55bf3-4892-44e0-82bb-b32c27812e76","event.duration":4905369,"http.request.body.bytes":856,"http.response.status_code":304,"ecs.version":"1.6.0"}

I can only see 503 errors now, no 200 or 202. From time to time I see these as well:

{"log.level":"info","@timestamp":"2024-07-09T08:21:06.586Z","log.logger":"request","log.origin":{"function":"github.com/elastic/apm-server/internal/beater/api.apmMiddleware.LogMiddleware.func1.1","file.name":"middleware/log_middleware.go","file.line":62},"message":"not modified","service.name":"apm-server","url.original":"/config/v1/agents","http.request.method":"POST","user_agent.original":"apm-agent-java/1.50.0 (identity-prod 1.0.0)","source.address":"source_ip_address","http.request.id":"9d09e60d-128f-45b4-978f-62b35ecad6d4","event.duration":6423703,"http.request.body.bytes":844,"http.response.status_code":304,"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-07-09T08:21:10.735Z","log.logger":"beater","log.origin":{"function":"github.com/elastic/apm-server/internal/beater.waitReady","file.name":"beater/waitready.go","file.line":62},"message":"precondition 'apm integration installed' failed: error querying Elasticsearch for integration index templates: unexpected HTTP status: 404 Not Found ({\"error\":{\"root_cause\":[{\"type\":\"resource_not_found_exception\",\"reason\":\"index template matching [metrics-apm.transaction.1m] not found\"}],\"type\":\"resource_not_found_exception\",\"reason\":\"index template matching [metrics-apm.transaction.1m] not found\"},\"status\":404}): to remediate, please install the apm integration: https://ela.st/apm-integration-quickstart","service.name":"apm-server","ecs.version":"1.6.0"}

I wanted to put all error message in case they are related.

On the agent side, the following error:

2024-07-09 08:08:48,422 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.AbstractIntakeApiHandler - Error sending data to APM server: Read timed out, response code is -1
2024-07-09 08:08:49,880 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.AbstractIntakeApiHandler - Error sending data to APM server: Read timed out, response code is -1

Thank you once more,
Anca

I rolled back to a version that I knew it worked: 7.9.1 and the errors disappeared. I still don't know what was the root cause of this, nor how to fix it. Seems odd that a previous version works just fine.

Hi @Anca_Linca ,

I can only see 503 errors now, no 200 or 202. From time to time I see these as well:

Were you able to see any APM events getting indexed in Elasticsearch? If no then do you see any issues (like high load) with the ES cluster?

{"log.level":"error","@timestamp":"2024-07-09T08:21:10.735Z","log.logger":"beater","log.origin":{"function":"github.com/elastic/apm-server/internal/beater.waitReady","file.name":"beater/waitready.go","file.line":62},"message":"precondition 'apm integration installed' failed: error querying Elasticsearch for integration index templates: unexpected HTTP status: 404 Not Found ({"error":{"root_cause":[{"type":"resource_not_found_exception","reason":"index template matching [metrics-apm.transaction.1m] not found"}],"type":"resource_not_found_exception","reason":"index template matching [metrics-apm.transaction.1m] not found"},"status":404}): to remediate, please install the apm integration: https://ela.st/apm-integration-quickstart","service.name":"apm-server","ecs.version":"1.6.0"}

I would like to understand the root cause of this error message too. Can you clarify how you were running your APM-Server when you upgraded to 8.x (Elastic stack, APM-Server binary, or APM integration)?