Peridiocally Java APM Agent experiences errors with connection to APM server

Kibana version: v8.6.2

Elasticsearch version:
v8.6.2

APM Server version:
v8.6.2

APM Agent language and version: Java APM Agent 1.34.1

Description of the problem including expected versus actual behavior. Please include screenshots (if relevant): Periodically Java APM agents in our three environments log errors of connection timeout with APM Server. The spikes of these erros happen more or less at the same time in all three environments. We are using Elastic Stack Cloud.

Provide logs and/or server output (if relevant):
Most common error is this:

2023-06-09 05:41:26,046 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Error sending data to APM server: Read timed out, response code is -1

The DEBUG mode showed this:

02:41:26.032
2023-06-09 05:41:26,032 [reactor-http-epoll-1] DEBUG co.elastic.apm.agent.impl.transaction.AbstractSpan - decrement references to 'ReadOperationHandler#handle' 00-d6edea5411e1e70cfebb337f216f8fb5-28a11d3ff04bf607-01 (5e254f40) (7)
02:41:26.032
2023-06-09 05:41:26,032 [reactor-http-epoll-1] DEBUG co.elastic.apm.agent.impl.ActiveStack - Deactivating 'ReadOperationHandler#handle' 00-d6edea5411e1e70cfebb337f216f8fb5-28a11d3ff04bf607-01 (5e254f40) on thread 45
02:41:26.046
2023-06-09 05:41:26,046 [elastic-apm-server-reporter] INFO  co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Backing off for 0 seconds (+/-10%)
02:41:26.046
2023-06-09 05:41:26,046 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Error sending data to APM server: Read timed out, response code is -1
02:41:26.046
2023-06-09 05:41:26,046 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Sending payload to APM server failed
java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method) ~[?:?]
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:115) ~[?:?]
	at java.net.SocketInputStream.read(SocketInputStream.java:168) ~[?:?]
	at java.net.SocketInputStream.read(SocketInputStream.java:140) ~[?:?]
	at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:478) ~[?:?]
	at sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:472) ~[?:?]
	at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:70) ~[?:?]
	at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1374) ~[?:?]
	at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:985) ~[?:?]
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:252) ~[?:?]
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:292) ~[?:?]
	at java.io.BufferedInputStream.read(BufferedInputStream.java:351) ~[?:?]
	at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:754) ~[?:?]
	at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:689) ~[?:?]
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1615) ~[?:?]
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1520) ~[?:?]
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:250) ~[?:?]
	at co.elastic.apm.agent.report.AbstractIntakeApiHandler.endRequest(AbstractIntakeApiHandler.java:175) [elastic-apm-agent-1.34.1.jar:1.34.1]
	at co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.onEvent(IntakeV2ReportingEventHandler.java:68) [elastic-apm-agent-1.34.1.jar:1.34.1]
	at co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.onEvent(IntakeV2ReportingEventHandler.java:36) [elastic-apm-agent-1.34.1.jar:1.34.1]
	at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:168) [elastic-apm-agent-1.34.1.jar:1.34.1]
	at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125) [elastic-apm-agent-1.34.1.jar:1.34.1]
	at java.lang.Thread.run(Thread.java:829) [?:?]
02:41:26.046
2023-06-09 05:41:26,046 [elastic-apm-server-reporter] WARN  co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - null
02:41:26.047
2023-06-09 05:41:26,046 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Starting new request to https://******.apm.southamerica-east1.gcp.elastic-cloud.com:443/intake/v2/events
02:41:26.055
2023-06-09 05:41:26,055 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Scheduling request timeout in 10s
02:41:26.056
2023-06-09 05:41:26,056 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.impl.transaction.AbstractSpan - decrement references to 'ReadOperationHandler#handle' 00-74155f2ee1bd973376ed1f6ae8550aee-c824769e7264f14f-01 (2c2e1bf3) (4)
02:41:26.056
2023-06-09 05:41:26,056 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Receiving WAKEUP event (sequence 151368)
02:41:26.056
2023-06-09 05:41:26,056 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Receiving JSON_WRITER event (sequence 151369)

There are also errors like this:

15:39:20.033
2023-06-08 18:39:20,033 [elastic-apm-remote-config-poller] INFO  co.elastic.apm.agent.report.ssl.TLSFallbackSSLSocket - APM Server identity could not be verified
15:39:20.033
2023-06-08 18:39:20,033 [elastic-apm-remote-config-poller] INFO  co.elastic.apm.agent.report.ssl.TLSFallbackSSLSocket - Local certificates: null
15:39:20.034
2023-06-08 18:39:20,034 [elastic-apm-remote-config-poller] DEBUG co.elastic.apm.agent.report.ApmServerClient - Exception while interacting with APM Server, trying next one.
15:39:20.037
2023-06-08 18:39:20,037 [elastic-apm-remote-config-poller] ERROR co.elastic.apm.agent.configuration.ApmServerConfigurationSource - connect timed out
15:39:20.037
2023-06-08 18:39:20,037 [elastic-apm-remote-config-poller] DEBUG co.elastic.apm.agent.configuration.ApmServerConfigurationSource - Scheduling next remote configuration reload in 300s
15:39:24.392
2023-06-08 18:39:24,392 [elastic-apm-shared] DEBUG co.elastic.apm.agent.micrometer.MicrometerMetricsReporter - Reporting 73 meters
15:39:24.395
2023-06-08 18:39:24,395 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Receiving JSON_WRITER event (sequence 98395)
15:39:24.396
2023-06-08 18:39:24,396 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Starting new request to https://*****.apm.southamerica-east1.gcp.elastic-cloud.com:443/intake/v2/events

And this:

22:19:26.032
2023-06-08 01:19:26,032 [reactor-http-epoll-1] DEBUG co.elastic.apm.agent.impl.transaction.AbstractSpan - decrement references to 'ReadOperationHandler#handle' 00-9d64486a83c4b3927025e6cce9ce0568-dadd4ebda66182c0-01 (35309251) (7)
22:19:26.032
2023-06-08 01:19:26,032 [reactor-http-epoll-1] DEBUG co.elastic.apm.agent.impl.ActiveStack - Deactivating 'ReadOperationHandler#handle' 00-9d64486a83c4b3927025e6cce9ce0568-dadd4ebda66182c0-01 (35309251) on thread 45
22:19:26.032
2023-06-08 01:19:26,032 [reactor-http-epoll-1] DEBUG co.elastic.apm.agent.impl.transaction.AbstractSpan - decrement references to 'ReadOperationHandler#handle' 00-9d64486a83c4b3927025e6cce9ce0568-dadd4ebda66182c0-01 (35309251) (6)
22:19:29.412
2023-06-08 01:19:29,412 [elastic-apm-server-reporter] INFO  co.elastic.apm.agent.report.ssl.TLSFallbackSSLSocket - APM Server identity could not be verified
22:19:29.412
2023-06-08 01:19:29,412 [elastic-apm-server-reporter] INFO  co.elastic.apm.agent.report.ssl.TLSFallbackSSLSocket - Local certificates: null
22:19:29.412
2023-06-08 01:19:29,412 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Error trying to connect to APM Server at https://*****.apm.southamerica-east1.gcp.elastic-cloud.com:443/intake/v2/events. Although not necessarily related to SSL, some related SSL configurations corresponding the current connection are logged at INFO level.
22:19:29.412
2023-06-08 01:19:29,412 [elastic-apm-server-reporter] INFO  co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Default cipher suites: [TLS_AES_256_GCM_SHA384, TLS_AES_128_GCM_SHA256, TLS_CHACHA20_POLY1305_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256, TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256, TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256, TLS_DHE_RSA_WITH_AES_256_GCM_SHA384, TLS_DHE_RSA_WITH_CHACHA20_POLY1305_SHA256, TLS_DHE_DSS_WITH_AES_256_GCM_SHA384, TLS_DHE_RSA_WITH_AES_128_GCM_SHA256, TLS_DHE_DSS_WITH_AES_128_GCM_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256, TLS_DHE_RSA_WITH_AES_256_CBC_SHA256, TLS_DHE_DSS_WITH_AES_256_CBC_SHA256, TLS_DHE_RSA_WITH_AES_128_CBC_SHA256, TLS_DHE_DSS_WITH_AES_128_CBC_SHA256, TLS_ECDH_ECDSA_WITH_AES_256_GCM_SHA384, TLS_ECDH_RSA_WITH_AES_256_GCM_SHA384, TLS_ECDH_ECDSA_WITH_AES_128_GCM_SHA256, TLS_ECDH_RSA_WITH_AES_128_GCM_SHA256, TLS_ECDH_ECDSA_WITH_AES_256_CBC_SHA384, TLS_ECDH_RSA_WITH_AES_256_CBC_SHA384, TLS_ECDH_ECDSA_WITH_AES_128_CBC_SHA256, TLS_ECDH_RSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA, TLS_DHE_RSA_WITH_AES_256_CBC_SHA, TLS_DHE_DSS_WITH_AES_256_CBC_SHA, TLS_DHE_RSA_WITH_AES_128_CBC_SHA, TLS_DHE_DSS_WITH_AES_128_CBC_SHA, TLS_ECDH_ECDSA_WITH_AES_256_CBC_SHA, TLS_ECDH_RSA_WITH_AES_256_CBC_SHA, TLS_ECDH_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDH_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_256_GCM_SHA384, TLS_RSA_WITH_AES_128_GCM_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA256, TLS_RSA_WITH_AES_128_CBC_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_AES_128_CBC_SHA, TLS_EMPTY_RENEGOTIATION_INFO_SCSV]
22:19:29.412
2023-06-08 01:19:29,412 [elastic-apm-server-reporter] INFO  co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Supported cipher suites: [TLS_AES_256_GCM_SHA384, TLS_AES_128_GCM_SHA256, TLS_CHACHA20_POLY1305_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256, TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256, TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384, TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256, TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256, TLS_DHE_RSA_WITH_AES_256_GCM_SHA384, TLS_DHE_RSA_WITH_CHACHA20_POLY1305_SHA256, TLS_DHE_DSS_WITH_AES_256_GCM_SHA384, TLS_DHE_RSA_WITH_AES_128_GCM_SHA256, TLS_DHE_DSS_WITH_AES_128_GCM_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256, TLS_DHE_RSA_WITH_AES_256_CBC_SHA256, TLS_DHE_DSS_WITH_AES_256_CBC_SHA256, TLS_DHE_RSA_WITH_AES_128_CBC_SHA256, TLS_DHE_DSS_WITH_AES_128_CBC_SHA256, TLS_ECDH_ECDSA_WITH_AES_256_GCM_SHA384, TLS_ECDH_RSA_WITH_AES_256_GCM_SHA384, TLS_ECDH_ECDSA_WITH_AES_128_GCM_SHA256, TLS_ECDH_RSA_WITH_AES_128_GCM_SHA256, TLS_ECDH_ECDSA_WITH_AES_256_CBC_SHA384, TLS_ECDH_RSA_WITH_AES_256_CBC_SHA384, TLS_ECDH_ECDSA_WITH_AES_128_CBC_SHA256, TLS_ECDH_RSA_WITH_AES_128_CBC_SHA256, TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA, TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA, TLS_DHE_RSA_WITH_AES_256_CBC_SHA, TLS_DHE_DSS_WITH_AES_256_CBC_SHA, TLS_DHE_RSA_WITH_AES_128_CBC_SHA, TLS_DHE_DSS_WITH_AES_128_CBC_SHA, TLS_ECDH_ECDSA_WITH_AES_256_CBC_SHA, TLS_ECDH_RSA_WITH_AES_256_CBC_SHA, TLS_ECDH_ECDSA_WITH_AES_128_CBC_SHA, TLS_ECDH_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_256_GCM_SHA384, TLS_RSA_WITH_AES_128_GCM_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA256, TLS_RSA_WITH_AES_128_CBC_SHA256, TLS_RSA_WITH_AES_256_CBC_SHA, TLS_RSA_WITH_AES_128_CBC_SHA, TLS_EMPTY_RENEGOTIATION_INFO_SCSV]
22:19:29.412
2023-06-08 01:19:29,412 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Failed to handle event of type JSON_WRITER with this error: connect timed out
22:19:29.414
2023-06-08 01:19:29,412 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Event handling failure
java.net.SocketTimeoutException: connect timed out
	at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:?]
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:399) ~[?:?]
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:242) ~[?:?]
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:224) ~[?:?]
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:?]
	at java.net.Socket.connect(Socket.java:609) ~[?:?]
	at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:299) ~[?:?]
	at co.elastic.apm.agent.report.ssl.TLSFallbackSSLSocket.connect(TLSFallbackSSLSocket.java:219) ~[elastic-apm-agent-1.34.1.jar:1.34.1]
	at sun.net.NetworkClient.doConnect(NetworkClient.java:177) ~[?:?]
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:474) ~[?:?]
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:569) ~[?:?]
	at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:266) ~[?:?]
	at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:373) ~[?:?]
	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:203) ~[?:?]
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1187) ~[?:?]
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1081) ~[?:?]
	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:189) ~[?:?]
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:168) ~[?:?]
	at co.elastic.apm.agent.report.AbstractIntakeApiHandler.startRequest(AbstractIntakeApiHandler.java:113) ~[elastic-apm-agent-1.34.1.jar:1.34.1]
	at co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.startRequest(IntakeV2ReportingEventHandler.java:170) [elastic-apm-agent-1.34.1.jar:1.34.1]
	at co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.handleIntakeEvent(IntakeV2ReportingEventHandler.java:117) [elastic-apm-agent-1.34.1.jar:1.34.1]
	at co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.dispatchEvent(IntakeV2ReportingEventHandler.java:103) [elastic-apm-agent-1.34.1.jar:1.34.1]
	at co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.onEvent(IntakeV2ReportingEventHandler.java:70) [elastic-apm-agent-1.34.1.jar:1.34.1]
	at co.elastic.apm.agent.report.IntakeV2ReportingEventHandler.onEvent(IntakeV2ReportingEventHandler.java:36) [elastic-apm-agent-1.34.1.jar:1.34.1]
	at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:168) [elastic-apm-agent-1.34.1.jar:1.34.1]
	at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125) [elastic-apm-agent-1.34.1.jar:1.34.1]
	at java.lang.Thread.run(Thread.java:829) [?:?]
22:19:29.414
2023-06-08 01:19:29,414 [elastic-apm-server-reporter] INFO  co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Backing off for 0 seconds (+/-10%)
22:19:29.414
2023-06-08 01:19:29,414 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Receiving JSON_WRITER event (sequence 15172)
22:19:29.415
2023-06-08 01:19:29,414 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Starting new request to https://apm-caa.apm.southamerica-east1.gcp.elastic-cloud.com:443/intake/v2/events
22:19:29.426
2023-06-08 01:19:29,426 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Receiving JSON_WRITER event (sequence 15177)
22:19:29.426
2023-06-08 01:19:29,426 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Receiving JSON_WRITER event (sequence 15178)
22:19:29.426
2023-06-08 01:19:29,426 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Receiving JSON_WRITER event (sequence 15179)

Two different environments:

APM Server monitoring shows this:
First node:

Second node:

Looks like either a networking issue or that the apm-server maybe can't keep up. Do you have apm-server logs from the time periods where the errors in the agents appear?

Is there a way to check the apm-server logs on Elastic Cloud if we are shipping logs and metrics to our deployment? But at least looking at the graphics it doesn't look like that the APM server can't keep up.

All 3 envs are presumably going to different clusters, and if so and they're experiencing connection errors at roughly the same time, it points at network or proxy issues. Graph resolution may not be fine enough for temporary network spikes/outages. I'm not a network expert, so not sure how best to monitor for issues. I've seen some benefit from monitoring tcp retransmissions, if that bumps up, then it's network congestion

Unfortunately all 3 cluster are going to the same ELK cluster.
I made this visualization that shows occurences in all three envs.

So if they are all against the same cluster, and you get the errors across envs, that implies that the cluster is indeed overloaded at those points. This isn't really a Java agent issue, I can't help further (I'm only on the agent side) but maybe someone on the server-side can weigh in

Yes, it seems so but strangely the graphs of system load of our APM instances don't show anything special. It even looks like we don't need to have two APM instances and just one would be enough.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.