APM agent suddenly stopped sending data to APM server

Hi Team,

I have deployed an Elastic Cloud deployment along with APM Server and Integrations Server.
My Deployment version : 8.9.0
Kibana and integration servers with: 1GB RAM, up to 8.4vCPU

I have integrated the APM agent with 10 applications running on different nodes. All of them run fine continuously except one, which suddenly stopped sending data to the APM server. In the debug logs I found the error below.

2023-10-22 10:19:49,788 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Request flush because the request timeout occurred
2023-10-22 10:19:49,788 [elastic-apm-server-reporter] DEBUG co.elastic.apm.agent.report.AbstractIntakeApiHandler - Flushing 2925 uncompressed 954 compressed bytes
2023-10-22 10:19:50,460 [elastic-apm-configuration-reloader] DEBUG co.elastic.apm.agent.impl.ElasticApmTracerBuilder - Beginning scheduled configuration reload (interval is 30 sec)...
2023-10-22 10:19:50,460 [elastic-apm-configuration-reloader] DEBUG co.elastic.apm.agent.impl.ElasticApmTracerBuilder - Finished scheduled configuration reload
2023-10-22 10:19:54,790 [elastic-apm-server-reporter] WARN  co.elastic.apm.agent.report.AbstractIntakeApiHandler - Response body: null
2023-10-22 10:19:54,790 [elastic-apm-server-reporter] INFO  co.elastic.apm.agent.report.AbstractIntakeApiHandler - Backing off for 0 seconds (+/-10%)
2023-10-22 10:19:54,790 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.AbstractIntakeApiHandler - Error sending data to APM server: Read timed out, response code is -1

Can anyone please tell me what is wrong? I am not able to find anything related to this error and response code in the docs.

Hi @surya_dadi_dhamarake ,

The log message

2023-10-22 10:19:54,790 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.AbstractIntakeApiHandler - Error sending data to APM server: Read timed out, response code is -1

indicates that your application did not receive a response from the APM server. Network connectivity problems are very likely the root cause.
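To make the timing concrete, here is a minimal sketch that measures the gap between the "Flushing ..." line and the "Read timed out" error in the logs above. The class and method names are made up for illustration. The gap comes out at about 5 seconds, which would be consistent with the agent's `server_timeout` option if it was left at its default (5s as far as I know; please treat that as an assumption and check your own configuration).

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class LogGap {
    // Timestamp layout used by the agent log lines quoted in this thread.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Returns the number of milliseconds between two log timestamps.
    static long millisBetween(String earlier, String later) {
        LocalDateTime start = LocalDateTime.parse(earlier, FORMAT);
        LocalDateTime end = LocalDateTime.parse(later, FORMAT);
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        // "Flushing 2925 uncompressed ..." vs. "Error sending data to APM server"
        long gap = millisBetween("2023-10-22 10:19:49,788", "2023-10-22 10:19:54,790");
        System.out.println("Gap between flush and error: " + gap + " ms");
    }
}
```

If the gap tracks the configured timeout, the request was most likely sent but no response came back in time, rather than being rejected by the server.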

You can enable debug logging on your agent to rule out other possible root causes, such as a bad proxy configuration.
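One way to test the connectivity theory independently of the agent is to run a small probe on the affected host. The sketch below is only an illustration: the class name, the timeouts and the default URL (the placeholder from the logs in this thread) are assumptions, and it uses nothing but the JDK's `HttpURLConnection`. Any HTTP status, even 401 or 404, means the server is reachable, while a timeout points at the network path.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.SocketTimeoutException;
import java.net.URL;
import java.net.URLConnection;

public class ApmConnectivityProbe {
    // Sends a plain GET and reports whether a response arrived within the timeouts.
    static String probe(String serverUrl, int connectTimeoutMs, int readTimeoutMs) {
        try {
            URLConnection raw = new URL(serverUrl).openConnection();
            if (!(raw instanceof HttpURLConnection)) {
                return "ERROR: not an HTTP(S) URL";
            }
            HttpURLConnection conn = (HttpURLConnection) raw;
            conn.setConnectTimeout(connectTimeoutMs);
            conn.setReadTimeout(readTimeoutMs);
            conn.setRequestMethod("GET");
            int status = conn.getResponseCode(); // blocks until response or timeout
            conn.disconnect();
            return "REACHABLE: HTTP " + status;
        } catch (SocketTimeoutException e) {
            // Covers both "connect timed out" and "Read timed out".
            return "TIMEOUT: " + e.getMessage();
        } catch (IOException e) {
            // Malformed URL, DNS failure, TLS failure, connection refused, ...
            return "ERROR: " + e;
        }
    }

    public static void main(String[] args) {
        // Placeholder URL; pass your real APM server URL as the first argument.
        String url = args.length > 0 ? args[0] : "https://APMserverURL/";
        System.out.println(probe(url, 5000, 5000));
    }
}
```

Running it a few times from the on-prem node and from a host that works fine should show whether the two network paths behave differently.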

Hi @Jonas_Kunz ,

I have already enabled debug logging in my APM agent; the error log you mention comes from that debug output. I couldn't see any other error.

After the error log in my first message, I could see multiple occurrences (maybe 50+ times) of the logs below:

2023-10-22 10:19:59,368 [elastic-apm-remote-config-poller] DEBUG co.elastic.apm.agent.util.UrlConnectionUtils - Opening https://APMserverURL/config/v1/agents without proxy
2023-10-22 10:19:59,368 [elastic-apm-remote-config-poller] DEBUG co.elastic.apm.agent.configuration.ApmServerConfigurationSource - Reloading configuration from APM Server https://APMserverURL/config/v1/agents
2023-10-22 10:19:59,493 [elastic-apm-remote-config-poller] DEBUG co.elastic.apm.agent.configuration.ApmServerConfigurationSource - Configuration did not change
2023-10-22 10:19:59,493 [elastic-apm-remote-config-poller] DEBUG co.elastic.apm.agent.configuration.ApmServerConfigurationSource - Scheduling next remote configuration reload in 30s
2023-10-22 10:20:20,461 [elastic-apm-configuration-reloader] DEBUG co.elastic.apm.agent.impl.ElasticApmTracerBuilder - Beginning scheduled configuration reload (interval is 30 sec)...
2023-10-22 10:20:20,461 [elastic-apm-configuration-reloader] DEBUG co.elastic.apm.agent.impl.ElasticApmTracerBuilder - Finished scheduled configuration reload
2023-10-22 10:20:29,498 [elastic-apm-remote-config-poller] DEBUG co.elastic.apm.agent.util.UrlConnectionUtils - Opening https://APMserverURL/config/v1/agents without proxy
2023-10-22 10:20:29,498 [elastic-apm-remote-config-poller] DEBUG co.elastic.apm.agent.configuration.ApmServerConfigurationSource - Reloading configuration from APM Server https://APMserverURL/config/v1/agents
2023-10-22 10:20:29,638 [elastic-apm-remote-config-poller] DEBUG co.elastic.apm.agent.configuration.ApmServerConfigurationSource - Configuration did not change
2023-10-22 10:20:29,638 [elastic-apm-remote-config-poller] DEBUG co.elastic.apm.agent.configuration.ApmServerConfigurationSource - Scheduling next remote configuration reload in 30s
2023-10-22 10:20:50,463 [elastic-apm-configuration-reloader] DEBUG co.elastic.apm.agent.impl.ElasticApmTracerBuilder - Beginning scheduled configuration reload (interval is 30 sec)...
2023-10-22 10:20:50,463 [elastic-apm-configuration-reloader] DEBUG co.elastic.apm.agent.impl.ElasticApmTracerBuilder - Finished scheduled configuration reload

and then the logs below:

2023-10-22 10:48:09,615 [elastic-apm-shared] DEBUG co.elastic.apm.agent.report.ApmServerReporter - Could not add JsonWriter {"metricset":{"timestamp":1697932089615000,"tags":{"name":"PS MarkSweep"},"samples":{"jvm.gc.time":{"value":65891.0},"jvm.gc.count":{"value":11.0}}}}
 to ring buffer as no slots are available
2023-10-22 10:48:09,615 [elastic-apm-shared] DEBUG co.elastic.apm.agent.report.ApmServerReporter - Could not add JsonWriter {"metricset":{"timestamp":1697932089615000,"tags":{"name":"Compressed Class Space"},"samples":{"jvm.memory.non_heap.pool.committed":{"value":18087936.0},"jvm.memory.non_heap.pool.used":{"value":16583912.0},"jvm.memory.non_heap.pool.max":{"value":1073741824.0}}}}
 to ring buffer as no slots are available

The "Could not add ... to ring buffer as no slots are available" log message indicates that the APM agent's internal queue, which buffers data before it is sent, is full because no data can currently be sent.
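As a rough illustration of that behaviour (not the agent's actual implementation), the sketch below uses a bounded `ArrayBlockingQueue` with a made-up capacity. Because nothing drains the queue, just as when the reporter cannot reach the server, every event after the queue is full is rejected instead of blocking the application.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedBufferSketch {
    // Offers `events` items to a queue of size `capacity` that is never drained
    // and returns how many of them were rejected.
    static int countDropped(int capacity, int events) {
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(capacity);
        int dropped = 0;
        for (int i = 0; i < events; i++) {
            // offer() returns false immediately when no slot is free.
            if (!buffer.offer("event-" + i)) {
                dropped++;
            }
        }
        return dropped;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 4 slots, 10 events, no consumer.
        System.out.println("Dropped events: " + countDropped(4, 10)); // prints "Dropped events: 6"
    }
}
```

So the "no slots are available" messages are a symptom of the send failure, not a separate problem.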

It might also be the case that the error you are seeing is caused by an overloaded APM server. Could you try disabling all other APM-agents sending to that server and check whether the error still persists?

Hi @Jonas_Kunz , I have even tried with only 2 applications, but the issue still persists. Is there anything else we should look at?

That seems strange. Could you provide

  • the full APM-agent debug logs
  • the APM server logs

so that we can analyse further. Both log files should cover the same period of time. You can use GitHub gists to upload those logs.

Hi @Jonas_Kunz ,

I have also seen a warning message in our logs. Could this be the issue?

2023-10-20 21:14:46,962 [https-jsse-nio-443-exec-119] WARN  co.elastic.apm.agent.bci.bytebuddy.ErrorLoggingListener - org.apache.commons.httpclient.HttpMethodDirector uses an unsupported class file version (pre Java 4)) and can't be instrumented. You may try setting the 'instrument_ancient_bytecode' config option to 'true', but notice that it may cause VerificationErrors or other issues.

That warning simply states that org.apache.commons.httpclient.HttpMethodDirector won't be instrumented. It doesn't affect anything else; in particular, APM communication continues regardless of that warning. As Jonas said, the error looks like communication with the APM server stopped, either because of network problems or APM server overload. One other possibility is that something in the app switched the JVM to using a proxy. We have some logging for proxy usage: search the agent DEBUG logs for "proxy".
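Besides searching the logs, you can ask the JVM directly which proxy it would pick for a given URL. This is a minimal sketch using the standard `java.net.ProxySelector`; the class name and the URL are placeholders, and the result is only meaningful if the code runs inside the affected application's JVM (or one started with the same system properties).

```java
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.util.Collections;
import java.util.List;

public class ProxyCheck {
    // Returns the proxies the JVM-wide selector would use for the given URL.
    static List<Proxy> proxiesFor(String url) {
        ProxySelector selector = ProxySelector.getDefault();
        if (selector == null) {
            // No selector installed: connections are made directly.
            return Collections.singletonList(Proxy.NO_PROXY);
        }
        return selector.select(URI.create(url));
    }

    // True if every candidate is a direct (no proxy) connection.
    static boolean isDirect(String url) {
        for (Proxy proxy : proxiesFor(url)) {
            if (proxy.type() != Proxy.Type.DIRECT) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Placeholder URL; pass your real APM server URL as the first argument.
        String url = args.length > 0 ? args[0] : "https://APMserverURL/";
        System.out.println("Proxies for " + url + ": " + proxiesFor(url));
        System.out.println("Direct connection: " + isDirect(url));
    }
}
```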

Hi @Jack_Shirazi ,

I am getting only one log line related to proxy, which I am pasting below. How can we identify whether the APM server is overloaded? Is there any limit on the number of services we can integrate? How many services can a single APM Server instance with 1 GB RAM handle?

2023-10-23 14:42:38,667 [elastic-apm-remote-config-poller] DEBUG co.elastic.apm.agent.util.UrlConnectionUtils - Opening https://apm-server-url/config/v1/agents without proxy

That rules out proxy issues. Check the APM server logs and CPU load. The limit is throughput, not the number of services.

Hi @Jack_Shirazi ,

That is good to know. May I know how much throughput a server can handle?

This issue only occurs when we integrate APM with our on-prem applications. We have also deployed some applications in AWS ECS, and when we integrate APM with them there is no issue. So I think the problem might be the connectivity between the on-prem node and the APM server. Can anyone clarify this?