Connection timed out problem

Kibana version: 7.4.1

Elasticsearch version: 7.4.1

APM Server version: 7.6.1

APM Agent language and version: java

elasticapm.properties

environment=twtpehswj2ui01
application_packages=com.delta
server_urls=http://10.148.208.47:8090
log_level=TRACE
log_file=AGENT_HOME/../logs/elastic-apm.log
log_file_size=10mb
server_timeout=0

Description of the problem including expected versus actual behavior. Please include screenshots (if relevant):

Sometimes the APM data is received correctly, but there are periods during which no data arrives (the empty gaps in the chart below). When I checked the agent log (attached below), it reported "Connection timed out".

However, when I send a request to http://10.148.208.47:8090/intake/v2/events with Postman, it responds with 202, so I don't think connectivity is the problem.

Any suggestions on how to fix the problem?
Thanks.

APM view in Kibana

Error log

2020-09-17 08:57:58,900 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Failed to handle event of type TRANSACTION with this error: Connection timed out (Connection timed out)
2020-09-17 09:12:32,628 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Failed to handle event of type METRICS with this error: Connection timed out (Connection timed out)
2020-09-17 09:14:39,860 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Failed to handle event of type METRICS with this error: Connection timed out (Connection timed out)
2020-09-17 09:16:48,116 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Failed to handle event of type METRICS with this error: Connection timed out (Connection timed out)
2020-09-17 09:18:59,571 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Failed to handle event of type METRICS with this error: Connection timed out (Connection timed out)

Entire log

From this line in the logs, it seems that you have a very optimistic server connection timeout of 0s, which probably explains why you get so many "Connection timed out" errors. Can you try a value higher than zero, such as 1s, or the default (5s)? Also, please note that this value should include a unit; it is not just a number.

Hi,

I previously tried setting server_timeout to 5s and to 60s, and the problem still occurred.

Then I found a post that says:

If a request to the APM server takes longer than the configured timeout, the request is cancelled and the event (exception or transaction) is discarded. Set to 0 to disable timeouts.

That's why I set server_timeout to zero without a unit: I wanted to disable the timeout entirely. But that does not seem to solve the problem either.

Hi @bob96589,

Could you check your server logs for the time frames in which no data appears to be sent?

If there is no visible activity during those time frames, the agent might not have been able to reach the server at all, which would point to a network issue rather than an issue with the agent. You might need to increase the log level on the server side to see this.
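
To know which time frames to look up on the server side, it may help to pull the failure timestamps out of the agent log first. The following is only a rough Python sketch: the regular expression is written against the error lines quoted earlier in this thread, the function names `timeout_events` and `failure_windows` are made up for illustration, and the 300-second grouping threshold is an arbitrary assumption.

```python
import re
from datetime import datetime

# Pattern based on the agent error lines quoted in this thread (an assumption
# about the log layout; adjust it if your log format differs).
ERROR_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} .*"
    r"Failed to handle event of type (?P<type>\w+) .*Connection timed out"
)


def timeout_events(lines):
    """Return (timestamp, event_type) for every 'Connection timed out' error line."""
    events = []
    for line in lines:
        match = ERROR_RE.match(line)
        if match:
            stamp = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
            events.append((stamp, match.group("type")))
    return events


def failure_windows(events, max_gap_seconds=300):
    """Group errors that are at most max_gap_seconds apart into (start, end) windows."""
    windows = []
    for stamp, _event_type in events:
        if windows and (stamp - windows[-1][1]).total_seconds() <= max_gap_seconds:
            # Close enough to the previous error: extend the current window.
            windows[-1] = (windows[-1][0], stamp)
        else:
            # Too far from the previous error (or first error): start a new window.
            windows.append((stamp, stamp))
    return windows
```

Calling `failure_windows(timeout_events(open(path)))` on the agent log (with `path` pointing at your own elastic-apm.log) gives the start and end of each failure period, which you can then compare against the server logs for the same times.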

I assume that you have only a single apm-server instance, so my hypotheses are the following:

  • if you have a single agent and there is nothing in the server logs after increasing the log level, there is an issue on the network
  • if you have more than one agent and there is nothing in the server logs, the issue is still on the network, but more likely on the server side (as no agent seems able to reach it)
  • if you have more than one agent and some of them are able to reach the server, the issue might be on the network on the agent side, or there might be a bug in the agent.
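
To tell these cases apart, one option is to run a small connectivity probe from the agent host over a longer period, since a single successful request only shows that the server was reachable at that moment. The sketch below is a minimal illustration in Python and is not part of the agent or of apm-server: the function names `probe` and `monitor` and the interval, count, and timeout defaults are assumptions chosen for the example.

```python
import socket
import time


def probe(host, port, timeout=5.0):
    """Attempt one TCP connection; return (ok, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False, time.monotonic() - start, str(exc)


def monitor(host, port, interval=10.0, count=6, timeout=5.0):
    """Probe repeatedly, print one timestamped line per attempt, return the results."""
    results = []
    for attempt in range(count):
        ok, elapsed, error = probe(host, port, timeout)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        status = "OK" if ok else "FAIL"
        print(f"{stamp} {status} {elapsed:.3f}s {error}")
        results.append(ok)
        if attempt < count - 1:
            time.sleep(interval)
    return results
```

For the setup in this thread, that would be something like `monitor("10.148.208.47", 8090, interval=10.0, count=360)`, run on the machine where the agent runs. If the FAIL lines fall in the same periods as the gaps in Kibana, that points toward the network rather than the agent.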