Description of the problem including expected versus actual behavior. Please include screenshots (if relevant):
Sometimes APM data is received correctly, but there are periods when no data arrives (the empty spaces in the chart below). I then checked the agent log (attached below), which reports "Connection timed out".
2020-09-17 08:57:58,900 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Failed to handle event of type TRANSACTION with this error: Connection timed out (Connection timed out)
2020-09-17 09:12:32,628 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Failed to handle event of type METRICS with this error: Connection timed out (Connection timed out)
2020-09-17 09:14:39,860 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Failed to handle event of type METRICS with this error: Connection timed out (Connection timed out)
2020-09-17 09:16:48,116 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Failed to handle event of type METRICS with this error: Connection timed out (Connection timed out)
2020-09-17 09:18:59,571 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Failed to handle event of type METRICS with this error: Connection timed out (Connection timed out)
From this line in the logs, it seems that you have a very optimistic server connection timeout of 0s, which probably explains why you get so many "Connection timed out" errors. Can you try a value higher than zero, such as 1s, or the default (5s)? Also, please note that this value needs a unit; it is not just a number.
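To illustrate the point about units, here is a minimal Python sketch of a check you could run on a configured value before handing it to the agent. The helper `parse_duration_ms` is hypothetical, and the accepted units (`ms`, `s`, `m`) are an assumption made for illustration; this is not the agent's own parser.

```python
import re

# Assumed format (for illustration only): an integer followed by ms, s, or m.
_DURATION = re.compile(r"(-?\d+)(ms|s|m)")
_UNIT_TO_MS = {"ms": 1, "s": 1000, "m": 60000}


def parse_duration_ms(value):
    """Return the duration in milliseconds, or raise ValueError if the
    value is not an integer followed by a unit (hypothetical helper)."""
    match = _DURATION.fullmatch(value.strip())
    if match is None:
        raise ValueError(
            f"invalid duration {value!r}: expected a number with a unit, e.g. '5s'"
        )
    amount, unit = match.groups()
    return int(amount) * _UNIT_TO_MS[unit]
```

With this check, `"5s"` and `"60s"` are accepted, while a bare `"0"` is rejected because it carries no unit.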
Previously, I tried setting server_timeout to 5s and to 60s, and the problem still occurred.
Then I found a post that says:
If a request to the APM server takes longer than the configured timeout, the request is cancelled and the event (exception or transaction) is discarded. Set to 0 to disable timeouts.
That's why I set server_timeout to zero, without a unit: I wanted to disable the timeout. But this does not seem to solve the problem either.
Could you check your server logs for the time frames where no data appears to be sent?
If there is no visible activity during those time frames, the agent might not have been able to reach the server at all, which would point to a network issue rather than an issue with the agent. You might need to increase the log level on the server side.
I assume that you have only a single apm-server instance, so my hypotheses are the following:
If you have a single agent and there is nothing in the server logs after increasing the log level, there is an issue on the network.
If you have more than one agent and there is nothing in the server logs, the issue is still on the network, but more likely on the server side (as no agent seems able to reach it).
If you have more than one agent and some of them are able to reach the server, the issue might be on the network on the agent side, or there might be a bug in the agent.
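To help tell these cases apart, you could run a small connectivity probe from the agent host and compare its timestamps with the gaps in the chart. The following is a minimal Python sketch, not part of the agent: the function names `can_connect` and `probe` are hypothetical, and it only tests whether a plain TCP connection can be opened, not whether the server accepts events.

```python
import socket
import time
from datetime import datetime


def can_connect(host, port, timeout_s=5.0):
    """Try one TCP connection and return (ok, detail)."""
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            elapsed = time.monotonic() - started
            return True, f"connected in {elapsed:.3f}s"
    except OSError as exc:
        # Covers timeouts, refused connections, and name resolution failures.
        return False, f"{type(exc).__name__}: {exc}"


def probe(host, port, attempts, interval_s=30.0, timeout_s=5.0):
    """Attempt a connection `attempts` times, printing one timestamped line
    per attempt, and return the list of (timestamp, ok, detail) results."""
    results = []
    for i in range(attempts):
        ok, detail = can_connect(host, port, timeout_s)
        stamp = datetime.now().isoformat(timespec="seconds")
        print(f"{stamp} {host}:{port} {'OK' if ok else 'FAIL'} {detail}")
        results.append((stamp, ok, detail))
        if i < attempts - 1:
            time.sleep(interval_s)
    return results
```

For example, `probe("apm-server.example.com", 8200, attempts=120)` would check roughly once every 30 seconds for an hour; the host name is a placeholder, and 8200 is the default APM Server port. If the probe fails during the same periods in which the agent logs "Connection timed out", that supports the network hypothesis; if it keeps succeeding, the agent side deserves a closer look.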