Why is there a lot of "Waiting 30s" in the service log?

wajika · November 9, 2021, 8:29am

Waiting 30s... Wait time is taken from max-age directive in Cache-Control header in APM Server's response. dbgIterationsCount: 37436.
Waiting 30s... Wait time is taken from max-age directive in Cache-Control header in APM Server's response. dbgIterationsCount: 36.
Waiting 30s... Wait time is taken from max-age directive in Cache-Control header in APM Server's response. dbgIterationsCount: 27.

Today, while checking the logs collected by Filebeat, I found that many service logs contain these "Waiting" messages. Why is that? I could not find an answer in "Troubleshoot Common problems".

riferrei · November 9, 2021, 2:49pm

I think you should read this discussion here from last year:

— @riferrei

wajika · November 10, 2021, 5:43am

Thank you for your reply. I read that discussion, but I don't see how it relates to my issue. What I want to confirm is what "Waiting 30s" means. Is the apm-server queue full?

riferrei · November 10, 2021, 1:10pm

It may be, but we can't know for sure without proper debugging. If you can, please run the APM Server with the options -e -d "*" to increase the verbosity of the log. Also, check whether there is a load balancer between the agent and the APM Server; sometimes these timeouts are caused by mismatched timeout settings.

wajika · November 10, 2021, 2:31pm

Debugging may be a little troublesome, because the apm-server is running on a Kubernetes cluster.
The agent reaches the apm-server through a domain name (forwarded by nginx). I checked the nginx log, and there are no errors.

Next, I will spend a little time observing.

riferrei · November 10, 2021, 3:14pm

Ah, that's interesting. Check the timeouts configured for NGINX and the APM Server. Ideally, your load balancer timeout should sit between the agent timeout and the APM Server timeout, and it should certainly be smaller than the APM Server timeout. For example:

APM agent (10s) < Load Balancer (15s) < APM Server (30s)

— @riferrei
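The ordering suggested above can be checked mechanically once the three values are known. Here is a minimal Python sketch, assuming you have already read the timeouts out of your own agent, NGINX, and APM Server configuration. The function name `check_timeout_chain` is hypothetical and is not part of any Elastic or NGINX tooling.

```python
from typing import List


def check_timeout_chain(agent_s: float, proxy_s: float, server_s: float) -> List[str]:
    """Return a list of problems with the agent < proxy < server ordering.

    An empty list means the three timeouts follow the suggested ordering.
    """
    problems = []
    if not agent_s < proxy_s:
        problems.append(
            f"proxy timeout ({proxy_s}s) should be larger than agent timeout ({agent_s}s)"
        )
    if not proxy_s < server_s:
        problems.append(
            f"proxy timeout ({proxy_s}s) should be smaller than server timeout ({server_s}s)"
        )
    return problems


if __name__ == "__main__":
    # The example values from the post: agent 10s, load balancer 15s, server 30s.
    print(check_timeout_chain(10, 15, 30))
```

With the example values from the post the returned list is empty. A proxy timeout larger than the server timeout, for instance, would produce one entry.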

wajika · November 11, 2021, 3:37am

I captured part of the debug log. It contains "done send ack", so the data has evidently been sent to Elasticsearch, but no new data appeared in Elasticsearch.

https://paste.ubuntu.com/p/kNt9BWVdvV/

wajika · November 12, 2021, 6:47am

Here is the service log:

https://paste.ubuntu.com/p/yhXJBJTdwM/

riferrei · November 12, 2021, 6:20pm

According to your service log, this message represents the .NET agent trying to poll data from the APM Server and timing out.

https://github.com/elastic/apm-agent-dotnet/blob/9a6d3b0c81bf5341bcc188ef8bbc66cf24af958c/src/Elastic.Apm/BackendComm/CentralConfig/CentralConfigurationFetcher.cs#L155

    
      
                        , httpResponse == null ? " N/A" : Environment.NewLine + httpResponse.ToString().Indent()
                        , httpResponseBody == null ? "N/A" : httpResponseBody.Length.ToString()
                        , httpResponseBody == null ? " N/A" : Environment.NewLine + httpResponseBody.Indent());
            }
            finally
            {
                httpRequest?.Dispose();
                httpResponse?.Dispose();
            }

            _logger.Trace()?.Log("Waiting {WaitInterval}... {WaitReason}. dbgIterationsCount: {dbgIterationsCount}."
                , waitInfo.Interval.ToHms(), waitInfo.Reason, _dbgIterationsCount);
            await _agentTimer.Delay(_agentTimer.Now + waitInfo.Interval, CancellationTokenSource.Token).ConfigureAwait(false);
        }

        private HttpRequestMessage BuildHttpRequest(EntityTagHeaderValue eTag)
        {
            var httpRequest = new HttpRequestMessage(HttpMethod.Get, _getConfigAbsoluteUrl);
            if (eTag != null) httpRequest.Headers.IfNoneMatch.Add(eTag);
            return httpRequest;
        }

dbgIterationsCount represents the number of polling attempts that failed to complete. Each attempt increments the counter and then schedules another attempt. This means that something is going on between the agent and the APM Server at the network level. If I were you, I would start by capturing some network packets from this communication to understand what is happening. As you can see in the agent's code, it is a simple request-reply HTTP interaction.

— @riferrei
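The log message quoted at the top of the thread says that the wait time is taken from the max-age directive of the Cache-Control header in the APM Server's response. To illustrate how such an interval could be derived from a header value, here is a minimal Python sketch. It is not the agent's actual C# implementation. The function name `wait_seconds_from_cache_control` and the 300-second fallback are assumptions made only for this example.

```python
from typing import Optional


def wait_seconds_from_cache_control(header: Optional[str], default_s: int = 300) -> int:
    """Extract max-age (in seconds) from a Cache-Control header value.

    Falls back to default_s (a hypothetical value) when the header is
    missing or carries no usable max-age directive.
    """
    if not header:
        return default_s
    # Cache-Control is a comma-separated list of directives, e.g.
    # "no-cache, max-age=30, must-revalidate".
    for directive in header.split(","):
        name, _, value = directive.strip().partition("=")
        if name.strip().lower() == "max-age":
            value = value.strip().strip('"')
            if value.isascii() and value.isdigit():
                return int(value)
    return default_s


if __name__ == "__main__":
    print(wait_seconds_from_cache_control("max-age=30"))
```

A response carrying `Cache-Control: max-age=30` would therefore lead to a 30-second pause before the next poll, which matches the "Waiting 30s" text in the log.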

wajika · November 13, 2021, 2:16am

Okay, I will try to capture and analyze.

wajika · November 15, 2021, 2:26am

I checked the captured data, but I don't know how to analyze it. Can you take a look?
https://drive.google.com/file/d/1ulOd19IN4a5YE3IbeSz82zaIAb0Cqkqo

wajika · November 17, 2021, 1:24am

I guess this problem is caused by an incorrect mapping. I added a field to the template, and because this field arrives in multiple formats (such as string and JSON), Elasticsearch did not map it correctly. The messages that apm-server sent to Elasticsearch presumably never received a success response, so it has been waiting in a loop.
This is only my guess.

system · December 7, 2021, 9:24pm

This topic was automatically closed 20 days after the last reply. New replies are no longer allowed.