Apm-server disconnections after upgrading to 7.10

  • Kibana version: 7.5.1

  • Elasticsearch version: 7.5.1

  • APM Server version: 7.10.0

  • APM Agent language and version: Java and RUM

  • Original install method (e.g. download page, yum, deb, from source, etc.) and version: yum

  • Fresh install or upgraded from other version? APM Server upgraded from 7.5.1 to 7.10.0

  • Is there anything special in your setup? We use Logstash between the APM Servers and Elasticsearch for caching and for pipelines such as user-agent parsing and geolocation

  • Description of the problem including expected versus actual behavior.:
    I'm upgrading the APM stack from 7.5.1 to the latest release, 7.10.0. So far I have upgraded the APM Servers and the Logstash instances; Elasticsearch and Kibana will follow in the next iteration.
    Everything appears to keep working, but since the upgrade I see APM connections on :8200 dropping frequently. See the screenshot of the behavior before and after the upgrade (the upgrade was done at 8:00 on 2 Dec; a few hours later connections started being reset every few minutes):

Everything else is working fine and I don't see any errors indicating a problem. Is there anything I could check to find out what is causing this? Or is a continuous reset of connections normal in recent versions?

Thanks in advance!

Hi @moixcruz,
the change in behavior is not expected for recent APM Server version upgrades. Could you maybe provide some more details:

  • which agent versions are you using and have you also updated them around the same time; if yes from which versions?
  • can the behavior be observed for connections from the java and the RUM agent?
  • do you see any errors or logs in the agents indicating any issues?

Thanks @simitt for your fast response

  • We have a wide variety of agents connected: Java (I see versions from 1.9.0 to 1.16.0), .NET (1.5.1) and JS/RUM (4.4.4, 4.9.1 and 5.0.0). I maintain the Elastic APM backend and other teams are responsible for the agents in the products, so unfortunately this is something I cannot control.
  • I don't see anything wrong on the agent side; I checked connections and logs.
  • I can't see anything unusual in the agents. As mentioned, I have no access to them, but I asked for logs from a few random agents and found nothing in them indicating recurrent disconnections.

Can you please also check the APM Server logs? If there is an issue between APM Server and Logstash, the internal memory queue might fill up, resulting in a 503 response from the APM Server. In that case an error is returned to the agents immediately and a new connection is created. I would expect this to show up in the agent logs as well, but it might be worth checking.
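If grepping by hand is tedious, a rough Python sketch along these lines could tally response codes in the APM Server log. It assumes the request log lines are JSON and carry the status under a key such as `response_code` or `http.response.status_code`; both key names are assumptions and may differ depending on version and logging configuration, so adjust them to what your logs actually contain.

```python
import json
import re
from collections import Counter

# Assumed field names for the HTTP status in request log entries.
# These are guesses; check a real log line and adjust as needed.
STATUS_KEYS = ("response_code", "http.response.status_code")
STATUS_RE = re.compile(
    r'"(?:response_code|http\.response\.status_code)"\s*:\s*(\d{3})'
)


def count_status_codes(lines):
    """Count HTTP status codes found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except ValueError:
            # Line is not pure JSON (e.g. it has a text prefix):
            # fall back to a regex search for the status field.
            match = STATUS_RE.search(line)
            if match:
                counts[int(match.group(1))] += 1
            continue
        if not isinstance(entry, dict):
            continue
        for key in STATUS_KEYS:
            value = str(entry.get(key, ""))
            if value.isdigit():
                counts[int(value)] += 1
                break
    return counts
```

To use it, open the APM Server log file (its location depends on the install) and pass the file handle to `count_status_codes`; a non-zero count for 503 would point at the full-queue case described above.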

Hi @simitt, sorry for not answering sooner; I had some days off without access to the servers.
I can't find any 503 in the logs. However, the behavior of the TCP connections has improved over the last few days :flushed: See the last 30 days, where it gradually got fixed several days after I opened this thread:

It could be caused by a network issue, but it is very odd that it started right after the upgrade.

I will ask the network team whether anything could explain it. In any case, thanks a lot for your help. I'll report back with my findings.
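For anyone chasing a similar pattern, one way to gather data for the network discussion is to sample the TCP socket states on the APM Server port over time. The sketch below is only an illustration and assumes a Linux host, where `/proc/net/tcp` lists sockets with hex-encoded addresses and state codes; the function name `count_states` is made up for this example.

```python
from collections import Counter

# TCP state codes as they appear (in hex) in /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED",
    "02": "SYN_SENT",
    "03": "SYN_RECV",
    "04": "FIN_WAIT1",
    "05": "FIN_WAIT2",
    "06": "TIME_WAIT",
    "07": "CLOSE",
    "08": "CLOSE_WAIT",
    "09": "LAST_ACK",
    "0A": "LISTEN",
    "0B": "CLOSING",
}


def count_states(proc_net_tcp_text, port=8200):
    """Count socket states for entries whose local port equals `port`."""
    counts = Counter()
    # The first row of /proc/net/tcp is a column header, so skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue
        local_address, state = fields[1], fields[3]
        # The local address looks like "0100007F:2008" (hex IP:hex port).
        local_port = int(local_address.rsplit(":", 1)[1], 16)
        if local_port == port:
            counts[TCP_STATES.get(state.upper(), state)] += 1
    return counts
```

Reading `/proc/net/tcp` (and `/proc/net/tcp6`, which has the same column layout) every minute or so and logging the result of `count_states` would show whether established connections on :8200 drop in bursts, which can then be compared against network events.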
