Original install method (e.g. download page, yum, deb, from source, etc.) and version: yum
Fresh install or upgraded from other version? APM upgraded from 7.5.1 to 7.10.0
Is there anything special in your setup? We use Logstash between the APM Servers and Elasticsearch for caching and for pipelines such as user agent or geolocation
Description of the problem including expected versus actual behavior.:
I'm upgrading the APM stack from 7.5.1 to the latest 7.10.0. I first upgraded the apm-server(s) and logstash(s); in the next iteration we will upgrade Elasticsearch and Kibana.
Everything appears to keep working, but since the upgrade I see that APM connections on :8200 are dropping frequently. See the screenshot comparing the behavior before and after the upgrade (the upgrade was done at 8:00 on 2 Dec; a few hours later connections are reset every few minutes):
Everything else is working fine and I don't see any errors indicating a problem. Is there anything I could check to see what is causing this? Or is a continuous reset of connections perhaps normal in recent versions?
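One way to put numbers on the drops described above, independent of the dashboard, is to sample the TCP states on the listening port at intervals. The following is only a minimal sketch: it assumes the text comes from `ss -tan` on the APM Server host, and the function name `count_states` and the sample addresses are hypothetical.

```python
from collections import Counter

def count_states(ss_output: str, port: int = 8200) -> Counter:
    """Count TCP states for sockets whose local port equals `port`."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Expected columns: State Recv-Q Send-Q Local:Port Peer:Port.
        # Skip the header row and anything too short to parse.
        if len(fields) < 5 or fields[0] == "State":
            continue
        local = fields[3]
        if local.rsplit(":", 1)[-1] == str(port):
            counts[fields[0]] += 1
    return counts

# Hypothetical sample in the layout printed by `ss -tan`.
SAMPLE = """\
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB      0      0      10.0.0.5:8200       10.0.1.7:51234
ESTAB      0      0      10.0.0.5:8200       10.0.1.8:40112
TIME-WAIT  0      0      10.0.0.5:8200       10.0.1.9:40200
ESTAB      0      0      10.0.0.5:22         10.0.2.1:50022
LISTEN     0      128    *:8200              *:*
"""

if __name__ == "__main__":
    for state, n in sorted(count_states(SAMPLE).items()):
        print(state, n)
```

On a live host the input could come from `subprocess.run(["ss", "-tan"], capture_output=True, text=True).stdout`. Logging the counts every minute would show whether established connections really collapse periodically or whether only the graph's aggregation changed.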
We have a wide variety of agents connected: Java (I see versions from 1.9.0 to 1.16.0), .NET (1.5.1) and JS/RUM (4.4.4, 4.9.1 and 5.0.0). I maintain the Elastic APM backend and other teams are responsible for the agents in the product, so unfortunately this is something I cannot control.
I don't see anything wrong on the agents; I checked connections and logs.
I can't see anything unusual on the agents. As I said, I have no access to them, but I asked for logs from some random agents and cannot see anything in them indicating recurrent disconnections.
Can you please also check the APM Server logs? If there is an issue between APM Server and Logstash, the internal memory queue might fill up, resulting in a 503 response from the APM Server. In that case an error is returned to the agents immediately and a new connection is created. I would expect this to show up in the agent logs as well, but it might be worth checking.
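The log check suggested above could be scripted roughly as follows. This is a hedged sketch, not a description of the actual APM Server log format: it assumes a rejected request leaves a line containing the status code 503 or the phrase "queue is full", and the sample lines and the name `find_rejections` are hypothetical.

```python
import re
from typing import Iterable, List

# Assumption: rejected requests appear in the log with the status code 503
# or the phrase "queue is full". The real log layout may differ.
PATTERN = re.compile(r"\b503\b|queue is full", re.IGNORECASE)

def find_rejections(lines: Iterable[str]) -> List[str]:
    """Return the log lines that look like 503 / full-queue rejections."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]

# Hypothetical log lines, for illustration only.
SAMPLE_LOG = [
    "2020-12-02T09:15:00Z INFO request accepted status=202\n",
    '2020-12-02T09:15:01Z ERROR request rejected status=503 error="queue is full"\n',
    "2020-12-02T09:15:02Z INFO handled 1503 events\n",
]

if __name__ == "__main__":
    for hit in find_rejections(SAMPLE_LOG):
        print(hit)
```

To use it on a real file, pass an open file handle for the APM Server log to `find_rejections`. The word-boundary match is a deliberate choice so that numbers such as 1503 are not mistaken for a status code.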
Hi @simitt, sorry for not answering sooner; I had some days off without access to the servers.
I can't find any 503 in the logs. However, I can now see that the TCP connection behavior has improved over the last few days. See the last 30 days, where it gradually recovered several days after I opened this thread:
It could be caused by a network issue, but it is very odd that it started right after the upgrade.
I will ask the network team whether something could explain it. In any case, thanks a lot for your help. I'll comment back with my findings.