Looking for help/information about the impact of the custom Mule APM Agent on application performance.
Context:
we have a 7.12.1 Elastic Stack installed "on premise" on dedicated VMs in a Google Cloud Platform environment;
the stack includes a standalone (legacy) APM Server binary, which monitors several systems via instances of the Java APM Agent (elastic-apm-agent-1.30.jar) (see the attachment sketch after this list);
we need to monitor a MuleSoft application, which is also installed on dedicated VMs in a Google Cloud Platform environment;
the custom agent was installed and configured correctly in the MuleSoft environment, and began collecting APM data;
weeks later, the MuleSoft application's Dev team complained about significant degradation in transaction latency, which they attributed to the addition of the Mule APM Agent;
as the latency degradation reached a critical point, the Mule APM Agent installation was rolled back to prevent further issues;
the application has been performing nominally since then;
because of the circumstances of the Agent's rollback, we do not have reliable data on the performance differences between "before", "after" and "back again".
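For context, here is a minimal sketch of one way to attach the standard Java APM agent to a JVM (self-attach via the apm-agent-attach artifact); the service name, server URL and package below are hypothetical, and the Mule APM Agent itself is configured per its own documentation:

```java
import co.elastic.apm.attach.ElasticApmAttacher;

import java.util.HashMap;
import java.util.Map;

public class ApmSelfAttach {
    public static void main(String[] args) {
        // Hypothetical settings; real values come from environment-specific configuration.
        Map<String, String> apmConfig = new HashMap<>();
        apmConfig.put("service_name", "example-service");           // hypothetical service name
        apmConfig.put("server_urls", "http://apm-server:8200");     // hypothetical APM Server URL
        apmConfig.put("application_packages", "com.example");       // hypothetical base package

        // Attaches the Elastic APM Java agent to the current JVM at runtime.
        ElasticApmAttacher.attach(apmConfig);

        // ... application startup continues with the agent active
    }
}
```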
Questions:
does anyone have the custom Mule APM Agent installed in a Production environment?
if so, did you experience performance issues with MuleSoft (esp. transaction latency) as a consequence?
is anyone aware of any specific points of attention when setting up the Mule APM Agent, in addition to the documentation linked above?
can anyone point us to further resources on the Mule APM Agent?
The MuleSoft system where we had the transaction latency issue is managed by another team from another company, so, before we file a GitHub issue and in order to speed things up, we will have to ask them in advance for the information you may need to make sense of the situation.
We have prepared this list of items that we think would be relevant to your analysis (a small JVM-side collection sketch follows the list); please feel free to add and/or modify as necessary, and we will immediately forward the request to the MuleSoft team on your behalf.
MuleSoft version
List of MuleSoft components installed
for each module, number of Virtual Machines where it is installed (and which modules share the same VM)
for each Virtual Machine: OS version, JVM version, disk size, memory size, number of cores, etc.;
for each Virtual Machine: current resource usage (disk, memory, CPU) after removal of the Agent;
for each Virtual Machine: resource usage (disk, memory, CPU) at the time the issue occurred;
other?
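As a side note, something along these lines could be run on the Mule JVMs to capture the JVM-side memory/CPU/disk numbers we are asking for above; this is just a sketch using the standard JDK management beans (the com.sun.management cast assumes a HotSpot JVM):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class ResourceSnapshot {
    public static void main(String[] args) {
        // Heap usage of this JVM (used vs. max), via the standard MemoryMXBean.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("Heap used: %d MB / max: %d MB%n",
                heap.getUsed() / (1024 * 1024), heap.getMax() / (1024 * 1024));

        // Process and system CPU load; the cast assumes a HotSpot JVM.
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        System.out.printf("Process CPU load: %.2f, system CPU load: %.2f%n",
                os.getProcessCpuLoad(), os.getSystemCpuLoad());

        // Usable disk space on the current working directory's filesystem.
        long freeDisk = new java.io.File(".").getUsableSpace();
        System.out.printf("Usable disk space: %d MB%n", freeDisk / (1024 * 1024));
    }
}
```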
On "our" side of the integration, the Elastic stack is configured as follows:
all modules are Elastic version 7.12.1 (Platinum Licence), running on the default OpenJDK 11 JVM on Debian 10 Virtual Machines on Google Cloud Platform;
standalone (legacy) APM Server binary (1 VM, shared with Heartbeat);
Heartbeat (1 VM, shared with the APM Server);
Logstash (4 VMs, shared with Kibana);
Kibana (4 VMs, shared with Logstash);
Elasticsearch (10 VMs);
The APM Server is currently monitoring 4 other external applications via the standard Java APM agent, with no issues on either side.
THANK YOU for taking the time to look into the issue and apply improvements.
The MuleSoft team sent us some information about their Production environment this morning, in case it is still of use to you.
Their application runs on MuleSoft Runtime Standalone (version 3.8.4), installed on a total of 10 CentOS Linux machines (release 7.9.2009), each equipped with 8 CPUs, 48 GB RAM and 80 GB HD. The 10 nodes are organised as follows:
a Cluster with 2 nodes (Java 1.8.0_191);
a Server Group with 6 nodes (Java 1.8.0_322);
2 backup nodes (Java 1.8.0_322);
We are going to suggest that they try to install the improved v1.21.1 of your Agent, and see if the latency issue is solved.
However, they also added that they are preparing to migrate the whole application to the newer MuleSoft RTF Runtime 3.9.5.
The RTF Runtime would run on Java 1.8.0_282 on a total of 18 RHEL 8.4 machines, each equipped with 2 CPUs and 16 GB RAM. The nodes will be organised as follows:
3 Controller nodes, with 3 HD each: 80 GB (OS), 250 GB (Docker), 60 GB (etcd);
15 Worker nodes, with 2 HD each: 80 GB (OS), 250 GB (Docker);
Would your elastic-apm-mule3-agent v1.21.1 still be compatible with that?
For Mule 3 there are two versions of the agent, one for version 3.8 and another for version 3.9. I applied the fix to both.
Also, I recommend testing the fix first to validate that it fixes the issue in your environment. I used the Eclipse Memory Analyzer (MAT) to confirm that the SpanStore class was indeed the memory leak culprit in v1.19 but did not appear to be causing issues with v1.21.
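If you want to reproduce that check on your side, a heap dump can be captured with jmap or programmatically and then opened in MAT; here is a rough sketch using the HotSpot diagnostic MXBean (assumes a HotSpot JVM; the output path is just an example):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean diagnostics = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);

        // Writes a binary heap dump (live objects only) that MAT can open;
        // the output path is just an example.
        diagnostics.dumpHeap("/tmp/mule-heap.hprof", true);
    }
}
```

In MAT, the leak suspects report or dominator tree should make it clear whether SpanStore instances are retaining a growing amount of heap.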
Another reason to test is the upgrade of the underlying Java APM agent that I made, from 1.10 to 1.21.
Let me know how it goes, and feel free to raise issues directly in the GitHub repo.