INFO - Elastic APM Mule3 Agent impact on Mule performance (latency)

gchiartosini · April 27, 2022, 1:20pm

Summary:

looking for help/information about the impact of the custom Mule APM Agent on application performance.

Context:

we have a 7.12.1 Elastic Stack installed "on premise" on dedicated VMs in a Google Cloud Platform environment;
stack includes a standalone (legacy) APM Server binary, that is monitoring several systems via instances of the Java APM Agent (elastic-apm-agent-1.30.jar);
we need to monitor a MuleSoft application, that is also installed on dedicated VMs in a Google Cloud Platform environment;

Problem:

since the Java APM Agent can not work correctly with MuleSoft applications, we were directed to Michael Hyatt's custom Mule APM Agent (elastic-apm-mule3-agent) @Michael_Hyatt, see references:
-- https://twitter.com/elastic/status/1133404592064983041
-- Monitoring Mule flows with Elastic APM and the Elastic Stack | Elastic Blog
-- GitHub - michaelhyatt/elastic-apm-mule3-agent: Elastic APM agent for Mule 3.x
the custom agent was installed and configured correctly in the MuleSoft environment, collecting data for the APM;
weeks later, the MuleSoft application's Dev team complained about a significant degradation in transaction latency, which they attributed to the addition of the Mule APM Agent;
as the latency degradation reached a critical point, the Mule APM Agent installation was rolled back to prevent further issues;
the application appears to have been performing nominally since;
because of the circumstances of the Agent's rollback, we do not have reliable data about the performance differences among "before", "after" and "back again".
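
For any future rollout, one way to avoid this gap would be to record transaction latencies for each phase and compare them with the same summary statistics. The following is only a minimal sketch under stated assumptions: the phase names ("before", "with_agent", "rolled_back"), the function names and the choice of median and 95th percentile are all hypothetical and not taken from the agent or from MuleSoft.

```python
from statistics import median, quantiles


def latency_summary(samples_ms):
    """Return the median and 95th percentile of latency samples (milliseconds)."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = quantiles(samples_ms, n=20, method="inclusive")[-1]
    return {"median": median(samples_ms), "p95": p95}


def compare_phases(phases, baseline="before"):
    """Summarise each phase and relate its p95 to the baseline phase.

    phases: dict mapping a phase name (hypothetical labels such as
    "before", "with_agent", "rolled_back") to a list of latencies in ms.
    """
    base = latency_summary(phases[baseline])
    report = {}
    for name, samples in phases.items():
        summary = latency_summary(samples)
        # A ratio well above 1.0 suggests the phase is slower than the baseline.
        summary["p95_ratio_vs_baseline"] = summary["p95"] / base["p95"]
        report[name] = summary
    return report
```

Calling `compare_phases` with one list of measured latencies per phase gives a like-for-like comparison, which is the data that was missing after the rollback.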

Questions:

does anyone have the custom Mule APM Agent installed in a Production environment?
if so, did you experience performance issues with MuleSoft (esp. transaction latency) as a consequence?
is anyone aware of any specific points for attention when setting up the Mule APM Agent, in addition to the documentation linked above?
can anyone point us to further resources on the Mule APM Agent?

Thanks in advance everyone.

Michael_Hyatt · April 27, 2022, 10:45pm

Hi Giada,

Feel free to raise an issue in the GitHub repo and I will be happy to have a look at the issues you are seeing.

Can you please help me reproduce the degradation?

BTW, this is my hobby project rather than something officially supported by Elastic.

riccup · April 29, 2022, 3:33pm

Michael, thank you for reaching out.

The MuleSoft system where we had the transaction latency issue is managed by another team from another company. So, to speed things up before we file a GitHub issue, we will ask them in advance for the information you may need to make sense of the situation.

We have prepared this list of items that we think would be relevant to your analysis; please feel free to add and/or modify as necessary, and we will immediately forward the request to the MuleSoft team on your behalf.

MuleSoft version
List of MuleSoft components installed
for each module, number of Virtual Machines where it is installed (and which modules share the same VM)
for each Virtual Machine: OS version, JVM version, Disk size, Memory size, number of cores, etc;
for each Virtual Machine: current resources (disk, memory, CPU) usage (after removal of the Agent);
for each Virtual Machine: resources usage (disk, memory, CPU) when the issue occurred;
other?

On "our" side of the integration, the Elastic stack is configured as follows:

all modules are Elastic version 7.12.1 (Platinum Licence), running on the default OpenJDK 11 JVM on Debian 10 Virtual Machines on Google Cloud Platform;
standalone (legacy) APM Server binary (1 VM, shared with Heartbeat);
Heartbeat (1 VM, shared with the APM Server);
Logstash (4 VMs, shared with Kibana);
Kibana (4 VMs, shared with Logstash);
Elasticsearch (10 VMs);
The APM Server is currently monitoring 4 other external applications via the standard Java APM agent, with no issues either side;

Thank you very much.

gchiartosini · May 11, 2022, 10:52am

Hi Michael,
we have requested the information in the list above from the MuleSoft team.
As soon as we have it, we will open the issue on GitHub.

Thanks in advance.

Michael_Hyatt · May 25, 2022, 12:02am

BTW, I found the potential source of the memory leak. There is a new release of the agent, v1.21.1, that addresses this issue.

riccup · May 25, 2022, 2:27pm

Hello, Michael.

THANK YOU for taking the time to look into the issue and apply improvements.

The MuleSoft team sent us some information about their Production environment this morning, in case it is still of use to you.

Their application runs on MuleSoft Runtime Standalone (version 3.8.4), installed on a total of 10 CentOS Linux machines (release 7.9.2009), each equipped with 8 CPUs, 48 GB RAM and 80 GB HD. The 10 nodes are organised as follows:

a Cluster with 2 nodes (Java 1.8.0_191);
a Server Group with 6 nodes (Java 1.8.0_322);
2 backup nodes (Java 1.8.0_322);

We are going to suggest that they try to install the improved v1.21.1 of your Agent, and see if the latency issue is solved.

However, they also added that they are preparing to migrate the whole application to the later MuleSoft RTF Runtime version 3.9.5.

The RTF Runtime would run on Java 1.8.0_282 on a total of 18 RHEL 8.4 machines, each equipped with 2 CPUs, 16 GB RAM. The nodes will be organised as follows:

3 Controller nodes, with 3 HD each: 80 GB (OS), 250 GB (Docker), 60 GB (etcd);
15 Worker nodes, with 2 HD each: 80 GB (OS), 250 GB (Docker);

Would your elastic-apm-mule3-agent v1.21.1 still be compatible with that?

Thank you immensely for your support.

Michael_Hyatt · May 25, 2022, 11:40pm

For Mule 3 there are 2 versions of the agent - one for version 3.8 and another one for version 3.9. I applied the fix in both.

Also, I recommend testing the fix first to validate that it fixes the issue in your environment. I used the Eclipse Memory Analyzer (an Eclipse Foundation open source project) to confirm that the SpanStore class was indeed the memory leak culprit in v1.19 but didn't appear to be causing issues with v1.21.

Another reason to test is that I upgraded the Java APM agent from 1.10 to 1.21.
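
As a rough complement to heap-dump analysis, a test run can also be checked for steady heap growth, which is a common symptom of a leak. The sketch below is only an illustration under stated assumptions: the function names, the 1 MB/hour threshold and the idea of feeding it post-GC heap readings (for example taken from GC logs) are hypothetical and not part of the agent.

```python
def heap_growth_mb_per_hour(samples):
    """Least-squares slope of heap usage (MB) against elapsed time (hours).

    samples: list of (hours_since_start, heap_used_mb_after_gc) pairs.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_h = sum(h for _, h in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must cover more than one point in time")
    cov = sum((t - mean_t) * (h - mean_h) for t, h in samples)
    return cov / var_t


def looks_like_leak(samples, threshold_mb_per_hour=1.0):
    """Flag a run whose post-GC heap keeps growing faster than the threshold.

    The threshold is an arbitrary, hypothetical default; tune it per system.
    """
    return heap_growth_mb_per_hour(samples) > threshold_mb_per_hour
```

Running the same load against the old and the fixed agent version and comparing the two slopes gives a quick signal before opening a heap dump. A flat slope does not prove the absence of a leak, so the heap-dump check remains the stronger evidence.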

Let me know how it goes and feel free to raise issues directly in the GitHub repo.

system · June 15, 2022, 7:40pm

This topic was automatically closed 20 days after the last reply. New replies are no longer allowed.
