INFO - Elastic APM Mule3 Agent impact on Mule performance (latency)

Summary:

  • looking for help/information about the impact of the custom Mule APM Agent on application performance.

Context:

  • we have a 7.12.1 Elastic Stack installed "on premise" on dedicated VMs in a Google Cloud Platform environment;
  • the stack includes a standalone (legacy) APM Server binary, which monitors several systems via instances of the Java APM Agent (elastic-apm-agent-1.30.jar);
  • we need to monitor a MuleSoft application, which is also installed on dedicated VMs in a Google Cloud Platform environment.

Problem:

Questions:

  • does anyone have the custom Mule APM Agent installed in a Production environment?
  • if so, did you experience performance issues with MuleSoft (esp. transaction latency) as a consequence?
  • is anyone aware of any specific points for attention when setting up the Mule APM Agent, in addition to the documentation linked above?
  • can anyone point us to further resources on the Mule APM Agent?

Thanks in advance everyone.

Hi Giada,

Feel free to raise an issue in the GitHub repo and I will be happy to have a look at the issues you are seeing:

Can you please help me reproduce the degradation?

BTW, this is my hobby project rather than something officially supported by Elastic.

Michael, thank you for reaching out.

The MuleSoft system where we had the transaction latency issue is managed by another team from another company. To speed things up, before we file a GitHub issue we will ask them in advance for the information you may need to make sense of the situation.

We have prepared this list of items that we think would be relevant to your analysis; please feel free to add and/or modify as necessary, and we will immediately forward the request to the MuleSoft team on your behalf.

  • MuleSoft version
  • List of MuleSoft components installed
  • for each component, number of Virtual Machines where it is installed (and which components share the same VM)
  • for each Virtual Machine: OS version, JVM version, Disk size, Memory size, number of cores, etc.;
  • for each Virtual Machine: current resources (disk, memory, CPU) usage (after removal of the Agent);
  • for each Virtual Machine: resources usage (disk, memory, CPU) when the issue occurred;
  • other?
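
Several of the per-VM items above (OS version, JVM version, cores, memory, disk) can be read from inside any Java process on the machine. The following is only a minimal sketch of that idea: the class name `EnvReport` and the key names are hypothetical, it is not part of Mule or of the APM agent, and the heap figures describe the JVM running the snippet rather than the physical RAM of the VM.

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Mule or the APM agent): gathers basic
// host and JVM facts as seen from inside a Java process.
class EnvReport {

    static Map<String, String> collect() {
        Runtime rt = Runtime.getRuntime();
        File root = new File("/"); // assumption: the root filesystem is the disk of interest
        long mb = 1024L * 1024L;
        long gb = mb * 1024L;

        Map<String, String> info = new LinkedHashMap<>();
        info.put("os.name", System.getProperty("os.name"));
        info.put("os.version", System.getProperty("os.version"));
        info.put("java.version", System.getProperty("java.version"));
        info.put("java.vm.name", System.getProperty("java.vm.name"));
        info.put("cpu.cores", String.valueOf(rt.availableProcessors()));
        // Heap limits of THIS JVM, not the physical memory of the machine.
        info.put("heap.max.mb", String.valueOf(rt.maxMemory() / mb));
        info.put("heap.used.mb", String.valueOf((rt.totalMemory() - rt.freeMemory()) / mb));
        info.put("disk.total.gb", String.valueOf(root.getTotalSpace() / gb));
        info.put("disk.free.gb", String.valueOf(root.getUsableSpace() / gb));
        return info;
    }

    public static void main(String[] args) {
        collect().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Running it once per VM, with the same Java installation that Mule uses, would give comparable values for each node.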

On "our" side of the integration, the Elastic stack is configured as follows:

  • all modules are Elastic version 7.12.1 (Platinum Licence), running on the default OpenJDK 11 JVM on Debian 10 Virtual Machines on Google Cloud Platform;
  • standalone (legacy) APM Server binary (1 VM, shared with Heartbeat);
  • Heartbeat (1 VM, shared with the APM Server);
  • Logstash (4 VMs, shared with Kibana);
  • Kibana (4 VMs, shared with Logstash);
  • Elasticsearch (10 VMs);
  • The APM Server is currently monitoring 4 other external applications via the standard Java APM agent, with no issues either side;

Thank you very much.

Hi Michael,
we have requested the information in the list above from the MuleSoft team.
As soon as we have the info we will open the issue on GitHub.

Thanks in advance.

BTW, I found the potential source of the memory leak. There is a new release of the agent, v1.21.1, that addresses this issue.

Hello, Michael.

THANK YOU for taking the time to look into the issue and apply improvements.

The MuleSoft team sent us some information about their Production environment this morning, in case it is still of use to you.

Their application runs on MuleSoft Runtime Standalone (version 3.8.4), installed on a total of 10 CentOS Linux machines (release 7.9.2009), each equipped with 8 CPUs, 48 GB RAM and 80 GB HD. The 10 nodes are organised as follows:

  • a Cluster with 2 nodes (Java 1.8.0_191);
  • a Server Group with 6 nodes (Java 1.8.0_322);
  • 2 backup nodes (Java 1.8.0_322);

We are going to suggest that they try to install the improved v1.21.1 of your Agent, and see if the latency issue is solved.
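
One way to judge whether the latency issue is solved is to compare latency percentiles for the same flows with and without the agent installed. The sketch below is only an illustration under assumptions: the class `LatencyStats` is hypothetical, the sample values in `main` are made up, and how the per-transaction timings are collected (access logs, a load-test tool, etc.) is left open.

```java
import java.util.Arrays;

// Hypothetical helper: nearest-rank percentiles over latency samples (ms).
class LatencyStats {

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Made-up numbers, for illustration only.
        long[] withoutAgent = {12, 15, 11, 14, 13, 18, 12, 16, 14, 13};
        long[] withAgent = {13, 16, 12, 15, 14, 40, 13, 17, 15, 14};

        for (double p : new double[] {50, 95}) {
            System.out.println("p" + (int) p
                    + " without agent: " + percentile(withoutAgent, p) + " ms"
                    + ", with agent: " + percentile(withAgent, p) + " ms");
        }
    }
}
```

Percentiles are used here rather than averages because a leak-induced slowdown tends to show up first in the slowest transactions.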

However, they also added that they are preparing to migrate the whole application to the later MuleSoft RTF Runtime, version 3.9.5.

The RTF Runtime would run on Java 1.8.0_282 on a total of 18 RHEL 8.4 machines, each equipped with 2 CPUs, 16 GB RAM. The nodes will be organised as follows:

  • 3 Controller nodes, with 3 HD each: 80 GB (OS), 250 GB (Docker), 60 GB (etcd);
  • 15 Worker nodes, with 2 HD each: 80 GB (OS), 250 GB (Docker);

Would your elastic-apm-mule3-agent v1.21.1 still be compatible with that?

Thank you immensely for your support.

For Mule 3 there are 2 versions of the agent - one for version 3.8 and another one for version 3.9. I applied the fix in both.

Also, I recommend testing the fix first to validate that it fixes the issue in your environment. I used the Eclipse Memory Analyzer to confirm that the SpanStore class was indeed the memory leak culprit in v1.19 and that it didn't appear to be causing issues with v1.21.
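
As a lighter-weight complement to a heap dump, heap usage can be sampled over time with the standard `java.lang.management` API: a leak usually shows as used heap that keeps climbing even after garbage collection. This is only a sketch under assumptions: the class `HeapTrend`, the sampling interval and the tolerance are hypothetical, the code has to run inside the JVM being observed to be meaningful, and `gc()` is a request that the JVM may ignore.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Hypothetical helper: samples used heap after a GC request and checks
// whether the samples keep growing.
class HeapTrend {

    static long usedHeapAfterGc() {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        mem.gc(); // only a request; the JVM is free to ignore it
        return mem.getHeapMemoryUsage().getUsed();
    }

    // True when every sample exceeds the previous one by more than
    // toleranceBytes, i.e. the heap never settles back down.
    static boolean growsSteadily(long[] samples, long toleranceBytes) {
        if (samples.length < 2) {
            return false;
        }
        for (int i = 1; i < samples.length; i++) {
            if (samples[i] - samples[i - 1] <= toleranceBytes) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        long[] samples = new long[5];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = usedHeapAfterGc();
            System.out.println("sample " + i + ": " + samples[i] / (1024 * 1024) + " MB used");
            Thread.sleep(1000); // assumption: real checks would use much longer intervals
        }
        System.out.println("steady growth: " + growsSteadily(samples, 1024 * 1024));
    }
}
```

A steady-growth result is only a hint; a heap dump is still needed to identify which class is retaining the memory.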

Another reason to test is the upgrade of the Java APM agent I made, from 1.10 to 1.21.

Let me know how it goes and feel free to raise issues directly in the GitHub repo.