Is there anything special in your setup? For example, are you using the Logstash or Kafka outputs? Are you using a load balancer in front of the APM Servers? Have you changed index pattern, generated custom templates, changed agent configuration etc.
No
Description of the problem including expected versus actual behavior. Please include screenshots (if relevant):
We've got the RUM (JS) agent running in our Rails application, and it was running fine. Recently we've had some issues with our stack because Elasticsearch storage filled up and could no longer accept new index creation. The problem I want help with is that, with Elasticsearch unable to accept new events, the RUM (JS) library started throwing 503s with "Failed sending transactions", and these errors prevented the rest of the JS on the page from loading properly, breaking the user experience. Shouldn't the library catch these errors and only log them, so that they don't disrupt the execution context? We are currently fine-tuning our storage process, so we might run into this problem again, and we can't have the user experience affected by it. What do you recommend?
Thanks for reporting the issue.
The RUM agent should not interfere with your application even if it cannot send any data to the APM Server. Currently the agent only logs an error if the request to send APM data to the APM Server fails. Would you please explain a bit more about how the agent prevents the rest of the page from loading properly? Please also provide the relevant parts of your code, if possible.
As a reminder, we have moved our npm package to @elastic/apm-rum and version 4.1.1 is our latest version. Please make sure to use our latest version as we have made many bug fixes and improvements.
We've got a Rails app that serves a few pages with ReactJS. After we added elasticApm, the server went through some storage issues (it couldn't accept new data), the library started throwing the 503, and the React app stopped loading properly on certain pages. At first glance they don't seem to share the same execution context, so a JS failure in one shouldn't interfere with the other, but the results seem to indicate otherwise. I haven't looked at the library yet, but I was wondering whether the APM client registers callbacks on web events that could be shared with some React lib, so that they interfere with one another.
There is nothing special about our setup. We inject our ElasticAPM snippet as a Rails HAML view.
We do patch some browser APIs, such as fetch and XMLHttpRequest, but these patches should not interfere with the application. Furthermore, the agent handles 503 errors from the APM Server: it only logs them to the console and should not otherwise have any effect on the application.
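To illustrate what a non-interfering patch means here, below is a minimal sketch of wrapping `fetch` so that a failure inside the instrumentation is caught and logged while the application's own call still goes through. This is not the agent's actual code; `patchFetch` and `onRequest` are hypothetical names used only for this example.

```javascript
// Hypothetical helper, NOT the RUM agent's real implementation.
// Wraps target.fetch so that the instrumentation hook can never break the caller.
function patchFetch(target, onRequest) {
  const originalFetch = target.fetch;

  target.fetch = function patchedFetch(...args) {
    try {
      // Instrumentation hook: observe the request before it is sent.
      onRequest(args[0]);
    } catch (err) {
      // Swallow and log instrumentation errors instead of propagating them.
      console.warn('instrumentation error ignored:', err.message);
    }
    // Always delegate to the original implementation, unchanged.
    return originalFetch.apply(this, args);
  };

  // Return a function that restores the original fetch.
  return function unpatch() {
    target.fetch = originalFetch;
  };
}
```

In a browser, `target` would be `window`; passing it in explicitly just makes the sketch easy to try against a fake object.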
Is it possible for you to create a reproduction script online (e.g. on codesandbox.io)?
@Hamidreza I tried to simulate the problem locally yesterday to investigate further, but I haven't been able to replicate the behavior yet. I set the indexes to read_only_allow_delete, which is what Elasticsearch does when the cluster.routing.allocation.disk.watermark.flood_stage threshold is reached (our original issue). However, even though the indexes were not being populated, the rum.js agent wasn't logging anything and the POST requests to /events were succeeding. Do you have any other ideas on how I can make those requests fail locally?
Since the RUM agent communicates with the APM Server and not directly with Elasticsearch, it should be enough to simulate a non-responding Elasticsearch instance from the APM Server's point of view. The server uses an internal memory queue to temporarily buffer events before writing to ES, e.g. to overcome temporary connection problems. If the server is configured to write to an unreachable ES instance, the memory queue will fill up over time. When the queue is full, the server returns a 503.
You can simulate that by changing the following settings in the apm-server.yml:
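Separately from the server settings, another way to exercise the agent's failure path locally is to point it at a stub endpoint that always answers 503. The following is a minimal Node.js sketch under stated assumptions: `handleRequest` and `createStubServer` are hypothetical names, the response body is made up and does not mirror the real APM Server error format, and the CORS headers are a guess at what a browser preflight needs.

```javascript
const http = require('http');

// Assumed CORS headers so that a browser on another origin can see the 503.
const CORS_HEADERS = {
  'Access-Control-Allow-Origin': '*',
  'Access-Control-Allow-Methods': 'POST, OPTIONS',
  'Access-Control-Allow-Headers': 'Content-Type, Content-Encoding, Accept',
};

// Hypothetical stand-in for an overloaded APM Server.
function handleRequest(req, res) {
  req.resume(); // discard the request body

  if (req.method === 'OPTIONS') {
    // Let the CORS preflight pass; otherwise the browser reports a
    // network error instead of the 503 we want to observe.
    res.writeHead(204, CORS_HEADERS);
    res.end();
    return;
  }

  // Every real request is rejected with 503 Service Unavailable.
  res.writeHead(503, { ...CORS_HEADERS, 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ error: 'stub: queue is full' })); // made-up body
}

// Builds the server without starting it; the caller decides on the port.
function createStubServer() {
  return http.createServer(handleRequest);
}
```

Calling `createStubServer().listen(8200)` and setting the agent's `serverUrl` to `http://localhost:8200` should make every send attempt fail, so you can watch whether the rest of the page is affected.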