Our team was wondering what you all believe is the best method of failover for the APM servers. We are all right with losing the packets when the server goes down; that is not the issue. However, our agent implementation was using HTTP calls, which meant that when the server went down, all of our applications started making blocking HTTP calls that persisted until the HTTP timeout.
We mitigated this problem by moving all of our HTTP sends onto a separate thread. This allows a send to fail without affecting the user. Is this how it is handled in the other agents? What do you recommend to make certain that the server going down doesn't affect the code APM is monitoring?
I think different agents handle it differently, especially considering that some agents' runtimes don't have threads (for example, JavaScript-based runtimes: the RUM and Node.js agents). The two possible approaches are: (1) blocking I/O on a dedicated thread and (2) asynchronous I/O. The advantage of the asynchronous I/O approach, assuming of course that your runtime provides an asynchronous HTTP client, is that you can execute multiple tasks without wasting a blocked thread on each one. For example, in the future you might want to integrate your agent with APM Agent configuration; if you use a blocking HTTP client, you will need yet another thread for that task.
Of course, the difference between the asynchronous and blocking approaches is not that significant in this case (unlike, let's say, a web server trying to serve thousands of clients), since an agent most likely won't need to run more than a handful of tasks concurrently. So if you find it easier to solve the problem by offloading communication with APM Server to one or more other threads, while still using the same blocking HTTP client on the dedicated thread, it will work.
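To make approach (1) concrete, here is a minimal Python sketch of a dedicated sender thread fed by a bounded queue. It is only an illustration under stated assumptions, not how any particular agent is implemented: the `BackgroundSender` class, its parameters, and the injectable `post` callable are all hypothetical names. The idea is that application code only ever does a non-blocking enqueue, events are dropped when the queue is full (which matches your tolerance for lost packets), and the blocking HTTP call with its timeout happens only on the worker thread.

```python
import queue
import threading
import urllib.request


class BackgroundSender:
    """Hypothetical helper: the caller enqueues, a worker thread does the HTTP."""

    def __init__(self, url, max_queue=1000, timeout=2.0, post=None):
        self._url = url
        self._timeout = timeout
        # Bounded queue: memory stays capped while the server is down.
        self._queue = queue.Queue(maxsize=max_queue)
        # `post` can be swapped out (for example in tests); defaults to HTTP POST.
        self._post = post or self._http_post
        self._lock = threading.Lock()
        self.dropped = 0
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def send(self, payload: bytes) -> bool:
        """Called from application code; never blocks on the network."""
        try:
            self._queue.put_nowait(payload)
            return True
        except queue.Full:
            self._count_drop()
            return False

    def flush(self):
        """Block until every queued payload has been sent or has failed."""
        self._queue.join()

    def _count_drop(self):
        with self._lock:
            self.dropped += 1

    def _run(self):
        # Worker loop: the only place where a blocking call is made.
        while True:
            payload = self._queue.get()
            try:
                self._post(payload)
            except Exception:
                # Server down or timed out: drop the event and move on.
                self._count_drop()
            finally:
                self._queue.task_done()

    def _http_post(self, payload):
        req = urllib.request.Request(self._url, data=payload, method="POST")
        with urllib.request.urlopen(req, timeout=self._timeout) as resp:
            resp.read()
```

Usage would look something like `sender = BackgroundSender("http://apm.example.invalid/events")` followed by `sender.send(b"...")` from request-handling code, where the URL is a placeholder. Because the worker is a daemon thread, it will not keep the process alive at shutdown; call `flush()` first if you want to give queued events a chance to go out.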