Server failover, suggested method

Kibana version: 7.3

Elasticsearch version: 7.3

APM Server version: 7.3

APM Agent language and version: Intake API

Our team was wondering what you all believe is the best method of failover for the APM Servers. We are fine with losing the packets when the server goes down; that is not the issue. However, our agent implementation was using HTTP calls, which meant that when the server went down, all of our applications started making blocking HTTP calls that persisted until the HTTP timeout.

We mitigated this problem by moving all of our HTTP sends onto a separate thread. This allows a send to fail without affecting the user. Is this how it is handled in the other agents? What do you recommend to make certain that the server going down doesn't affect the code APM is monitoring?
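For readers who want to picture the mitigation described above, here is a minimal sketch in Python. It is not our actual agent code: the `BackgroundSender` class, the injected `send` callable (imagined as a blocking HTTP POST with a timeout), and the `max_queue` limit are all hypothetical names chosen for illustration. The bounded queue reflects that we are fine with losing events while the server is down.

```python
import queue
import threading


class BackgroundSender:
    """Hand events to a worker thread so the monitored code never waits on the network."""

    def __init__(self, send, max_queue=1000):
        # `send` is a hypothetical blocking callable, e.g. an HTTP POST with a timeout.
        self._send = send
        self._queue = queue.Queue(maxsize=max_queue)
        self.dropped = 0
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def enqueue(self, event):
        """Never blocks: if the queue is full, the event is dropped and False is returned."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _run(self):
        while True:
            event = self._queue.get()
            if event is None:  # sentinel placed by close()
                break
            try:
                self._send(event)
            except Exception:
                # Server down or timed out: losing the event is acceptable here.
                pass

    def close(self, timeout=None):
        """Ask the worker to stop. This call may wait, so keep it off hot paths."""
        self._queue.put(None)
        self._worker.join(timeout)
```

Application code would call `sender.enqueue(event)` and carry on; only the worker thread ever sits in the HTTP timeout.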

I think different agents handle it differently, especially considering that some agents' runtimes don't have threads (for example JavaScript-based runtimes: the RUM and Node.js agents). The two possible approaches are (1) blocking I/O on a dedicated thread and (2) asynchronous I/O. The advantage of asynchronous I/O, assuming of course that your runtime provides an asynchronous HTTP client, is that you can execute multiple tasks without wasting a blocked thread on each one. For example, in the future you might want to integrate your agent with APM Agent configuration; if you use a blocking HTTP client, you will need yet another thread for that task.

Of course, the difference between the asynchronous and blocking approaches is not that significant in this case (unlike, say, a web server trying to serve thousands of clients), since the agent most likely won't need to run more than a handful of tasks concurrently. So if you find it easier to solve the problem by offloading communication with APM Server to another thread (or threads) while still using the same blocking HTTP client on the dedicated thread, it will work.
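To make the asynchronous option a bit more concrete, here is a rough sketch using Python's `asyncio`. It is only an illustration under assumptions, not how any official agent is implemented: `send` and `fetch` are hypothetical coroutines standing in for an async HTTP client, and the function names are made up. The point is that the sender and a second task (such as polling agent configuration) share one thread, and a hung server costs at most `timeout` seconds per event.

```python
import asyncio


async def send_with_timeout(send, event, timeout=1.0):
    """Await a hypothetical async `send`; report failure instead of raising."""
    try:
        await asyncio.wait_for(send(event), timeout)
        return True
    except (asyncio.TimeoutError, OSError):
        return False  # server down or too slow: the event is lost


async def sender_loop(events, send, timeout=1.0):
    """Drain an asyncio.Queue until a None sentinel; return (sent, lost) counts."""
    sent = lost = 0
    while True:
        event = await events.get()
        if event is None:
            return sent, lost
        if await send_with_timeout(send, event, timeout):
            sent += 1
        else:
            lost += 1


async def poll_config(fetch, interval, rounds):
    """A second task on the same thread: call a hypothetical async `fetch` a few times."""
    seen = []
    for _ in range(rounds):
        seen.append(await fetch())
        await asyncio.sleep(interval)
    return seen
```

Both loops can then be run together with `asyncio.gather(sender_loop(q, send), poll_config(fetch, 30, rounds))`; while one task is waiting on the network, the event loop simply runs the other.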

