Server fail over, suggested method

LordMathis · December 13, 2019, 8:16pm

Kibana version: 7.3

Elasticsearch version: 7.3

APM Server version: 7.3

APM Agent language and version: Intake API

Our team was wondering what you all believe is the best method of failover for the APM servers. We are all right with losing the packets when the server goes down; that is not the issue. But our agent implementation was using HTTP calls, which meant that when the server went down, all of our applications started making blocking HTTP calls that persisted until the HTTP timeout.

We mitigated this problem by moving all of our HTTP sends into a thread. This allows a send to fail without affecting the user. Is this how it is handled in the other agents? What do you recommend to make certain that the server going down doesn't affect the code APM is monitoring?
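A minimal sketch of the thread-based mitigation described above, in Python, not the poster's actual implementation. The class `BackgroundSender`, the helper `http_send`, the queue size, the timeout, and the `localhost:8200` URL are all hypothetical names and values chosen for illustration. Because losing payloads is acceptable here, the queue is bounded and `submit` drops data instead of blocking when the worker falls behind.

```python
import queue
import threading
import urllib.request


class BackgroundSender:
    """Hypothetical helper: ships payloads from one dedicated worker thread."""

    def __init__(self, send, max_pending=100):
        self._send = send  # callable that performs the (blocking) send
        self._pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0  # approximate count; not synchronized across callers
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, payload):
        """Never blocks the caller; drops the payload if the queue is full."""
        try:
            self._pending.put_nowait(payload)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _run(self):
        while True:
            payload = self._pending.get()
            try:
                self._send(payload)
            except Exception:
                # Server down or timed out: losing this payload is acceptable.
                pass
            finally:
                self._pending.task_done()

    def flush(self):
        """Block until every queued payload has been attempted (for shutdown/tests)."""
        self._pending.join()


def http_send(payload, url="http://localhost:8200/intake/v2/events", timeout=2.0):
    """Blocking POST with a short timeout. The URL and timeout are assumptions."""
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        response.read()
```

Application code would create one `BackgroundSender(http_send)` at startup and call `submit(payload)` wherever it previously made the HTTP call directly. The short timeout limits how long the worker is stuck on a dead server, and the bounded queue keeps memory use flat during an outage.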

Sergey_Kleyman · December 15, 2019, 10:25pm

I think different agents handle it differently, especially since some agent runtimes don't have threads (for example, the JavaScript-based runtimes behind the RUM and Node.js agents). The two possible approaches are (1) blocking I/O on a dedicated thread and (2) asynchronous I/O. The advantage of asynchronous I/O, assuming of course that your runtime provides an asynchronous HTTP client, is that you can execute multiple tasks without wasting a blocked thread on each one. For example, in the future you might want to integrate your agent with APM Agent configuration; if your HTTP client is blocking, you will need yet another thread for that task.

Of course, the difference between the asynchronous and blocking approaches is not that significant in this case (unlike, say, a web server trying to serve thousands of clients), since an agent most likely won't need to run more than a handful of tasks concurrently. So if you find it easier to solve the problem by offloading communication with APM Server to one or more other threads, while still using the same blocking HTTP client on the dedicated thread, it will work.
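To illustrate approach (2), here is a rough sketch using Python's `asyncio`, where event shipping and configuration polling share one event loop instead of each holding a blocked thread. It is not how any official agent is written. The functions `guarded`, `ship_events`, `poll_config`, and `run_agent` are hypothetical, and `send_event` and `fetch_config` stand in for coroutines that a real asynchronous HTTP client would have to provide.

```python
import asyncio


async def guarded(awaitable, timeout):
    """Await one I/O call; report (ok, value) instead of raising on failure."""
    try:
        return True, await asyncio.wait_for(awaitable, timeout)
    except (asyncio.TimeoutError, OSError):
        return False, None


async def ship_events(events, send_event, timeout):
    """Send queued payloads until a None sentinel; failed sends are dropped."""
    delivered = 0
    while True:
        payload = await events.get()
        if payload is None:
            return delivered
        ok, _ = await guarded(send_event(payload), timeout)
        if ok:
            delivered += 1


async def poll_config(fetch_config, rounds, interval, timeout):
    """Poll for configuration a few times, keeping the last good result."""
    latest = None
    for _ in range(rounds):
        ok, value = await guarded(fetch_config(), timeout)
        if ok:
            latest = value
        await asyncio.sleep(interval)
    return latest


async def run_agent(payloads, send_event, fetch_config, timeout=2.0):
    """Run both tasks concurrently on a single event loop (no extra threads)."""
    events = asyncio.Queue()
    for payload in payloads:
        events.put_nowait(payload)
    events.put_nowait(None)  # sentinel: stop after the queued payloads
    return await asyncio.gather(
        ship_events(events, send_event, timeout),
        poll_config(fetch_config, rounds=3, interval=0.01, timeout=timeout),
    )
```

If `send_event` hangs because the server is down, `asyncio.wait_for` cancels it after `timeout` seconds while `poll_config` keeps running on the same loop. That is the benefit described above: a second task does not cost a second thread.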

