[heartbeat] How to performance test heartbeat

Hi Elastic friends,
I have a question about the performance of heartbeat.
Assume there are 90K targets to be monitored via heartbeat. In some cases 30% to 50% of the targets, or even worse 60% to 80%, are very slow to access from heartbeat. What will happen to heartbeat? And is there an easy way to simulate this scenario?
We want to reproduce this scenario without losing any workload when the target count is close to 90K.

Any suggestions?

Hello Guys, any thoughts here?

Apologies for the delay. Assuming HTTP or TCP, if some hosts are slower than others, the main consequence is that heartbeat will keep additional file descriptors allocated for each host. This will require some level of benchmarking on your end if you want to be confident that it will work fine.

It's a tough thing to simulate, but it is possible. What I would do is spin up a few VMs on the cloud as test targets and send 20,000 monitors to each. I would just run nginx on the VMs. Then, I would generate a simple heartbeat config with the full list of monitors. To simulate a slow connection you could use qdiscs on Linux. For an example, see the Server Fault question "Simulating a slow connection with tc".
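If shaping traffic with tc is not convenient, an application-level delay is another way to get a slow target. The following is only a minimal sketch of that swapped-in technique, not something heartbeat or nginx provides: a small Python HTTP server that sleeps before every reply. The names make_slow_server and probe, and the delay values, are made up for illustration.

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_slow_server(delay_s, port=0):
    """Return an HTTP server that sleeps delay_s seconds before each reply.

    port=0 lets the OS pick a free port (read it from server_address).
    """

    class SlowHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            time.sleep(delay_s)  # hold the connection open, like a slow host
            body = b"ok\n"
            try:
                self.send_response(200)
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
            except OSError:
                pass  # the client gave up before the reply was sent

        def log_message(self, fmt, *args):
            pass  # stay quiet when many checks hit the server

    return ThreadingHTTPServer(("127.0.0.1", port), SlowHandler)


# Opener that ignores proxy environment variables, so local probes go direct.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url, timeout_s):
    """Return True if url answers with 200 within timeout_s, else False."""
    try:
        with _opener.open(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:  # URLError and socket timeouts are both OSError subclasses
        return False
```

To use it, run serve_forever() on the returned server (in a thread or its own process) and point the monitors that should time out at its port. A delay longer than the monitor timeout should make every check against it fail, while other targets stay untouched.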

You may also want to set heartbeat.scheduler.limit (see docs).
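Generating the config for that many monitors by hand is not practical, so here is a rough sketch of a generator. It assumes HTTP monitors, hypothetical host names such as t00001, and a 60s schedule; the limit value is a placeholder to tune through benchmarking, not a recommendation.

```python
def target_names(count, prefix="t", width=5):
    """Build hypothetical host names such as t00001, t00002, ..."""
    return [f"{prefix}{i:0{width}d}" for i in range(1, count + 1)]


def render_config(hosts, schedule="@every 60s", timeout="10s", limit=None):
    """Render a heartbeat.yml fragment with one HTTP monitor per host."""
    lines = []
    if limit is not None:
        # Optional cap on how many checks the scheduler runs at once.
        lines += ["heartbeat.scheduler:", f"  limit: {limit}", ""]
    lines.append("heartbeat.monitors:")
    for host in hosts:
        lines += [
            "- type: http",
            f"  id: {host}",
            f"  name: {host}",
            f"  schedule: '{schedule}'",
            f"  urls: ['http://{host}']",
            f"  timeout: {timeout}",
        ]
    return "\n".join(lines) + "\n"


def write_config(path, hosts, **options):
    """Write the rendered fragment to path."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(render_config(hosts, **options))
```

For example, write_config("monitors.yml", target_names(90000), limit=1000) would produce one file with 90,000 monitors. Check the rendered fields against the heartbeat docs for your version before relying on it.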

ICMP does not require the allocation of file descriptors so it scales much further.

We'd be really curious to see the results of a test like this!

Hi Andrew, thanks for your suggestion.

However, I was wondering if we could add an option on the heartbeat side, such as delay, at the same level as timeout?
That means heartbeat would wait for the delay period before starting each icmp/port/url check.
We could then set delay very close to timeout for the targets we want to test (say 30% or 40% of all targets), so that heartbeat is unable to get a response from them in time.

I'd appreciate it if you could comment on this solution.

Hello Andrew, we would very much appreciate it if you could comment on our proposal.

Apologies for the delay, I'm just back from vacation :-).

I'm unclear as to what the difference is between delay and the current schedule option.

Take this example config:

schedule: 36 * * * * * *
hosts: target
timeout: 10s

The current logic is:
This target runs at minute 36 of every hour with a timeout of 10s.
That means if the check has not finished by 36:10, it is reported as a timeout.

After adding delay=10s to the above config,
even though this target is scheduled to run at 36:00, it actually starts at 36:10 because of the 10s delay, and it ends up returning a timeout because the timeout period is already up.

We want this delay option to always force a timeout response for specific targets when we run such tests. It would also make it easy to control exactly which targets time out, which the tc command cannot do.
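To make the intended behaviour concrete, here is a tiny model of the proposal. It is only a sketch: delay is the option we are proposing, not an existing heartbeat setting, and the function name check_outcome is made up. It assumes the timeout clock starts at the scheduled time, as in the example above.

```python
def check_outcome(response_s, timeout_s, delay_s=0.0):
    """Model one check whose timeout clock starts at the scheduled time.

    response_s: how long the target takes to answer once the check starts
    timeout_s:  the monitor timeout
    delay_s:    the proposed (hypothetical) wait before the check starts
    """
    finished_at = delay_s + response_s  # seconds after the scheduled start
    return "up" if finished_at < timeout_s else "down"


# A fast target is up without a delay, but down once delay equals the timeout.
print(check_outcome(response_s=0.2, timeout_s=10))
print(check_outcome(response_s=0.2, timeout_s=10, delay_s=10))
```

With delay set equal to (or very close to) timeout, even a healthy target cannot answer in time, so the result does not depend on network conditions.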

Hope my description is clear this time.

Is your main question around controlling concurrency, or something else? I still don't understand the 'why' behind your question. It sounds like scheduler.limit does not solve this problem. Is that right?

Sorry for the confusion.
My main question is:
How can we ensure that the icmp/tcp/http checks on certain dedicated targets always return down=1 for the lnp test?
For example, if we have target hosts named t0001 to t1000,
we want the icmp/tcp/http checks for t0001 to t0500 to return down=1, while the others follow the normal flow.

We want this kind of response for dedicated targets, not for a percentage of them.
That's why we are considering adding a delay option: it guarantees that the checks for a target fail, and it can be set on specific targets.
