Dynamic Sampling Rates

Hello again!

We are wondering if Elastic APM has a way to enable dynamic sampling, that is, the ability to change sampling rates on the fly without restarting services and agents.

One use case for us is increasing sampling rates in the event of transaction errors.
Another is lowering sampling rates during sustained traffic spikes.

We're hoping there's a hook somewhere in the agent that we can tap into that would allow us to do this.

Many thanks!

Ronald

Hi Ronald,

We have been thinking for some time about how to approach this feature, but we have mainly discussed how to provide automatic adaptive sampling, so that you would get it out of the box.

Are you looking for an API that would enable you to implement the sampling logic on your own? If so, what would you imagine such an API to look like?

Thanks,
Eyal.

Hi Eyal,

Thanks for picking up this discussion.

From our point of view here are some ideas to go about it:

  1. A request-response between agent and APM Server. This could be done as a configurable polling frequency (e.g. apm-server.polling.period=sss seconds, default to no polling) with the sampling rate set up via Kibana in the backend. This polling can also be theoretically extended for other config settings that could be changed. Agent can receive a key-value pair for the sampling, like {"agent_sampling": 0.20} for 20%

  2. An API which can be triggered from the application side like ElasticApm.setSampling(0.20)

  3. A utility/service that can be executed to dynamically change the sampling rates: apm-server --sample-rate=0.20
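
To make the first idea a bit more concrete, here is a rough sketch of what the agent side could look like. Everything in it is hypothetical: SamplingConfigPoller, the fetchRemoteConfig supplier (standing in for the request to APM Server) and the "agent_sampling" key are illustrations of the proposal, not an existing Elastic APM API.

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical agent-side poller; not part of any Elastic APM agent. */
final class SamplingConfigPoller {
    // Stands in for the agent-to-server request that returns key-value config.
    private final Supplier<Map<String, Double>> fetchRemoteConfig;
    private volatile double sampleRate;

    SamplingConfigPoller(Supplier<Map<String, Double>> fetchRemoteConfig, double initialRate) {
        this.fetchRemoteConfig = fetchRemoteConfig;
        this.sampleRate = initialRate;
    }

    /** Fetches the config once and applies "agent_sampling" if it is a valid rate in [0, 1]. */
    void pollOnce() {
        Double rate = fetchRemoteConfig.get().get("agent_sampling");
        if (rate != null && rate >= 0.0 && rate <= 1.0) {
            sampleRate = rate;
        }
    }

    /** Polls every periodSeconds; a period of 0 or less means "no polling" and returns null. */
    ScheduledExecutorService start(long periodSeconds) {
        if (periodSeconds <= 0) {
            return null;
        }
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::pollOnce, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }

    double currentSampleRate() {
        return sampleRate;
    }
}
```

Out-of-range values are ignored so that a bad setting in the backend cannot switch sampling to a nonsensical rate. A real implementation would also need error handling around the fetch, since an exception thrown from a scheduled task stops further runs.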

We're also interested in automatic adaptive sampling but not sure how to fit that in potentially wildly varying use cases.

Regards,

Ronald

Hi Ronald,

Thanks for your valuable suggestions! Here is my input:

A request-response between agent and APM Server. This could be done as a configurable polling frequency (e.g. apm-server.polling.period=sss seconds, default to no polling) with the sampling rate set up via Kibana in the backend. This polling can also be theoretically extended for other config settings that could be changed. Agent can receive a key-value pair for the sampling, like {"agent_sampling": 0.20} for 20%

This is on our roadmap, not specifically for sampling but, as you wrote, for dynamically updating all sorts of configuration options.

An API which can be triggered from the application side like ElasticApm.setSampling(0.20)

Well, this is an option, but I assume it will lack the flexibility you want for the use cases you described. If you encounter errors, would you like to increase sampling of everything? Or if you see a lot of traffic, 95% of which is happening on a single page of your application, would you like to reduce the rate to 0.1, thus making traces from the rest of the application very rare?
The point is that this is tricky, but the idea of adding a flexible API may be interesting and I will bring it up.
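
To illustrate the kind of flexibility I mean, here is a minimal sketch of a sampler that keeps a default rate plus overrides per transaction name. It is purely hypothetical: PerTransactionSampler and its methods are made up for this example and are not part of the agent.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sampler with per-transaction overrides; not an Elastic APM API. */
final class PerTransactionSampler {
    private final Map<String, Double> ratesByName = new ConcurrentHashMap<>();
    private final double defaultRate;

    PerTransactionSampler(double defaultRate) {
        this.defaultRate = defaultRate;
    }

    /** Overrides the rate for one transaction name, e.g. a single very busy page. */
    void setRate(String transactionName, double rate) {
        ratesByName.put(transactionName, rate);
    }

    /** Returns the override for this name, or the default rate if none is set. */
    double rateFor(String transactionName) {
        return ratesByName.getOrDefault(transactionName, defaultRate);
    }

    /** Samples with probability rateFor(name); nextDouble() is uniform in [0, 1). */
    boolean shouldSample(String transactionName) {
        return ThreadLocalRandom.current().nextDouble() < rateFor(transactionName);
    }
}
```

With something like this, the busy page could be turned down to 0.1 while everything else keeps its usual rate, which a single global setSampling(0.20) call cannot express.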

A utility/service that can be executed to dynamically change the sampling rates: apm-server --sample-rate=0.20

Note that there is already an option to manually change the sample rate locally: see the transaction_sample_rate configuration documentation. It is marked dynamic==true, which means that if you use the configuration properties file, you can adjust it on the fly. More on that here.
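
If you want to drive this from your monitoring tools, a small helper that rewrites the properties file might be enough. The following is only a sketch under assumptions: SampleRateFileUpdater is a made-up name, the path of the configuration file is whatever your setup uses, and the key is assumed to be written in the file as transaction_sample_rate, as in the documentation.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical helper that edits the agent's properties file in place. */
final class SampleRateFileUpdater {
    static void setSampleRate(Path configFile, double rate) throws IOException {
        // Reject NaN and anything outside [0, 1].
        if (!(rate >= 0.0 && rate <= 1.0)) {
            throw new IllegalArgumentException("sample rate must be between 0 and 1: " + rate);
        }
        Properties props = new Properties();
        if (Files.exists(configFile)) {
            try (Reader in = Files.newBufferedReader(configFile)) {
                props.load(in); // keep the other settings already in the file
            }
        }
        props.setProperty("transaction_sample_rate", Double.toString(rate));
        try (Writer out = Files.newBufferedWriter(configFile)) {
            props.store(out, "transaction_sample_rate updated by monitoring");
        }
    }
}
```

Note that java.util.Properties does not preserve comments or key order when storing, so if the file is hand-maintained you may prefer to edit the single line instead.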

I hope this helps.
Eyal.

Hi Eyal,

Many thanks for your responses.
The first option is what we would really like to have, as even just our monitoring team could easily manage sampling rates with it, and, if there is some sort of API around it, we could plug it into our existing monitoring tools.

The second option is not really ideal as it is a bit intrusive to code, but we only brought it up in case it was an easier option, say if it could piggyback on the existing library APIs.

As for the third option, my understanding is that this is something set at service launch, and if there is a need to change it, we have to restart the service. Our thinking is that something similar to a systemctl reload, which only reloads the agent config and does not restart the service (because there might be running processes or connections, and we have no time to wait for those to drain), would be a way to react to extreme situations. I hope that makes sense :)

Kind regards,

Ronald

Ronald,

Please see the links I sent. You can edit the configuration file for properties that are marked as dynamic, and the new configuration will take effect without a restart.

Eyal.

Oh, thanks for that, Eyal. For some reason that didn't register with us and we missed it.
We'll start running tests on that one ASAP.

Regards,

Ronald
