Questions about sampling.tail.storage_limit configuration

Kibana version: 8.13.3

Elasticsearch version: 8.13.3

APM Server version: 8.13.3

APM Agent language and version: elastic-apm-agent-1.52.1.jar

apm-server.yml

apm-server:
  sampling.tail:
    enabled: true
    interval: 30s
    storage_limit: 100GB
    policies:
      - sample_rate: 0.5
        trace.outcome: success
      - sample_rate: 1

path.data: /apm-server

Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/centos-root      497G  127G  371G  26% /
tmpfs                         38G     0   38G   0% /run/user/0

[root@xxxxxx xxxxx]# du -sh *
0	apm-server.lock
4.0K	meta.json
87G	tail_sampling

I have three APM Server instances, and the APM agent is configured with server_urls to load balance across them automatically. More than 200 services are currently connected, and that could grow to about a thousand in the future. Should I increase sampling.tail.storage_limit? The data is currently on the root filesystem; do I need to move it to a dedicated data directory? Besides disk usage, what other impact does increasing this setting have? And does this setting have any protective behavior, for example with a 100GB limit, does APM Server stop writing new data once it reaches 90GB?

{"log.level":"warn","@timestamp":"2025-02-28T11:40:29.451+0800","log.logger":"sampling","log.origin":{"function":"github.com/elastic/apm-server/x-pack/apm-server/sampling.(*Processor).Run.func7","file.name":"sampling/processor.go","file.line":508},"message":"received error writing sampled trace: configured storage limit reached (current: 90431103187, limit: 90000000000)","service.name":"apm-server","ecs.version":"1.6.0"}

Yes. If the storage limit is reached, all traces/transactions will be indexed. For Tail Based Sampling to work correctly, the storage limit should not be reached. How much you should adjust it depends on your use case and data patterns; the goal is to no longer see the configured storage limit reached error.
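For illustration only, raising the limit is a one-line change in apm-server.yml (200GB here is an arbitrary example, not a sizing recommendation; make sure the filesystem backing path.data has that much headroom):

apm-server:
  sampling.tail:
    enabled: true
    # Illustrative value: size it so the "configured storage limit reached"
    # warning no longer appears for your workload.
    storage_limit: 200GB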

Could you clarify what is in the root directory?

It will make Tail Based Sampling work correctly and sample all traces: as mentioned above, once the storage limit is reached, APM Server indexes all traces/transactions, which in practice works like applying a sample rate of 1. Starting with 8.17.1 we added a sampling.tail.discard_on_write_failure config option (default false) that lets you customize this behavior.
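A minimal sketch of that option, assuming you are on 8.17.1 or later (the other values are just placeholders):

apm-server:
  sampling.tail:
    enabled: true
    storage_limit: 100GB
    # Available from 8.17.1, default false. When true, trace events that fail to be
    # written to local storage (for example because the storage limit is reached)
    # are discarded instead of being indexed unconditionally.
    discard_on_write_failure: true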

No, APM Server will use up to the configured value and try not to exceed it. It is expected that this value will be slightly exceeded at times due to how the storage calculations are performed.

Please note that across the 8.x releases we worked on improving the performance of the Tail Based Sampling feature and fixing some of its bugs, so there are benefits waiting for you when you upgrade to the latest 8.x release, for example in 8.14.2 (release notes) and 8.17.1 (release notes). These improvements have a positive impact on high-throughput environments like yours.

Thank you for your reply!
However, due to the large number of connected services and their high throughput, the CPU, memory, disk I/O, and disk space overhead of the APM Server was too high, so I had to abandon tail-based sampling.

Can you provide some ballpark numbers for the overhead you observed? I'm assuming you were already running APM Server and decided to try enabling Tail Based Sampling.

Without more understanding of your setup (which is not the purpose of this forum) I can't give a clear answer, but I'm surprised the overhead is big enough to make it preferable to ingest all traces, given that at that scale I'd expect the cost and resources involved in ingesting, storing, and querying the full volume of data to be higher.
I noticed that the policy was only sampling successful traces at 50%, so maybe at that level the overhead is effectively too much (with tail-based sampling, the less you sample, the bigger the relative overhead is). Any reason not to sample successes more aggressively?
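For example, a hypothetical policy set that keeps only a small fraction of successful traces while still keeping everything else (the 0.1 rate is purely illustrative):

apm-server:
  sampling.tail:
    enabled: true
    policies:
      - trace.outcome: success
        sample_rate: 0.1   # illustrative: keep 10% of successful traces
      - sample_rate: 1     # catch-all default policy for everything else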

Three APM Server instances are deployed on three physical machines. With more than 300 machines connected, peak TPM exceeds three hundred thousand.

In Kibana's monitoring UI, we see the following overhead for one APM Server:
CPU up to 1500%
Memory usage of 80 GB (see Common problems | Elastic Observability [8.13] | Elastic, but it's still memory intensive)
100 GB of disk space allocated, and the log often shows:

{"log.level":"warn","@timestamp":"2025-02-28T11:40:29.451+0800","log.logger":"sampling","log.origin":{"function":"github.com/elastic/apm-server/x-pack/apm-server/sampling.(*Processor).Run.func7","file.name":"sampling/processor.go","file.line":508},"message":"received error writing sampled trace: configured storage limit reached (current: 90431103187, limit: 90000000000)","service.name":"apm-server","ecs.version":"1.6.0"}

Machine resources are not sufficient to support a tail-sampling-based APM Server.
As more services come in, I'll have to use head-based sampling.
Tail-based sampling uses weighted random selection, so there's a better chance of collecting and storing slow requests, while head-based sampling is completely random; that's why I initially tried to enable tail-based sampling.
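For reference, a minimal sketch of how head-based sampling would be configured on the Java agent side (the 0.2 rate is just an illustrative value, not a recommendation):

# Set on each service's Java agent, e.g. as an environment variable
# (equivalently via elasticapm.properties or a -Delastic.apm.* system property).
ELASTIC_APM_TRANSACTION_SAMPLE_RATE=0.2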


Thanks for the clarifications!

On storage size, we understand the issue. There is good news in the 9.0 release though, as we will ship performance improvements that lead to lower disk usage, lower memory usage, and overall better throughput.

In your configuration, with horizontally scaled APM Servers, the fix shipped in 8.14.2 should already provide a much better outlook on I/O, throughput, and memory usage. You may want to try again once you upgrade to that version.

So overall your choice to use Head Based Sampling is fair, but keep an eye on 9.0 where we expect much better resource utilization.
