Traces for some transactions not being sampled despite 100% Sampling Rate

Kibana version: 7.17

Elasticsearch version: 7.17

APM Server version: 7.17

APM Agent language and version: Java | 1.32.0

Original install method (e.g. download page, yum, deb, from source, etc.) and version: Everything is containerized running on K8s.

Fresh install or upgraded from other version? Upgraded from an older version.

Is there anything special in your setup? For example, are you using the Logstash or Kafka outputs? Are you using a load balancer in front of the APM Servers? Have you changed index pattern, generated custom templates, changed agent configuration etc.

Description of the problem including expected versus actual behavior. Please include screenshots (if relevant):

For a particular service, traces for a few particular transactions are not getting sampled at all, and hence, the trace timeline is not available for those transactions.

I've already verified that the sampling rate for the service is 100%. Other transactions for the same service are sampled successfully, so the trace timeline is available for those.

I've noticed that the transactions that aren't getting sampled have the following in common:

event.outcome is unknown
http.response.finished is - (Null)

By contrast, for the transactions whose traces are sampled, event.outcome is either success or failure and http.response.finished is true.
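For anyone who wants to pull up the affected documents directly, the two conditions above can be expressed as an Elasticsearch query. The following is a minimal sketch in Python that only builds the search body and does not talk to a cluster. The function name, the example service name "my-service", and the selection of returned fields are illustrative assumptions, and it assumes the default APM field names (processor.event, service.name, transaction.sampled) are in use.

```python
import json


def build_unsampled_query(service_name: str, size: int = 20) -> dict:
    """Build a search body for transactions that match the pattern described above.

    Matches transaction documents of one service where event.outcome is
    "unknown" and http.response.finished is missing (shown as "-" in Kibana).
    """
    return {
        "size": size,
        "query": {
            "bool": {
                "filter": [
                    # Only transaction documents, not spans, errors or metrics.
                    {"term": {"processor.event": "transaction"}},
                    {"term": {"service.name": service_name}},
                    {"term": {"event.outcome": "unknown"}},
                ],
                # A null value in Kibana means the field does not exist.
                "must_not": [
                    {"exists": {"field": "http.response.finished"}},
                ],
            }
        },
        # Return just enough to identify each transaction and see its sampled flag.
        "_source": [
            "trace.id",
            "transaction.id",
            "transaction.name",
            "transaction.sampled",
        ],
    }


if __name__ == "__main__":
    # "my-service" is a placeholder, not a name from this thread.
    print(json.dumps(build_unsampled_query("my-service"), indent=2))
```

The printed body can be sent to the _search API for the APM transaction indices (apm-* by default on 7.x, assuming the index pattern has not been customized). Checking transaction.sampled in the results should confirm whether the agent itself marked these transactions as not sampled.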

Yes, it is a distributed tracing system. The other services are also instrumented with the Elastic APM agent (same version, same language), and those also have sampling at 100%.

What's weird is that within a single service, traces for a few transactions are not getting sampled at all, whereas other transactions work fine.

Provide logs and/or server output (if relevant):
Let me know what information would be required.

Hiya. The "http.response.finished is - (Null)" implies that the response failed to complete. This would suggest some kind of failure happening during the HTTP response. It's not immediately clear to me where that would be happening; it could be the application or the agent. Have you tried turning on DEBUG and looking for events or traces related to these failures?

I haven't enabled debug logs in the agent, but I was on a much older version (1.15.0) earlier and upgraded to 1.32.0 to see whether it was an issue with the agent. The same thing happens with both versions.

As I said, you need to look for some kind of failure in the transactions; I have no way of detecting that from the description. It could be a connection failure, thread death, or some kind of concurrency bug, but the first step would be turning on DEBUG.

I've enabled the DEBUG logs for the APM agent. What kind of failures should I look for? I do see a few read timeouts while writing to the APM Server, but those are very few, and I see them for other services as well, for which the trace timeline is available for all transactions.

Actually, since this is an absence of something that should be there, I think you'll need to set the log level to TRACE (but that generates a LOT of output). Then look for the messages that have http.response.finished set to null, trace the transaction IDs back to where each transaction was started, and follow it through. Compare that against a transaction that is valid and look for differences in thread activity, in particular what happens to the threads.
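Because TRACE output is large, it may help to filter the agent log down to one transaction ID at a time before comparing. The following is a rough Python sketch of that idea, not an official tool. It assumes each log line looks like "<date> <time> [<thread>] <LEVEL> <logger> - <message>" and that the transaction ID appears somewhere in the line; both are assumptions that should be checked against the actual log output. The function names are made up for illustration.

```python
import re

# Assumed line layout: "<date> <time> [<thread>] <LEVEL> <rest of line>".
# Adjust the pattern if the agent log format differs.
LINE_RE = re.compile(
    r"^\S+ \S+ \[(?P<thread>[^\]]+)\] +(?P<level>[A-Z]+) +(?P<rest>.*)$"
)


def timeline(log_lines, transaction_id):
    """Return (thread, level, rest) tuples for lines mentioning the ID, in log order."""
    events = []
    for line in log_lines:
        if transaction_id not in line:
            continue
        line = line.rstrip("\n")
        match = LINE_RE.match(line)
        if match:
            events.append(
                (match.group("thread"), match.group("level"), match.group("rest"))
            )
        else:
            # Keep lines that do not fit the assumed layout so nothing is lost.
            events.append(("<unparsed>", "", line))
    return events


def threads_seen(log_lines, transaction_id):
    """Return the distinct threads that logged about the ID, in first-seen order."""
    seen = []
    for thread, _level, _rest in timeline(log_lines, transaction_id):
        if thread not in seen:
            seen.append(thread)
    return seen
```

Running timeline() once for a transaction that ended with http.response.finished null and once for a healthy one, then reading the two lists side by side, gives a compact view of where the two diverge. threads_seen() shows whether the failing transaction moved across threads, which is the thread activity worth looking at first.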
