Transaction samples are not shown in APM UI

Kibana version : 6.8.0

Elasticsearch version : 6.8.0

APM Server version : 6.8.0

APM Agent language and version : JAVA apm-agent-attach:1.17.0, apm-agent-api:1.17.0

Java version : openjdk:11

We have a Spring Boot application with the following APM config:

service_name=app-name
server_urls=http://apm-server-ops.app.com
transaction_sample_rate=0.05
application_packages=com.app.package
ignore_urls:/actuator*, /swagger-ui*
enable_log_correlation=true

The application is lightly loaded: it processes 1-2 requests per 30 seconds, each creating an entity that is persisted in the DB. The requests come from another (upstream) application. Once the entity is created, the application emits an event to ActiveMQ, which is picked up by five or six ActiveMQ listeners that live in the same application.

The problem is that although we set transaction_sample_rate=0.05, we don't have any samples in the APM UI,

neither for HTTP requests (Filter by type:request) nor for ActiveMQ transactions (Filter by type:messaging)

First, I thought the problem was caused by this warning in the logs:

2020-09-17 14:15:30,649 [DefaultMessageListenerContainer-2] WARN  co.elastic.apm.agent.impl.transaction.Span - 
Max spans (500) for transaction 'JMS RECEIVE from queue Consumer.SmsServiceImpl.VirtualTopic.****CreatedEvent' 00-6efa3ce85265c5f6d3f8d53feed3d11f-acdeb259e07a8286-00 (5084b178) has been reached. 
For this transaction and possibly others, further spans will be dropped. See config param 'transaction_max_spans'.

but from my understanding, even if we were losing SOME spans (as the warning says), we would still receive the ones that are not lost, so we would see at least some request samples

So I tried increasing transaction_max_spans to 1000, just to test it. The warnings are almost gone now.

But we are still getting 0 transaction samples.

Now I am thinking in this direction:
the upstream system has enable_log_correlation=true together with transaction_sample_rate=0
Could that lead to 0 transaction samples in APM for our (downstream) application?

Can someone please help me understand how this works and which direction to look in?

Hi @Svetlana_Nikitina,

Your application handles very few transactions. With a 0.05 sample rate and about 4 requests per minute, most of your transactions aren't sampled at all: on average only about 12 of them will be sampled per hour.
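To make the arithmetic behind that figure explicit, here is a minimal sketch. The class and method names are made up for illustration, and the result is an average, not a guarantee for any given hour.

```java
public class SamplingEstimate {

    /** Expected (average) number of sampled transactions per hour. */
    static double expectedSampledPerHour(double requestsPerMinute, double sampleRate) {
        return requestsPerMinute * 60 * sampleRate;
    }

    public static void main(String[] args) {
        // Figures from this thread: about 4 requests/minute, transaction_sample_rate=0.05
        System.out.println(expectedSampledPerHour(4, 0.05));
    }
}
```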

What "non-sampled" means is that for those transactions, we only capture the transaction duration, which could be very short if the application entry point only sends a message to an asynchronous queue.

When a transaction is sampled, we capture all spans (requests to DB, delegation to another service, rpc calls, ...) up to the limit of 500, which you seem to reach quite often.

Sample rate < 1.0 is mostly used to optimize storage/bandwidth/overhead costs at the price of accuracy. While it works quite well when the number of transactions is large, with a low number you will get really inaccurate measurements.

Thus I suggest the following:

  • set transaction_sample_rate to 1.0 (or just remove the setting, as 1.0 is the default value)
  • set span_min_duration to filter out short-lived spans, which will help reduce the number of spans per transaction to a reasonable level (i.e. one that Kibana is able to display)
  • set transaction_max_spans as a last resort if the span count is still too high.

Also, enable_log_correlation only makes the agent insert the active transaction ID into the logs; see the documentation for details.

wow! thanks for such a detailed answer!
yes, it seems I didn't have a very clear understanding of the configuration :frowning:
I will try the suggested solution

okay, i did some tests
i've updated our config with:

    transaction_sample_rate=1
    enable_log_correlation=true
    transaction_max_spans=1000
    span_min_duration=500ms

this reduced a bit the number of warnings (i might increase span_min_duration even more, as we still have some in the evening, when the load on the application increases)
however this didn't fix the problems with missing samples :frowning:

what i did next - i tried making some requests directly from Postman, bypassing our upstream application, which has transaction_sample_rate=0
and surprisingly ALL these calls were sampled and displayed correctly in APM

with this said - i am still blaming the upstream application, which calls our application with sampling disabled.
are you sure that with a flow like the one below we should still get samples in application B:

app A (transaction_sample_rate=0 enable_log_correlation=true) ----> appB (transaction_sample_rate=1 enable_log_correlation=true)

do you have any ideas why it might happen?

I could try playing around with the upstream application (setting its sample rate > 0), but the problem is that it's REALLY highly loaded and we don't want ANY performance side effects. That is why it was intentionally set to 0: APM is used there only for adding trace.id to the logs, so that we can search the logs by trace.id across these 2 applications

upd:
i've set transaction_sample_rate=0.5 in our upstream app (not very desired config though) and voila - i've started to get samples in our downstream app :expressionless:

is this expected behaviour?

Yes, that's due to how distributed tracing works here:

  • upstream application starts a distributed transaction (aka "root", because it has no parent), and propagates the transaction ID + some metadata through HTTP layers to the downstream application
  • downstream application will sample the transaction using the upstream transaction as parent if the upstream transaction was sampled.

In other words, the upstream application decides what gets sampled or not globally, and we don't support different rates (that would likely create complications to compute weighted metrics).
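As a rough illustration of this rule, the sketch below parses a trace-context string of the form `version-traceid-parentid-flags` and lets a downstream service inherit the upstream decision. It assumes the identifier printed in the warning from the first post follows the W3C `traceparent` layout, where the lowest bit of the flags field means "sampled". The class and method names are hypothetical and this is a toy model, not the agent's actual code.

```java
public class TraceParentFlags {

    /** Returns true if the "sampled" bit (lowest bit of the flags field) is set. */
    static boolean isSampled(String traceParent) {
        String[] parts = traceParent.split("-");
        if (parts.length != 4) {
            throw new IllegalArgumentException(
                "expected version-traceid-parentid-flags but got: " + traceParent);
        }
        int flags = Integer.parseInt(parts[3], 16);
        return (flags & 0x01) != 0;
    }

    /**
     * Toy model of the downstream decision: when a parent context arrives,
     * its sampling decision is inherited; only a root transaction (no parent)
     * applies the local sample rate. 'random' is a value in [0, 1).
     */
    static boolean downstreamSampled(String incomingTraceParent, double localSampleRate, double random) {
        if (incomingTraceParent != null) {
            return isSampled(incomingTraceParent);
        }
        return random < localSampleRate;
    }

    public static void main(String[] args) {
        // Identifier copied from the warning in the first post; note the trailing "00" flags
        String fromWarning = "00-6efa3ce85265c5f6d3f8d53feed3d11f-acdeb259e07a8286-00";
        System.out.println("parent sampled: " + isSampled(fromWarning));
        System.out.println("downstream, rate 1.0, with parent: " + downstreamSampled(fromWarning, 1.0, 0.3));
        System.out.println("downstream, rate 1.0, no parent:   " + downstreamSampled(null, 1.0, 0.3));
    }
}
```

Under that assumption, the trailing `00` in the warning would already indicate a non-sampled parent, which matches what was observed: direct calls without a parent are sampled at the local rate, while calls arriving from the upstream application are not.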

However, I just had confirmation that log correlation is independent from sampling, thus even if transactions are not sampled, you will have all the correlation IDs in the logs for both applications.

That means that if you disable sampling, or use a low value, you will be able to correlate by processing the logs with the IDs, but won't benefit from Kibana UI. Depending on what you are trying to diagnose here, that might be enough as a first step.
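Correlating by processing the logs could look roughly like the sketch below, which groups log lines from both applications by trace ID. The `trace.id=<32 hex chars>` layout and the sample lines are assumptions made up for illustration; the real layout depends on your logging pattern.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TraceLogGrouper {

    // Hypothetical layout: each correlated line carries "trace.id=<32 hex chars>"
    private static final Pattern TRACE_ID = Pattern.compile("trace\\.id=([0-9a-f]{32})");

    /** Groups log lines by trace ID; lines without an ID are skipped. */
    static Map<String, List<String>> groupByTraceId(List<String> lines) {
        Map<String, List<String>> byTrace = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = TRACE_ID.matcher(line);
            if (m.find()) {
                byTrace.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(line);
            }
        }
        return byTrace;
    }

    public static void main(String[] args) {
        // Made-up sample lines from two applications sharing one trace
        List<String> lines = List.of(
            "appA INFO trace.id=6efa3ce85265c5f6d3f8d53feed3d11f sending create request",
            "appB INFO trace.id=6efa3ce85265c5f6d3f8d53feed3d11f entity persisted",
            "appB INFO no active transaction");
        groupByTraceId(lines).forEach(
            (id, group) -> System.out.println(id + " -> " + group.size() + " lines"));
    }
}
```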

okay, now it's clear
we will try to adjust the sampling rate of our upstream app based on this
thank you