Elastic APM dropping spans from large distributed transactions

We have multiple messaging-based microservices, where a transaction is started from the first incoming message and then forked into thousands of spans (messages to different microservices).
We have started seeing that spans are being dropped.

I understand there is a limit on the number of spans that can be captured in a transaction. Is there a specific reason for it? Ideally we want to capture all spans and view them in the Kibana UI.

We also noticed that Kibana shows the spans for a short time and then drops them all from the timeline, warning that the number of spans has exceeded the limit.

We want to understand why the APM agent/server has this span limit and how to overcome it.

Versions:
elastic-apm-agent 1.11.0
Kibana 7.4.0

Hi and thanks for your question!

You can increase the limit of the Java agent with this config option: https://www.elastic.co/guide/en/apm/agent/java/current/config-core.html#config-transaction-max-spans
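For example, you can set it in `elasticapm.properties` (or as a `-Delastic.apm.…` system property or environment variable). The value below is just an illustration, not a recommendation:

```properties
# elasticapm.properties — raise the per-transaction span cap
# (spans beyond this number are dropped by the agent; pick a value
# that matches your expected fan-out and memory budget)
transaction_max_spans=50000
```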

Note that there's also a limit in Kibana: xpack.apm.ui.maxTraceItems which defaults to 1000.
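That setting goes in `kibana.yml`; the value shown here is illustrative, and very large values can make the trace timeline slow to render:

```yaml
# kibana.yml — raise the number of trace items the APM UI will fetch/render
xpack.apm.ui.maxTraceItems: 50000
```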

Hope this helps!

Thanks for the reply.

We did increase the limit on the Java agent, but it seems it is again capped at 36,000 spans (if I am not mistaken).

The Kibana UI is not even showing 1,000 trace items; it just shows 2 or 3 line items, but we are going to try increasing the limit. Is there a hard limit on xpack.apm.ui.maxTraceItems?

There's no hard limit in the agent and I don't think there's one in the UI. However, if the APM Server can't keep up with all the spans, some may be dropped in the agent to limit memory usage. You can increase the queue size via https://www.elastic.co/guide/en/apm/agent/java/current/config-reporter.html#config-max-queue-size.
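The reporter queue size is also configurable in `elasticapm.properties` (or via system property/environment variable). The value below is an example only; when the queue is full, new events are dropped rather than blocking your application:

```properties
# elasticapm.properties — buffer between the agent and the APM Server;
# a larger queue tolerates bursts at the cost of more agent memory
max_queue_size=4096
```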

May I ask what your use case is and why you want to visualize 36k spans for a single transaction?

We have a distributed file-processing use case; let me describe the steps.

  1. One of the services receives a notification to start the file processing. This service opens the APM transaction and starts reading the file. It then forks the work into various traces/spans by sending multiple notifications to other microservices.
  2. The other microservices consume these messages and further fork the transaction into multiple spans by making HTTP calls, DB calls, or sending notifications to other services.

We anticipate that our average file will generate close to 30K spans, and a few files will be above 50K.

Hope this helps you understand the use case.