We are instrumenting our Spring Boot services using the latest Elastic APM Agent, and in Kibana the traces are all grouped by their parent spans. Unfortunately, this means almost all traces end up grouped under "ServletWrappingController", which is not very helpful. Is there a way to rename the parent span so the grouping is more meaningful?
Some of our services are being instrumented with the OpenTelemetry agent, and it allows the parent span to be renamed. This helps us group traces more logically based on the API and method being called.
The OpenTelemetry docs acknowledge and address this:
The state described above has one significant problem. Observability backends usually aggregate traces based on their root spans. This means that ALL traces from any application deployed to a Servlet container will be grouped together, because their root spans will all have the same name based on the common entry point. In order to alleviate this problem, instrumentations for specific frameworks, such as Spring MVC here, update the name of the span corresponding to the entry point. Each framework instrumentation can decide what the best span name is based on framework implementation details, while of course still adhering to OpenTelemetry semantic conventions.
We do pretty much the same. For Spring MVC we also set the transaction name based on the MVC controller that handles the request. If it's a custom or unsupported framework, you can use the public API to set the name of the transaction.
Which framework are you using? Possibly it's not much effort to add auto-instrumentation for it.
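For reference, renaming the current transaction via the agent's public API (the co.elastic.apm:apm-agent-api dependency) looks roughly like this; the handler class and naming scheme below are made up for illustration:

```java
import co.elastic.apm.api.ElasticApm;
import co.elastic.apm.api.Transaction;

public class JsonRpcHandler {

    public void handle(String rpcMethod) {
        // Grab the transaction started by the agent's auto-instrumentation
        // (a noop transaction is returned if the agent is not active).
        Transaction transaction = ElasticApm.currentTransaction();

        // Use a low-cardinality name, e.g. the RPC method rather than raw URLs or IDs.
        transaction.setName("JsonRpc " + rpcMethod);

        // ... actual request handling ...
    }
}
```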
I think the issue is that we give precedence to Spring controllers/HandlerMethods as they are usually more descriptive than the DispatcherServlet that invokes them. But in this case, ServletWrappingController is invoking another servlet whose name is even more appropriate.
We could add a special case for ServletWrappingController. Alternatively, if you don't want any transactions named after Spring MVC controllers, you can disable the spring-mvc instrumentation.
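If you go that route, the configuration would look roughly like the following; the disable_instrumentations option is documented for the Java agent, but the spring-mvc group name should be double-checked against the docs for your agent version:

```properties
# elasticapm.properties (or -Delastic.apm.disable_instrumentations=... as a JVM flag)
disable_instrumentations=spring-mvc
```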
Thanks for the quick reply and PR @felixbarny. These are JSON-RPC calls being handled by our custom library, so the meaningful values will be in the request body. However, I think your change is still useful for other services.
We are actually using the opentracing-api instead of apm-agent-api directly since there are shims available from both Elastic APM and OpenTelemetry for the OpenTracing Tracer.
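For context, wiring up the OpenTracing bridge looks roughly like this (a sketch assuming the Elastic apm-opentracing artifact and opentracing-util; the OpenTelemetry shim is swapped in the same way by registering a different Tracer implementation):

```java
import co.elastic.apm.opentracing.ElasticApmTracer;
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

public class TracingBootstrap {

    public static void init() {
        // Bind the Elastic APM OpenTracing bridge as the global tracer.
        Tracer tracer = new ElasticApmTracer();
        GlobalTracer.registerIfAbsent(tracer);
    }

    public static void handleRpc(String rpcMethod, Runnable work) {
        // The operation name given here becomes the transaction name in the backend.
        Span span = GlobalTracer.get().buildSpan("JsonRpc " + rpcMethod).start();
        try {
            work.run();
        } finally {
            span.finish();
        }
    }
}
```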
We are forced to use a mixed setup because we had scaling issues using the Elastic APM Java agent on our very high-RPS services (40,000+ RPS). Even with a low sample rate, the high-level transaction data is still reported to the APM server, and that slowed things down significantly (see https://github.com/elastic/apm/issues/104 and https://github.com/elastic/apm/issues/151).
Switching to the OTel or Jaeger agent and exporting to the APM server's Jaeger endpoint works, but it is a hacky solution. I'm not sure if things have changed recently, but if it were possible for the Java agent to only send sampled transactions instead of all transactions, that would make things scale much more easily.
We'll add some experimental options to calculate metrics based on transactions in the upcoming 7.11 release. Be sure to try that out and give us feedback.
Can you elaborate on that a bit? What exactly slowed down? Did you experience higher latencies in your application endpoints, or did you observe the effect only on the ingestion pipeline (agent -> APM Server -> ES)? Did you try a VERY low sample rate (e.g. 0.0 - 0.001) to validate that the overhead is indeed related to ingestion and not to the instrumentation/tracing overhead?
We had the service's sample rate set at 0.01% and there were no issues there. The slowdown was in the ingestion pipeline, because billions of events were being created every day for the top-level transaction information. Issue 151 and the subsequent discussion provide a lot of detail: https://github.com/elastic/apm/issues/151. It deals with the Node.js agent, but we saw the same behavior with the Java agent.
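(For reference, that 0.01% corresponds to a value of 0.0001 for the Java agent's transaction_sample_rate setting; shown here in a properties file, with the exact option name to be confirmed against the agent's configuration docs:)

```properties
# elasticapm.properties – sample roughly 1 in 10,000 transactions
transaction_sample_rate=0.0001
```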
The APM UI shows the number of transactions in each latency bucket, including ones that weren't sampled, and it also gives overall latency numbers. We don't really need this information since we have other tools like Prometheus to capture histogram buckets of request latencies. A random sample will also approximate the correct latency distribution in APM without needing to record data from all transactions.
Thanks for the details. It validates our efforts towards not sending unsampled transactions (relying on Elasticsearch's new histogram data type instead) and towards smarter, tail-based sampling.
One thing I am still missing is whether or not you observed overhead in your application's endpoint latencies with the higher sampling rate, or any other noticeable overhead on CPU or memory (on the agent side).
I don't think there was any noticeable overhead on the application with sampling at under 1%. We did not attempt sampling at a higher rate because it would cause issues with ingestion.
Is this something that can be enabled on the agent right now?