After going live with Elastic APM for several weeks and getting lots of good insights, one of our recurring issues is getting better visibility into the gaps between spans, like in the graph below:
The gap between the two JDBC calls cannot be explained from the Kibana tools alone; there is not enough information to say what caused it. Our investigation with other tools showed that the connection pool was exhausted, so the request had to wait until a connection became available before continuing. The application makes two consecutive JDBC calls inside a single method.
Is there something we could have done to improve visibility of that gap? Or do we need to explicitly code for similar scenarios? (It's not clear to us how to embed a span manually in the middle of an auto-instrumented class, though.)
For this specific issue, it probably makes sense that we instrument javax.sql.DataSource#getConnection() (I assume this is the method which took long in your case).
Which tools did you use, and how did you find out about that?
In case you got that from JMX metrics, you can now include those via the capture_jmx_metrics config option.
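For example, if the pool exposes its state over JMX, a sketch of what that could look like (the MBean object name and attribute names here are placeholders; check your JMX console for what your pool actually registers):

```
# elasticapm.properties (illustrative only)
capture_jmx_metrics=object_name[com.example:type=ConnectionPool,name=*] attribute[ActiveConnections:metric_name=pool_active] attribute[WaitingThreads:metric_name=pool_waiting]
```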
We currently don't offer that, but we could overlay metrics in the transactions view to make it easier to correlate the captured metrics with traces.
You basically have three options here:
1. Programmatically: get the current span (possibly created by the auto-instrumentation) and create a child span.
Advantage: the most flexible way; you can add custom labels to the span. (See the first sketch after this list.)
2. Declaratively: annotate an arbitrary method with @CaptureSpan.
Advantage: easier, more robust (there's nothing you can do wrong, like forgetting to end a span or close a scope), and more performant than the programmatic way. (See the second sketch.)
3. Via configuration: use trace_methods to specify additional methods to instrument.
Advantage: you don't need to modify the source code. (See the third sketch.)
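Here's a minimal sketch of the programmatic option, using the agent's public API (the apm-agent-api dependency). Wrapping DataSource#getConnection() and the span/label names are just illustrative choices, not something the agent provides:

```java
import co.elastic.apm.api.ElasticApm;
import co.elastic.apm.api.Scope;
import co.elastic.apm.api.Span;

import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.SQLException;

public class TracedConnectionProvider {

    private final DataSource dataSource;

    public TracedConnectionProvider(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Creates a child of whatever span is currently active (for example the
    // span created by the auto-instrumentation), so the pool wait time shows
    // up as its own span in the waterfall.
    public Connection getConnection() throws SQLException {
        Span span = ElasticApm.currentSpan().startSpan("db", "connection-pool", "getConnection");
        span.setName("DataSource#getConnection");
        span.addLabel("pool", "main"); // custom label, purely illustrative
        try (Scope scope = span.activate()) {
            return dataSource.getConnection();
        } catch (Exception e) {
            span.captureException(e);
            throw e;
        } finally {
            span.end();
        }
    }
}
```

Activating the span (the try-with-resources Scope) makes anything traced inside getConnection() a child of this span, and ending it in the finally block guarantees it is closed even if acquiring a connection fails.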
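The declarative option only needs the apm-agent-api annotations on the classpath; the class and method below are placeholders:

```java
import co.elastic.apm.api.CaptureSpan;

import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.SQLException;

public class ConnectionHelper {

    // The agent wraps this method in a span (named after the annotation value)
    // whenever it is called within an active transaction.
    @CaptureSpan(value = "acquire-connection", type = "db")
    public Connection acquireConnection(DataSource dataSource) throws SQLException {
        return dataSource.getConnection();
    }
}
```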
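And a sketch of the configuration option: trace_methods takes a comma-separated list of method matchers (wildcards are supported) and can be set in elasticapm.properties, as a -Delastic.apm.trace_methods system property, or via the ELASTIC_APM_TRACE_METHODS environment variable. The package and class names below are placeholders for your own code:

```
# elasticapm.properties (illustrative matchers only)
trace_methods=com.example.dao.*#getConnection, com.example.billing.InvoiceService#chargeCustomer
```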
Thanks for these insightful answers. They have set us on a new direction, and we'll start trying the approaches you've outlined. That list actually gave me an idea to try on a different, although similar, case involving some OSS Netflix components we use (Feign/Ribbon).
I have to get back to you on how the team found out about the connection pool issue causing that wait time between JDBC calls. I do know we have a Prometheus/Grafana stack for metrics (it's been there for years), so it's possible a JMX exporter was involved. I'll have to confirm with them, though.
The scenario was that we suddenly had a spike of around 55K tpm for a short period, so I'm not surprised that it overwhelmed our available connections.