Good(?) news:

- Using `elastic.apm.disable_instrumentations=concurrent,executor` partially fixes the issue (see the config sketch after this list). Spans are no longer deactivated with a delay and the warning is gone, but I still have some _render_ spans not being registered. Maybe those two are different issues?
  - some render spans are not being captured properly
  - the `concurrent`/`executor` instrumentations are causing span references to be left around and manipulated on other threads/transactions
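For completeness, that setting is just the standard agent option; I'm applying it roughly like this (either the properties entry or the JVM flag, not both):

```properties
# elasticapm.properties
disable_instrumentations=concurrent,executor

# or, equivalently, as a JVM flag on the app server:
# -Delastic.apm.disable_instrumentations=concurrent,executor
```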
Something else I noticed: every time the issue (delayed deactivation) happened, the first part of the internal id (visible when I call `.dump()` on the span/transaction object in Groovy):

```
co.elastic.apm.api.SpanImpl@39eec638 span='SimpleGrailsController' 00-6e5df9c349ec25a7cb1f13e3f538def9-ca62deb08bf83a8a-01
```

which in this example is `6e5df9c349ec25a7cb1f13e3f538def9` (and corresponds to the `trace.id` on the raw doc in Elasticsearch), is all zeroes, like in:

```
co.elastic.apm.api.SpanImpl@433b89cc span='' 00-00000000000000000000000000000000-f57c7c7a7f92ad90-00
```

and the span name is always an empty string too.
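For context, this is roughly how those dump lines get produced (a minimal sketch, not my exact logging code; it assumes the agent's public API is on the classpath and relies on Groovy's built-in `Object.dump()`):

```groovy
import co.elastic.apm.api.ElasticApm
import co.elastic.apm.api.Span
import groovy.util.logging.Slf4j

@Slf4j
class ApmSpanDumper {

    // Logs the span the agent considers "current" on the calling thread.
    // Groovy's Object.dump() prints the object's fields, which is where the
    // "00-<trace.id>-<span.id>-<flags>" string in the lines above comes from.
    static void dumpCurrentSpan() {
        Span current = ElasticApm.currentSpan()
        log.info "currentSpan=${current}::${current.dump()}"
    }
}
```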
I can reproduce this specific pattern by pressing F5 on the front-end service twice (without waiting for the first request to finish). Whenever I do this, a new issue comes into play: pretty much all future transactions with the same name (or some other shared characteristic) get stuck on this ghost transaction and don't appear in Elasticsearch/Kibana at all (they do seem to be arriving at the `apm-server` though, based on its logs, but that could just be metrics).
```
/*around here I did a quick double-F5, and the second one got into ghost-mode*/
/*note that the transaction.id leaks from one to the other*/
[2019-03-15T21:29:59.856-0300] currentSpan=co.elastic.apm.api.SpanImpl@6d2c961d::<co.elastic.apm.api.SpanImpl@321e736f span='SimpleGrailsController' 00-8b67581e24992dc27b0e0471a5e5c5ca-2db20472856b83d8-01>
[2019-03-15T21:30:00.399-0300] currentSpan=co.elastic.apm.api.SpanImpl@2947804c::<co.elastic.apm.api.SpanImpl@76e0ee75 span='' 00-00000000000000000000000000000000-2db20472856b83d8-00>
/*each one of these logs is one F5 after it was triggered, 12/15 ghost ones with no trace.id, and a lot of duplicate/leaked transaction.ids */
[2019-03-15T21:30:04.004-0300] currentSpan=co.elastic.apm.api.SpanImpl@3de838ea::<co.elastic.apm.api.SpanImpl@3462bb14 span='SimpleGrailsController' 00-d13d0064f47715d97ab4ce737d71253e-d9c2c8baa610f97a-01>
[2019-03-15T21:30:09.480-0300] currentSpan=co.elastic.apm.api.SpanImpl@40555a5e::<co.elastic.apm.api.SpanImpl@493cf282 span='' 00-00000000000000000000000000000000-d9c2c8baa610f97a-00>
[2019-03-15T21:30:13.715-0300] currentSpan=co.elastic.apm.api.SpanImpl@61417412::<co.elastic.apm.api.SpanImpl@5216f548 span='' 00-00000000000000000000000000000000-d9c2c8baa610f97a-00>
[2019-03-15T21:30:16.085-0300] currentSpan=co.elastic.apm.api.SpanImpl@4a706651::<co.elastic.apm.api.SpanImpl@440f546f span='' 00-00000000000000000000000000000000-d9c2c8baa610f97a-00>
[2019-03-15T21:30:18.613-0300] currentSpan=co.elastic.apm.api.SpanImpl@2674300::<co.elastic.apm.api.SpanImpl@1f7425f2 span='' 00-00000000000000000000000000000000-d9c2c8baa610f97a-00>
[2019-03-15T21:30:20.714-0300] currentSpan=co.elastic.apm.api.SpanImpl@3e84a30c::<co.elastic.apm.api.SpanImpl@52b84049 span='' 00-00000000000000000000000000000000-d9c2c8baa610f97a-00>
[2019-03-15T21:30:22.301-0300] currentSpan=co.elastic.apm.api.SpanImpl@3fc080ca::<co.elastic.apm.api.SpanImpl@58e46a7 span='' 00-00000000000000000000000000000000-d9c2c8baa610f97a-00>
[2019-03-15T21:30:22.756-0300] currentSpan=co.elastic.apm.api.SpanImpl@36bb420b::<co.elastic.apm.api.SpanImpl@14e0dcba span='' 00-00000000000000000000000000000000-d9c2c8baa610f97a-00>
[2019-03-15T21:30:26.279-0300] currentSpan=co.elastic.apm.api.SpanImpl@60f0f959::<co.elastic.apm.api.SpanImpl@2caa4e27 span='' 00-00000000000000000000000000000000-d9c2c8baa610f97a-00>
[2019-03-15T21:30:26.438-0300] currentSpan=co.elastic.apm.api.SpanImpl@5a777ed5::<co.elastic.apm.api.SpanImpl@4f14e445 span='SimpleGrailsController' 00-a1da5e823e3be5b560fbe29d1e5c1a10-e2eccdd0f465fea5-01>
[2019-03-15T21:30:26.641-0300] currentSpan=co.elastic.apm.api.SpanImpl@234c8568::<co.elastic.apm.api.SpanImpl@30e9475e span='' 00-00000000000000000000000000000000-e2eccdd0f465fea5-00>
[2019-03-15T21:30:27.017-0300] currentSpan=co.elastic.apm.api.SpanImpl@126c400a::<co.elastic.apm.api.SpanImpl@760ff824 span='' 00-00000000000000000000000000000000-e2eccdd0f465fea5-00>
[2019-03-15T21:30:27.108-0300] currentSpan=co.elastic.apm.api.SpanImpl@1761cba2::<co.elastic.apm.api.SpanImpl@74b11af8 span='SimpleGrailsController' 00-21f701730133903e69d13316d8aac1e0-9af48d079a2699e1-01>
[2019-03-15T21:32:15.840-0300] currentSpan=co.elastic.apm.api.SpanImpl@28d01e71::<co.elastic.apm.api.SpanImpl@5dcea18c span='' 00-00000000000000000000000000000000-9af48d079a2699e1-00>
[2019-03-15T21:34:24.377-0300] currentSpan=co.elastic.apm.api.SpanImpl@79dffda9::<co.elastic.apm.api.SpanImpl@4573710 span='' 00-00000000000000000000000000000000-9af48d079a2699e1-00>
```
I have reproduced this behavior (ghost transactions stuck "forever") regardless of my custom Apache HTTP async instrumentation, custom span creation, or render instrumentation. Leaving only `public-api` and `servlet-api` enabled (for the simplest shell of a transaction plus custom spans) still gave me a broken scenario from time to time.
I tried setting `log_level=TRACE` and could see that a correct transaction has log entries throughout marking when it starts and ends, but a ghost one doesn't (it only has span starts/ends).
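In case it helps reproduce the comparison, that extra logging is just the agent's standard `log_level` option bumped up, e.g. as a JVM flag:

```
-Delastic.apm.log_level=TRACE
```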
Please let me know if there's anything else I can provide to help understand (and hopefully fix) this.