Java GRPC Client not ending spans and not reporting to APM

Kibana version: 7.13.4

Elasticsearch version: 7.13.4

APM Server version: 7.13.4

APM Agent language and version: Java 8, apm-agent-attach, apm-agent-api, elastic-apm-agent 1.25.0

Browser version: Chrome

Original install method (e.g. download page, yum, deb, from source, etc.) and version: Maven

Description of the problem including expected versus actual behavior. Please include screenshots (if relevant):

Steps to reproduce:

  1. When running in a custom Transaction, for every grpc client request a span is created, but never ended -> it will not get logged to the apm server.
  2. @Traced work as expected

In the following code, myTracedMethod and myOtherTracedMethod will get reported, the grpc request description will not.

final TransactionServiceGrpc.TransactionServiceBlockingStub blockingStubServer = TransactionServiceGrpc.newBlockingStub(channelServer);

        final Transaction transaction = ElasticApm.startTransaction();
        try(final Scope scope = transaction.activate()){
            transaction.setName("TestApmGrpc");
            myTracedMethod();
            blockingStubServer.description(GrpcService.DescriptionRequest.newBuilder().build());
            myOtherTracedMethod();
        }
        catch (Throwable t){
            transaction.captureException(t);
            throw t;
        }
        finally { transaction.end(); }

The complete DEBUG log is attached.

Provide logs and/or server output (if relevant):

Hi and welcome to our forum :wave:

Adding @Traced on what solved the issue?

Hi, adding @Traced is not fixing it, but normal methods annotaed with it are reported as expected. I added them for the log, so that one can see that the methods are finished and other spans are working correctly.

Ahh, I see. I assume we are talking about the transaction.TransactionService/Description span.

In the debug log I can see that the io.grpc.ClientCall.Listener#onClose instrumentation is applied to two listener types. This is the instrumentation that takes care of ending the client spans. Are you using a custom listener as well? Can you verify that this listener API is actually being called for the traced client calls? If so, can you confirm that onClose() is being invoked with a non-null io.grpc.Status?

I don't use any custom listeners (in this test), but grpc is running io.grpc.internal.DelayedClientCall$DelayedListener::onClose which calls io.grpc.stub.ClientCalls$UnaryStreamToFuture::onClose, both with Status.OK. Everything running in a io.grpc.stub.ClientCalls::blockingUnaryCall.

I generated a stacktrace:

java.lang.Exception
	at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:524)
	at io.grpc.internal.DelayedClientCall$DelayedListener$3.run(DelayedClientCall.java:463)
	at io.grpc.internal.DelayedClientCall$DelayedListener.delayOrExecute(DelayedClientCall.java:427)
	at io.grpc.internal.DelayedClientCall$DelayedListener.onClose(DelayedClientCall.java:460)
	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:553)
	at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:68)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:739)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:718)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
	at io.grpc.stub.ClientCalls$ThreadlessExecutor.waitAndDrain(ClientCalls.java:740)
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:149)
	at transaction.TransactionServiceGrpc$TransactionServiceBlockingStub.description(TransactionServiceGrpc.java:156)
	at apmTests.Main.main(Unknown Source)

This is helpful.

Your stack trace includes:

io.grpc.internal.DelayedClientCall$DelayedListener.onClose

and your debug log contains:

... Method match for instrumentation ListenerClose: ... matches public void io.grpc.internal.DelayedClientCall$DelayedListener.onClose(io.grpc.Status,io.grpc.Metadata)

Which means that this listener is instrumented as expected by ListenerClose, so when the onClose() method exits after being called with a non-null Status, it should end the client span.
If you can setup a debug environment with our agent code and debug co.elastic.apm.agent.grpc.GrpcHelper#exitClientListenerMethod(), you may be able to find out what's not working as expected.

  1. In GrpcHelper clientCallStartEnter is called with what seems like correct parameters, the remove call inside returns null though. clientCallSpans contains a single item, the pending call.
  2. enterClientListenerMethod is called three times afterwards, with the DelayedListener as a parameter. clientCallListenerSpans is empty, the get returns null. The last time it is called is during the onClose of the DelayedListener (onEnter instrumentation).
  3. onClose of the DelayedListener runs
  4. Program ends

exitClientListenerMethod doesn't seem to run at any time.

ClientCallImplInstrumentation$ListenerClose.onExit is called, but the span passed is null, so the call to exitClientListenerMethod gets skipped. listener is the DelayedListener, status is OK and thrown is null.

Thank you for your help!

It looks like the key in clientCallStartEnter that is searched for is:

ClientCallImpl{method=MethodDescriptor{fullMethodName=transaction.TransactionService/Description, type=UNARY, idempotent=false, safe=false, sampledToLocalTracing=true, requestMarshaller=io.grpc.protobuf.lite.ProtoLiteUtils$MessageMarshaller@5d332969, responseMarshaller=io.grpc.protobuf.lite.ProtoLiteUtils$MessageMarshaller@7cc9ce8, schemaDescriptor=transaction.TransactionServiceGrpc$TransactionServiceMethodDescriptorSupplier@2e27d72f}}

while the key of the item in clientCallSpans is

PendingCall{realCall=ClientCallImpl{method=MethodDescriptor{fullMethodName=transaction.TransactionService/Description, type=UNARY, idempotent=false, safe=false, sampledToLocalTracing=true, requestMarshaller=io.grpc.protobuf.lite.ProtoLiteUtils$MessageMarshaller@5d332969, responseMarshaller=io.grpc.protobuf.lite.ProtoLiteUtils$MessageMarshaller@7cc9ce8, schemaDescriptor=transaction.TransactionServiceGrpc$TransactionServiceMethodDescriptorSupplier@2e27d72f}}}

OK, so the issue is that the span is not registered properly in the first place.
What's responsible for the span creation and registration is ChannelInstrumentation, for which I do see matches in the log.
Try to debug this instrumentation advice methods (onEnter and onExit) and see if you get anything. The cause may be that we are not instrumenting the relevant io.grpc.Channel instrumentation.
What you can also do is debug the io.grpc.Channel#newCall interface method and see if there are invocations we miss entirely.
We exclude io.grpc.internal.ManagedChannelImpl$RealChannel from instrumentation, maybe it is required in your case, but look if there is something else.
Also, which gRPC version are you using?

Oh, I completely forgot to add the gRPC version. I'm using 1.39.0. I just now tried it with much older versions, it looks like it broke after gRPC 1.32.2. With 1.32.2 it works as expected, after that it does nothing. In 1.33.0 it looks like they changed some channel subclassing, but that was reverted in 1.33.1. Neither of those versions (or any subsequent) work anymore with APM.

In 1.39.0 ChannelInstrumentation.onExit gets passed a wrapper around the call, a PendingCall. In 1.32.2 it gets passed the "real" ClientCallImpl object. clientCallStartEnter always receives the ClientCallImpl object. I guess that change was introduced in core: use DelayedClientCall for awaiting configSelector in ManangedChannelImpl by dapengzhang0 · Pull Request #7259 · grpc/grpc-java · GitHub

OK. I tried something real quick to check a direction.

Please try out this snapshot (ignore the color, the gRPC client test is expected to fail due to an extra span) and see what you get.

I just restored instrumentation for io.grpc.internal.ManagedChannelImpl$RealChannel, which we excluded before. It may cause extra spans, but hopefully we can pinpoint the exclusion of this Channel implementation as the culprit.

Looking forward to get you feedback.
Thanks.

Doesn't seem to have fixed the problem, the RealChannel is instrumented though. I have attached the log.
I have also uploaded my code example.

The attached project doesn't seem to be complete.

In any case, why do you have this dependency:

        <dependency>
            <groupId>co.elastic.apm</groupId>
            <artifactId>elastic-apm-agent</artifactId>
            <version>${elastic-apm.version}</version>
            <scope>provided</scope>
        </dependency>

?

I am not sure how your actual project is setup, but provided dependencies are added to the test classpath, so this may affect your results. The agent should never be added as a dependency, only attached using one of the documented methods.

Please see if this is related or what is missing from the referenced project.

I'll have a look if something is missing in the project. I added the dependency according to the trouble-shooting guide to enable debugging.

OK, but if it's on the classpath, it may be problematic.
If you did not already, please try to run the test with the snapshot I provided without this agent dependency.

I tried running it again without in in the classpath and with your Snapshot. I got (basically) the same log as above.
The project should work, you have to compile it first so that the grpc code gets generated and the IDE recognices it. Then MainServer needs to be startet (it's own process so that it doesn't get any apm instrumentalization), and MainClient connects to the server when started. Nothing in a test framework, to not add additional error sources.

I added my elasticapm.properties in the project.

OK, that's weird.
Using your project, I could reproduce what you report:

However, keeping everything the same and only restoring instrumentation for io.grpc.internal.ManagedChannelImpl$RealChannel, the span is created as expected:

So I do believe this channel's instrumentation is related, but we need to figure out what's working differently on your setup. Please try to use the provided snapshot and debug GrpcHelper#clientCallStartEnter and GrpcHelper#registerSpan again, see what happens. The registerSpan should be called twice per client call and store two keys in the clientCallSpans map, then clientCallStartEnter may be called only once, but it should find the key. Please let me know if you realize why it doesn't.

@YvesAndre I started a big change in our gRPC instrumentation, part of which should fix the issue you reported. Can you please test the branch's snapshot and see how it works for you?

@Eyal_Koren Looks great! Seems to work without problems. Do you want / need a new apm.log of it working?

No, your confirmation is perfect!
Thanks a lot for you invaluable feedback!