Distributed trace in wrong order in timeline

Hi,

I tested a simple use case: a JMS client sends a message to a listener and gets a result back from it.

The test case and the expected trace are shown below.

(I am using the ELK stack 7.0.1 and the APM agent 1.6.1 with Oracle JDK 8.)

But I get the following APM trace in Kibana.

And this is the Discover page for that trace.

Hi again :slight_smile: !
Can you attach the code you used for this test app?

Hi,

This sounds related to another thread posted earlier: "APM transactions displayed in wrong time order".

A fix was applied here: https://github.com/elastic/kibana/issues/24072. As of this time it has not yet been released, but will be released with Kibana 7.2.

To verify that this actually solves your problem, can you please post the XHR response from the "/traces" endpoint? The URL will look like:
<host>/apm/traces/88cf30e3cf1b935d976238b0d92ec81a?start=2019-04-24T11%3A01%3A31.927Z&end=2019-04-25T11%3A01%3A31.927Z
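
If it helps, the same JSON can also be fetched outside the browser. Below is a minimal sketch using the JAX-RS client you are already using; the Kibana host/port and the trace ID are just placeholders, and it assumes your Kibana instance does not require authentication:

import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import javax.ws.rs.core.MediaType;

public class FetchTrace {
    public static void main(String[] args) {
        // Fetch the raw trace JSON straight from the Kibana APM API
        // (placeholder host, port and trace ID - adjust to your setup)
        Client kibanaClient = ClientBuilder.newClient();
        String traceJson = kibanaClient
                .target("http://localhost:5601/api/apm/traces/88cf30e3cf1b935d976238b0d92ec81a")
                .queryParam("start", "2019-04-24T11:01:31.927Z")
                .queryParam("end", "2019-04-25T11:01:31.927Z")
                .request(MediaType.APPLICATION_JSON)
                .get(String.class);
        System.out.println(traceJson);
    }
}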

So I ran some further tests.

I wrote two JAX-RS web services with WildFly 16 and Java EE 8.

ServiceA#method1 -> ServiceB#method1 -> ServiceA#method2

All the calls are async.

ServiceA:

@GET
@Path("/method1")
@Produces(MediaType.APPLICATION_JSON)
public Response method1() {
    logger.info("Start method1");
    // Async fire-and-forget call to ServiceB; the Future is never awaited,
    // so this method returns before ServiceB has finished (or even started).
    final Future<Response> responseFuture = client.target("http://localhost:8080/api/serivceB/method1")
            .request(MediaType.APPLICATION_JSON).async().get();
    logger.info("End method1");
    return Response.ok(new MyResponse("It's ok ServiceA method1")).build();
}

@GET
@Path("/method2")
@Produces(MediaType.APPLICATION_JSON)
public Response method2() {
    logger.info("Start method2");
    try {
        // Simulate 1 second of work
        Thread.sleep(1000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    logger.info("End method2");
    return Response.ok(new MyResponse("It's ok ServiceA method2")).build();
}

ServiceB:

@GET
@Path("/method1")
@Produces(MediaType.APPLICATION_JSON)
public Response method1() {
    logger.info("Start method1");
    // Async fire-and-forget call back to ServiceA#method2
    client.target("http://localhost:8080/wms/api/serivceA/method2")
            .request(MediaType.APPLICATION_JSON).async().get();
    logger.info("End method1");
    return Response.ok(new MyResponse("It's ok ServiceB method1")).build();
}

And the log output:

20:00:05,057 INFO  [pas.core.base.web.rest.boundary.ServiceA] (default task-9) Start method1
20:00:05,058 INFO  [pas.core.base.web.rest.boundary.ServiceA] (default task-9) End method1
20:00:05,081 INFO  [pas.core.base.web.rest.boundary.ServiceB] (default task-9) Start method1
20:00:05,082 INFO  [pas.core.base.web.rest.boundary.ServiceB] (default task-9) End method1
20:00:05,093 INFO  [pas.core.base.web.rest.boundary.ServiceA] (default task-9) Start method2

The trace of that flow looks OK.

I think the problem occurs when I set the trace ID myself.

So, on to the second test.

A JAX-RS service sends a message over JMS, and the consumer manually sets the trace ID.

JmsSender JAX-RS service (auto-instrumented by the Elastic APM Java agent):

@GET
@Path("/send")
@Produces(MediaType.APPLICATION_JSON)
public Response send() {
    logger.info("Start Jms Send");
    try {
        final Map<String, String> headers = new HashMap<>();
        final JMSProducer producer = context.createProducer();
        // Let the agent inject its distributed tracing headers into the map
        ElasticApm.currentTransaction().injectTraceHeaders(headers::put);
        logger.info("TraceID " + headers.get(ElasticApmUtils.Es_TraceHeaders.ElasticApmTraceparent.getElasticheader()));
        // The trace ID is carried to the consumer as a JMS message property
        producer.setProperty(ElasticApmUtils.Es_TraceHeaders.ElasticApmTraceparent.getJmsheader(),
                headers.get(ElasticApmUtils.Es_TraceHeaders.ElasticApmTraceparent.getElasticheader()));
        producer.setProperty("type", "T2C");
        producer.setProperty("sender", "Dummy");
        producer.send(topic,"Hello");
        logger.info("Send ok");
    } catch (Exception e) {
        return Response.status(Response.Status.INTERNAL_SERVER_ERROR).entity(e).build();
    }  finally {
        logger.info("Finish Jms Send");
    }
    return Response.ok(new ServiceA.MyResponse("It's ok JmsTest send")).build();
}

JMS receiver:

@Override
public void onMessage(Message message) {
    logger.info("Start onMessage");
    AtomicReference<String> apmTrace = new AtomicReference<>();
    try {
        // Read the trace ID back out of the JMS message property
        apmTrace.set(message.getStringProperty(ElasticApmUtils.Es_TraceHeaders.ElasticApmTraceparent.getJmsheader()));
    } catch (JMSException e) {
        // IGNORE
    }

    logger.info("TraceID " + apmTrace.get());

    // We only carry one header, so the headerName argument can be ignored here
    final Transaction transaction = ElasticApm.startTransactionWithRemoteParent(headerName -> apmTrace.get())
            .setType("external-event")
            .setName("DummyController#onMessage");
    try (Scope scope = transaction.activate()) {
        // Simulate 100 ms of processing inside the transaction
        Thread.sleep(100);
    } catch (Exception e) {
        transaction.captureException(e);
    } finally {
        transaction.end();
        logger.info("Finish onMessage");
    }
}
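
For reference, the ElasticApmUtils.Es_TraceHeaders helper used above is not included in this post. A minimal, purely hypothetical sketch of what such a helper could look like, assuming it only maps the agent's trace header name to a JMS-legal property name (JMS property names must be valid Java identifiers, so dashes are not allowed):

// Hypothetical helper, not part of the original code: maps the Elastic APM trace header
// name (assumed here to be elastic-apm-traceparent) to a dash-free JMS property name.
public enum Es_TraceHeaders {
    ElasticApmTraceparent("elastic-apm-traceparent", "elastic_apm_traceparent");

    private final String elasticHeader;
    private final String jmsHeader;

    Es_TraceHeaders(String elasticHeader, String jmsHeader) {
        this.elasticHeader = elasticHeader;
        this.jmsHeader = jmsHeader;
    }

    public String getElasticheader() { return elasticHeader; }

    public String getJmsheader() { return jmsHeader; }
}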

And here comes the strange behaviour.
I called the REST endpoint several times. Some of the traces are displayed as expected and some are not.

Wrong trace:

Right trace:

Both calls from the Discover page:

@suikast42 Let's try @sqren's suggestion.
In your browser, go to your APM service page in Kibana and open the developer tools (e.g. F12 in Chrome). Click on the problematic trace, select the XHR filter and look for a URL with the structure <host>/api/apm/traces/88cf30e3cf1b935d976238b0d92ec81a?start=2019-04-24T11%3A01%3A31.927Z&end=2019-04-25T11%3A01%3A31.927Z. It should look something like this:
Then click the Response tab, and copy and paste the content here:

{"trace":[{"@timestamp":"2019-05-05T18:56:47.201Z","transaction":{"duration":{"us":9000},"type":"request","result":"HTTP 2xx","sampled":true,"span_count":{"dropped":0,"started":0},"id":"9f617a2a2dab3821","name":"JmsTest#send"},"process":{"pid":4091180,"title":"C:\\Program Files\\Java\\jdk1.8.0_171\\jre\\bin\\java.exe"},"timestamp":{"us":1557082607201003},"observer":{"version_major":7,"id":"f3db7b33-3355-4392-bf02-61baf731ae35","ephemeral_id":"ffa1fad6-137d-4bad-a95b-19a2c6867a2c","type":"apm-server","hostname":"apm-server","version":"7.0.1"},"ecs":{"version":"1.1.0-dev"},"processor":{"name":"transaction","event":"transaction"},"service":{"name":"wms-kemp","language":{"name":"Java","version":"1.8.0_171"},"runtime":{"version":"1.8.0_171","name":"Java"}},"agent":{"name":"java","version":"1.6.1"},"host":{"os":{"platform":"Windows 10"},"ip":"172.17.8.1","hostname":"C0071","architecture":"amd64"},"trace":{"id":"730642fe5eff111260759682833886c8"},"http":{"version":"1.1","request":{"headers":{"Host":["localhost:8080"]},"socket":{"encrypted":false,"remote_address":"0:0:0:0:0:0:0:1"},"method":"get"},"response":{"finished":true,"status_code":200,"headers":{"Content-Type":["application/json"]},"headers_sent":false}},"url":{"domain":"localhost","path":"/wms/api/jmstest/send","port":8080,"scheme":"http","full":"http://localhost:8080/wms/api/jmstest/send"}},{"@timestamp":"2019-05-05T18:56:47.216Z","process":{"title":"C:\\Program Files\\Java\\jdk1.8.0_171\\jre\\bin\\java.exe","pid":4091180},"parent":{"id":"9f617a2a2dab3821"},"ecs":{"version":"1.1.0-dev"},"processor":{"name":"transaction","event":"transaction"},"transaction":{"span_count":{"started":0,"dropped":0},"id":"c7af336fa33db2e1","name":"DummyController#onMessage","duration":{"us":99931},"type":"external-event","sampled":true},"agent":{"version":"1.6.1","name":"java"},"host":{"hostname":"C0071","architecture":"amd64","os":{"platform":"Windows 10"},"ip":"172.17.8.1"},"service":{"name":"wms-kemp","language":{"name":"Java","version":"1.8.0_171"},"runtime":{"name":"Java","version":"1.8.0_171"}},"trace":{"id":"730642fe5eff111260759682833886c8"},"timestamp":{"us":1557082607216001},"observer":{"ephemeral_id":"ffa1fad6-137d-4bad-a95b-19a2c6867a2c","type":"apm-server","hostname":"apm-server","version":"7.0.1","version_major":7,"id":"f3db7b33-3355-4392-bf02-61baf731ae35"}}],"errorsPerTransaction":{}}

And that is the Discover result.

That was very helpful, thanks.

Your trace looked like this before the fix was applied:

And with the fix it looks like this:

As mentioned, the fix will be released in Kibana 7.2.


That is looking very good.
But I still see a little bug: the duration of the transaction sample is 9,000 µs; shouldn't it be over 100 ms?

I think this shows the correct timings.
According to the code you posted, the send is non-blocking, meaning it posts a message to the queue and immediately returns a response, so the sender transaction only covers those few milliseconds.
The 100 ms only happens on the receiver (onMessage) side. You can see that in the JSON you posted: the JmsTest#send transaction has a duration of 9,000 µs, while DummyController#onMessage has a duration of 99,931 µs.

Yes, you are absolutely right. But shouldn't the duration cover the whole trace (trace ID)?


Well, it depends...
If you have a service that does lots of stuff, one of which is posting a message to a queue for later asynchronous processing (for example, gathering statistics), and you set an alert on your service to make sure it performs within an SLA, would you like to get alerts for delayed processing even though the SLA is maintained?
Often, messaging serves exactly these kinds of use cases where you don't care about delays, so this is tricky.

Can you elaborate on how you use messaging in your real apps and what you would expect from it? For example, are you actually using it in a synchronous way, or like the code sample you posted? And if you use it this way (asynchronously), would you like your real app to consider the receiver processing as part of the transaction duration? In all cases, or only some?

We are starting to look into how we deal with such asynchronous messaging in general in APM, so this can be a valuable input for us.


We use messaging to decouple our IoT processing (stream processing, OPC UA, ...) from our business logic, in the async way I showed.

From my point of view it's important to see the whole trace:
IoT processing -> business logic -> IoT response

Would it be wrong to base the duration on the whole trace (trace ID)?

Would it be wrong to base the duration on the whole trace (trace ID)?

I don't know, I guess sometimes it is what you want and sometimes not :slight_smile:

Let me ask you: let's say your IoT processing takes a while to execute. Does this mean there is a latency problem with the service that uses it? If you monitor your services and the processing gets delayed, is this something you would like your services to count in their duration?

I agree with that :smile:

If I understand the question right, then no :blush:. The IoT service and the business service don't know each other.

The latency can occur during an OPC UA call in the IoT processing or a database call in the business processing.

If I understand YOU properly :slight_smile:, it seems like you would prefer the time measurement to remain as it is now, right?

I think so.

In my case it's maybe better to do custom analysis on the traces to detect messaging processing, business logic processing, etc., because it really depends :smile:

OK. Thanks a lot for your feedback!


Thanks too.

I will wait for v7.2 and report feedback again ;-).

Is there a scheduled date for 7.2?

There is no specific scheduled date; please watch for the release announcements.
We will be happy to get further feedback.
Thanks!
