Distributed trace in wrong order in timeline

Hi,

I tested a simple use case: a JMS client sends a message to a listener and gets a result back from it.

The test case and the expected trace are shown below.

(I am using the ELK stack 7.0.1 and the APM agent 1.6.1 with Oracle JDK 8.)

But I get the following APM trace in Kibana.

And this is the Discover page for that trace.

Hi again :slight_smile: !
Can you attach the code you used for this test app?

Hi,

This sounds related to another thread posted earlier: "APM transactions displayed in wrong time order".

A fix was applied here: https://github.com/elastic/kibana/issues/24072. As of this time it has not yet been released, but will be released with Kibana 7.2.

To verify that this actually solves your problem, can you please post the XHR response from the "/traces" endpoint? The URL will look like:
<host>/apm/traces/88cf30e3cf1b935d976238b0d92ec81a?start=2019-04-24T11%3A01%3A31.927Z&end=2019-04-25T11%3A01%3A31.927Z
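
If it helps, the same JSON can also be fetched outside the browser. Below is a minimal sketch using the JAX-RS client you are already using; the Kibana host/port and the trace ID are just placeholders, and it assumes your Kibana instance does not require authentication:

import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import javax.ws.rs.core.MediaType;

public class FetchTrace {
    public static void main(String[] args) {
        // Fetch the raw trace JSON straight from the Kibana APM API
        // (placeholder host, port and trace ID - adjust to your setup)
        Client kibanaClient = ClientBuilder.newClient();
        String traceJson = kibanaClient
                .target("http://localhost:5601/api/apm/traces/88cf30e3cf1b935d976238b0d92ec81a")
                .queryParam("start", "2019-04-24T11:01:31.927Z")
                .queryParam("end", "2019-04-25T11:01:31.927Z")
                .request(MediaType.APPLICATION_JSON)
                .get(String.class);
        System.out.println(traceJson);
    }
}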

So I ran some further tests.

I wrote two JAX-RS web services with WildFly 16 and Java EE 8.

ServiceA#method1 -> ServiceB#method1 -> ServiceA#method2

All the calls are async.

ServiceA:

@GET
@Path("/method1")
@Produces(MediaType.APPLICATION_JSON)
public Response method1() {
    logger.info("Start method1");
    // Async fire-and-forget call to ServiceB; the Future is never awaited,
    // so this method returns before ServiceB has finished (or even started).
    final Future<Response> responseFuture = client.target("http://localhost:8080/api/serivceB/method1")
            .request(MediaType.APPLICATION_JSON).async().get();
    logger.info("End method1");
    return Response.ok(new MyResponse("It's ok ServiceA method1")).build();
}

@GET
@Path("/method2")
@Produces(MediaType.APPLICATION_JSON)
public Response method2() {
    logger.info("Start method2");
    try {
        // Simulate 1 second of work
        Thread.sleep(1000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    logger.info("End method2");
    return Response.ok(new MyResponse("It's ok ServiceA method2")).build();
}

ServiceB:

@GET
@Path("/method1")
@Produces(MediaType.APPLICATION_JSON)
public Response method1() {
    logger.info("Start method1");
    // Async fire-and-forget call back to ServiceA#method2
    client.target("http://localhost:8080/wms/api/serivceA/method2")
            .request(MediaType.APPLICATION_JSON).async().get();
    logger.info("End method1");
    return Response.ok(new MyResponse("It's ok ServiceB method1")).build();
}

And the log output:

20:00:05,057 INFO  [pas.core.base.web.rest.boundary.ServiceA] (default task-9) Start method1
20:00:05,058 INFO  [pas.core.base.web.rest.boundary.ServiceA] (default task-9) End method1
20:00:05,081 INFO  [pas.core.base.web.rest.boundary.ServiceB] (default task-9) Start method1
20:00:05,082 INFO  [pas.core.base.web.rest.boundary.ServiceB] (default task-9) End method1
20:00:05,093 INFO  [pas.core.base.web.rest.boundary.ServiceA] (default task-9) Start method2

The trace of that flow looks OK.

I think the problem occurs when I set the trace ID myself.

So, on to the second test.

A JAX-RS service sends a message over JMS, and the consumer manually sets the trace ID.

JmsSender JAX-RS service (auto-instrumented by the Elastic APM Java agent):

@GET
@Path("/send")
@Produces(MediaType.APPLICATION_JSON)
public Response send() {
    logger.info("Start Jms Send");
    try {
        final Map<String, String> headers = new HashMap<>();
        final JMSProducer producer = context.createProducer();
        // Let the agent inject its distributed tracing headers into the map
        ElasticApm.currentTransaction().injectTraceHeaders(headers::put);
        logger.info("TraceID " + headers.get(ElasticApmUtils.Es_TraceHeaders.ElasticApmTraceparent.getElasticheader()));
        // The trace ID is carried to the consumer as a JMS message property
        producer.setProperty(ElasticApmUtils.Es_TraceHeaders.ElasticApmTraceparent.getJmsheader(),
                headers.get(ElasticApmUtils.Es_TraceHeaders.ElasticApmTraceparent.getElasticheader()));
        producer.setProperty("type", "T2C");
        producer.setProperty("sender", "Dummy");
        producer.send(topic,"Hello");
        logger.info("Send ok");
    } catch (Exception e) {
        return Response.status(Response.Status.INTERNAL_SERVER_ERROR).entity(e).build();
    }  finally {
        logger.info("Finish Jms Send");
    }
    return Response.ok(new ServiceA.MyResponse("It's ok JmsTest send")).build();
}

JMS receiver:

@Override
public void onMessage(Message message) {
    logger.info("Start onMessage");
    AtomicReference<String> apmTrace = new AtomicReference<>();
    try {
        // Read the trace ID back out of the JMS message property
        apmTrace.set(message.getStringProperty(ElasticApmUtils.Es_TraceHeaders.ElasticApmTraceparent.getJmsheader()));
    } catch (JMSException e) {
        // IGNORE
    }

    logger.info("TraceID " + apmTrace.get());

    // We only carry one header, so the headerName argument can be ignored here
    final Transaction transaction = ElasticApm.startTransactionWithRemoteParent(headerName -> apmTrace.get())
            .setType("external-event")
            .setName("DummyController#onMessage");
    try (Scope scope = transaction.activate()) {
        // Simulate 100 ms of processing inside the transaction
        Thread.sleep(100);
    } catch (Exception e) {
        transaction.captureException(e);
    } finally {
        transaction.end();
        logger.info("Finish onMessage");
    }
}
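
For reference, the ElasticApmUtils.Es_TraceHeaders helper used above is not included in this post. A minimal, purely hypothetical sketch of what such a helper could look like, assuming it only maps the agent's trace header name to a JMS-legal property name (JMS property names must be valid Java identifiers, so dashes are not allowed):

// Hypothetical helper, not part of the original code: maps the Elastic APM trace header
// name (assumed here to be elastic-apm-traceparent) to a dash-free JMS property name.
public enum Es_TraceHeaders {
    ElasticApmTraceparent("elastic-apm-traceparent", "elastic_apm_traceparent");

    private final String elasticHeader;
    private final String jmsHeader;

    Es_TraceHeaders(String elasticHeader, String jmsHeader) {
        this.elasticHeader = elasticHeader;
        this.jmsHeader = jmsHeader;
    }

    public String getElasticheader() { return elasticHeader; }

    public String getJmsheader() { return jmsHeader; }
}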

And here comes the strange behaviour.
I called the REST endpoint several times. Some of the traces are displayed as expected and some are not.

Wrong trace:

Right trace:

Both calls from the Discover page:

@suikast42 Let's try @sqren's suggestion.
In your browser, go to your APM service page in Kibana and open the developer tools (e.g. F12 in Chrome). Click on the problematic trace, select the XHR filter and look for a URL with the structure <host>/api/apm/traces/88cf30e3cf1b935d976238b0d92ec81a?start=2019-04-24T11%3A01%3A31.927Z&end=2019-04-25T11%3A01%3A31.927Z. It should look something like this:
Then click the Response tab, and copy and paste the content here:

{"trace":[{"@timestamp":"2019-05-05T18:56:47.201Z","transaction":{"duration":{"us":9000},"type":"request","result":"HTTP 2xx","sampled":true,"span_count":{"dropped":0,"started":0},"id":"9f617a2a2dab3821","name":"JmsTest#send"},"process":{"pid":4091180,"title":"C:\\Program Files\\Java\\jdk1.8.0_171\\jre\\bin\\java.exe"},"timestamp":{"us":1557082607201003},"observer":{"version_major":7,"id":"f3db7b33-3355-4392-bf02-61baf731ae35","ephemeral_id":"ffa1fad6-137d-4bad-a95b-19a2c6867a2c","type":"apm-server","hostname":"apm-server","version":"7.0.1"},"ecs":{"version":"1.1.0-dev"},"processor":{"name":"transaction","event":"transaction"},"service":{"name":"wms-kemp","language":{"name":"Java","version":"1.8.0_171"},"runtime":{"version":"1.8.0_171","name":"Java"}},"agent":{"name":"java","version":"1.6.1"},"host":{"os":{"platform":"Windows 10"},"ip":"172.17.8.1","hostname":"C0071","architecture":"amd64"},"trace":{"id":"730642fe5eff111260759682833886c8"},"http":{"version":"1.1","request":{"headers":{"Host":["localhost:8080"]},"socket":{"encrypted":false,"remote_address":"0:0:0:0:0:0:0:1"},"method":"get"},"response":{"finished":true,"status_code":200,"headers":{"Content-Type":["application/json"]},"headers_sent":false}},"url":{"domain":"localhost","path":"/wms/api/jmstest/send","port":8080,"scheme":"http","full":"http://localhost:8080/wms/api/jmstest/send"}},{"@timestamp":"2019-05-05T18:56:47.216Z","process":{"title":"C:\\Program Files\\Java\\jdk1.8.0_171\\jre\\bin\\java.exe","pid":4091180},"parent":{"id":"9f617a2a2dab3821"},"ecs":{"version":"1.1.0-dev"},"processor":{"name":"transaction","event":"transaction"},"transaction":{"span_count":{"started":0,"dropped":0},"id":"c7af336fa33db2e1","name":"DummyController#onMessage","duration":{"us":99931},"type":"external-event","sampled":true},"agent":{"version":"1.6.1","name":"java"},"host":{"hostname":"C0071","architecture":"amd64","os":{"platform":"Windows 10"},"ip":"172.17.8.1"},"service":{"name":"wms-kemp","language":{"name":"Java","version":"1.8.0_171"},"runtime":{"name":"Java","version":"1.8.0_171"}},"trace":{"id":"730642fe5eff111260759682833886c8"},"timestamp":{"us":1557082607216001},"observer":{"ephemeral_id":"ffa1fad6-137d-4bad-a95b-19a2c6867a2c","type":"apm-server","hostname":"apm-server","version":"7.0.1","version_major":7,"id":"f3db7b33-3355-4392-bf02-61baf731ae35"}}],"errorsPerTransaction":{}}

And that is the Discover result.

That was very helpful, thanks.

Your trace looked like this before the fix was applied:

And with the fix it looks like this:

As mentioned, the fix will be released in Kibana 7.2.


That is looking very good.
But I still see a little bug: the duration of the transaction sample is 9,000 µs; shouldn't it be over 100 ms?

I think this shows the correct timings.
According to the code you posted, the send is non-blocking, meaning it posts a message to the queue and immediately returns a response, so the sender transaction only covers those few milliseconds.
The 100 ms only happens on the receiver (onMessage) side. You can see that in the JSON you posted: the JmsTest#send transaction has a duration of 9,000 µs, while DummyController#onMessage has a duration of 99,931 µs.

Yes, you are absolutely right. But shouldn't the duration cover the whole trace (trace ID)?


Well, it depends...
If you have a service that does lots of stuff, one of which is posting a message to a queue for later asynchronous processing (for example, gathering statistics), and you set an alert on your service to make sure it performs within an SLA, would you like to get alerts for delayed processing even though the SLA is maintained?
Often, messaging serves exactly these kinds of use cases where you don't care about delays, so this is tricky.

Can you elaborate on how you use messaging in your real apps and what you would expect from it? For example, are you actually using it in a synchronous way, or like the code sample you posted? And if you use it this way (asynchronously), would you like your real app to consider the receiver processing as part of the transaction duration? In all cases, or only some?

We are starting to look into how we deal with such asynchronous messaging in general in APM, so this can be a valuable input for us.


We use messaging to decouple our IoT processing (stream processing, OPC UA, ...) from our business logic, in the async way I showed.

From my point of view it's important to see the whole trace:
IoT processing -> business logic -> IoT response

Would it be wrong to base the duration on the whole trace (trace ID)?

Would it be wrong to base the duration on the whole trace (trace ID)?

I don't know, I guess sometimes it is what you want and sometimes not :slight_smile:

Let me ask you: let's say your IoT processing takes a while to execute. Does this mean there is a latency problem with the service that uses it? If you monitor your services and the processing gets delayed, is this something you would like your services to count in their duration?

I agree with that :smile:

If I understand the question right, then no :blush:. The IoT service and the business service don't know each other.

The latency can occur during an OPC UA call in the IoT processing or a database call in the business processing.

If I understand YOU properly :slight_smile:, it seems like you would prefer the time measurement to remain as it is now, right?

I think so.

In my case it's maybe better to do custom analysis on the traces to detect messaging processing, business logic processing, etc., because it really depends :smile:

OK. Thanks a lot for your feedback!


Thanks too.

I will wait for v7.2 and report feedback again ;-).

Is there a scheduled date for 7.2?

There is no specific scheduled date; please watch for the release announcements.
We will be happy to get further feedback.
Thanks!
