Missing data in APM Dashboards in Kibana

Elastic Search Version: 7.4.0
APM Server Version: 7.4.0
APM Client Version: APM Java Agent: 1.14 (current)
APM Agent language and version: Java 1.8

@felixbarny

I often to notice missing data in APM Dashboards such as "Time spent by span type", "Transaction duration request per minute" and other metrics dashboards.

I do not see any issue in the logs of APM agent or APM server or in Logstash. Stack monitoring also looks good.

Is there any possibility of data(span, transaction, metrics) lose in java APM agent? I do not see any debug message for ring buffer full and looks like no issue with ring buffer size. I am using default reporting interval(metrics_interval and api_request_time). Do i need to adjust the reporting interval to report more frequently than the default configuration?

Or is it something to be done with Kibana dashboarding logic(bucket interval?)


Update: I changed the reporting interval configuration to report the span, transaction, errors and metrics more often, now the data in Kibana APM dashboards looks fine without gaps(missing values).

@felixbarny Can you please help me to understand the logic for this behavior?

Hey :wave:
We introduced a circuit breaker in 1.14.0, sounds like it may be related. Did you enable it?
Can you see Tracer switched to PAUSED/RUNNING state messages in your logs?

@Eyal_Koren

I am aware of Circuit breaker. We did not enable it. The same behavior is observed even in 1.13 versions.

Thanks,
Suren.

Could you check if the metric documents are completely missing or if their timestamp is misaligned so that the rendering gets confused?

That may happen if metrics have to wait in the queue for some time until they get reported to APM Server. In that case no events would be missing but there might be time frames where no metrics get reported, followed by metrics being reported more frequently than the configured interval.

To check that, go to the discover tab in Kibana, select the apm-* index and search for processor.event: "metric".

@felixbarny Could that be because https://www.elastic.co/guide/en/apm/agent/java/current/config-reporter.html#config-api-request-size is too less? And APM agent has to wait if too much of data to transfer to APM server?

Not quite sure what you mean but it don't think it has something to do with api-request-size.

To add to what I previously described: The metrics get serialized only when they're currently being reported to the APM Server. That's also the time when the timestamp is taken. So even though metrics get put into the queue in a regular interval, the interval at which they are actually serialized varies. If the metrics sit in the queue for some time, it may skew time they get reported (and their @timestamp) in a way that some data_histogram buckets are empty which results in gaps in the graphs.

The fix would be to serialize the metrics before putting them into the queue. The downside of that is a higher memory footprint but that might be worth it. To mitigate some of that footprint, we could reuse the buffers for that.

But first we have to find out if that's actually the problem in your case. So I'd really appreciate if you could check the metric index as described in my previous answer :slightly_smiling_face:

Thanks,
Felix

@felixbarny

I see the data missing is in Kibana Discover view as well. As you suspected, i believe it is something to do with @timestamp value. Metrics reporting should have been delayed somehow and hence the wrong @timestamp value.

Regarding my previous comment, what if the actual request body size is beyond default api_request_size 768kb? Will the agent drop some data or process it in multiple batches?

If api_request_size is exceeded the agent starts a new request.

Within this timeframe (9:25-9:30), are there processor.events of type span or transaction?

Are you seeing any of these logs?
Error sending data to APM server: {}, response code is {}
Backing off for {} seconds (+/-10%)

It seems that the service you have selected runs with multiple instances, right? It seems unlikely that all instances see the gap at the same time due to the potential issue I've described. Maybe it's a transient network hiccup?

Could you select a single instance (a single observer.ephemeral_id) and send me another screenshot. I'm trying to find out if the agents send the metric events later or if they get lost.

@felixbarny

  1. I do not see any issue related to ring buffer, error while sending to APM server, or exponential back off at Agent. Even we verified APM server level and also in logstash level. And also verify stack monitoring but no errors.
  2. For testing purpose, we only have one server for the service name "bizx-cf-new". So what you see in the previous screenshot is only metrics from one server. When we include multiple servers, we never notice gap in graph as any one of the server metrics can fill in this gaps. It can be easily simulated by running a single instance of a simple spring boot test app in personal laptop with Elasticstack running as docker-compose.


The fact that there's also a gap in the transaction data suggests that there's a different issue. If it was simply related to queuing, the transaction data would be back-filled. But there's a spike at around 9:45. To me, it looks like requests have been queuing for 10 seconds because there was some blocking in the application. After the application was unblocked, a bunch of requests have been served at the same time. This would also explain the gaps in "Time spent by span type".

The fact that the gaps are also visible in the metrics seem to suggest that the JVM was in a stop-the-world pause. That would explain both the queued requests and the missing metrics. The usual suspect for stop-the-world pauses would be a major GC but the heap memory chart does not seem like there was a major GC. Which GC algorithm are you using? Could you share a screenshot of the GC activity (under the Heap Memory chart).

I see you have set the Heap size to 35.4 GB. Did you make sure that the JVM is still using compressed OOPS? Usually that limit is at 32GiB.

@felixbarny But when i changed the reporting interval configuration to report the span, transaction, errors and metrics more often(5s), the data in Kibana APM dashboards looks fine without gaps(missing values). Do you see any correlation? Just to give different aspect for this problem.

I can't properly explain that tbh. Maybe you were lucky in the sense that there were no stop-the-world pauses in that run?

The metrics interval does not influence how often transaction events are reported. The Transaction duration graph is based on transaction events (not metrics) and there are also gaps in your screenshot which correlate with the gaps in the metric-based graphs. So gaps in Transaction duration can't be affected by metrics_interval.

Hi @felixbarny

We changed both metrics_interval and api_request_time to 5s.
metrics_interval=5s
api_request_time=5s

After this change we do not see any gap in the APM related dashboards even if we test with only one server.

Does this frequent interval has any known side effects?

Following is JVM related configurations
-Xms36220m -Xmx36220m -XX:NewSize=7244m -XX:MaxNewSize=7244m -XX:MetaspaceSize=1024m -XX:MaxMetaspaceSize=3072m -XX:ParallelGCThreads=8 -XX:ReservedCodeCacheSize=1024m -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=50 -XX:+UseCMSInitiatingOccupancyOnly -XX:+CMSClassUnloadingEnabled -XX:+CMSScavengeBeforeRemark -XX:ThreadStackSize=512 -XX:+DisableExplicitGC -XX:+CMSParallelRemarkEnabled -XX:+GCHistory -XX:+HeapDumpOnOutOfMemoryError -XX:GCHistoryFilename=/app/tomcat-logs/gclog/gc.mo-74e70079e.prf -Xlog:gc*=info,safepoint:file=/app/tomcat-logs/gclog/gcverbose.mo-74e70079e.%p.%t.log:time,uptime,tid,level,tags:filecount=0 -Xbootclasspath/a:/app/jvm/lib/jvmx.jar:/app/jvm/lib/jvmx_tools.jar -XX:HeapDumpPath=/dumps/sc17bcf02t -XX:ErrorFile=/dumps/sc17bcf02t/hs_err_pid@PID_@COUNT.log -XX:ParGCCardsPerStrideChunk=4096 -XX:SurvivorRatio=1

Attaching the GC activities for the same


Ah, I didn't realize you were also setting the api_request_time. That can have an effect if there's a load balancer between the agent and the server. Is that how you have set things up?

See also https://www.elastic.co/guide/en/apm/server/7.6/common-problems.html#io-timeout

Out of curiosity, how are you collecting these events?

I'm not sure if I'm reading it correctly, but it seems that there was a pause for 4 minutes.

Maybe, what we're seeing here is two problems at once? Let's factor out the pause between 9:24-9:30 as it's likely due to the JVM being non-responsive.

Again, I would strongly suggest setting it to a size where you can leverage compressed OOPs. With your configuration, pointers are likely to be 64 bits instead of 32 bits. The limit is usually around 32g. To be safe, I'd start with-Xms30g -Xmx30g. You can verify that compressed OOPs are active by looking at your GC logs for the following message: compressed ordinary object pointers [true].

This topic was automatically closed 20 days after the last reply. New replies are no longer allowed.