Output Event Rate is showing gaps on the APM Stack Monitoring page

Hi, I am using an on-premise Elastic Stack 7.9.0 deployment, and recently we have been facing an OutOfMemory issue on one of the Java agents.
The APM CallTrace objects are taking up more than 2 GB, which is causing the issue.
We are also seeing some gaps in the "Output Event Rate" graph on the APM Stack Monitoring page.

What do these breaks in the graph mean? And is there any remedy for the APM agent OutOfMemory errors?

@vamsikrishna_medeti Sorry for the late reply.

If a query/process is very heavy (causing OOM errors), the ES Stack Monitoring plugin will start throttling the collection rate, which results in gaps on the chart. There are a couple of things we'll need to figure out first.

  • During the periods when the chart shows gaps, are there any logs/errors in the ES/Kibana console?

  • Have you tried identifying the query causing the OOM (usually the slowest one)? You can do this by setting the following in kibana.yml and restarting Kibana:

monitoring.elasticsearch.hosts: ["http://localhost:9200"]
monitoring.elasticsearch.logQueries: true
logging.verbose: true
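
In addition (a sketch only, assuming the default .monitoring-beats-7-* index naming; the thresholds are arbitrary examples), the Elasticsearch search slow log and the tasks API can help catch a slow query from the Elasticsearch side:

# Log queries against the monitoring indices that exceed these thresholds;
# entries go to the Elasticsearch search slow log.
PUT .monitoring-beats-7-*/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s"
}

# List currently running search tasks; detailed=true includes the query
# description, which helps spot a long-running monitoring query in the act.
GET _tasks?actions=*search*&detailed=true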
  • Try running the output event rate query independently and see whether it occasionally times out (or whether the result also has gaps). Be sure to substitute your own cluster_uuid:
GET .monitoring-beats-6-*,.monitoring-beats-7-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "cluster_uuid": "Q4FBbFszTj6jnCWhBG0Pgw"
          }
        },
        {
          "range": {
            "beats_stats.timestamp": {
              "gte": "now-1h"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "check": {
      "date_histogram": {
        "field": "beats_stats.timestamp",
        "fixed_interval": "30s"
      },
      "aggs": {
        "metric": {
          "max": {
            "field": "beats_stats.metrics.libbeat.output.events.total"
          }
        },
        "metric_deriv": {
          "derivative": {
            "buckets_path": "event_rate",
            "gap_policy": "skip",
            "unit": "1s"
          }
        },
        "beats_uuids": {
          "terms": {
            "field": "beats_stats.beat.uuid",
            "size": 1
          },
          "aggs": {
            "event_rate_per_beat": {
              "max": {
                "field": "beats_stats.metrics.libbeat.output.events.total"
              }
            }
          }
        },
        "event_rate": {
          "sum_bucket": {
            "buckets_path": "beats_uuids>event_rate_per_beat",
            "gap_policy": "skip"
          }
        }
      }
    }
  }
}
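
If you are unsure of the cluster_uuid, or want to first rule out missing documents, the sketch below is a simplified check (not the exact query the UI runs; the cluster_uuid filter is dropped for brevity, so add it back if this monitoring cluster collects from more than one cluster):

# The root endpoint returns the cluster_uuid to plug into the query above.
GET /

# Count monitoring documents per 30s bucket over the same window. Empty
# buckets during the gap periods would point at throttled or failed
# collection rather than at the chart query itself.
GET .monitoring-beats-6-*,.monitoring-beats-7-*/_search
{
  "size": 0,
  "query": {
    "range": {
      "beats_stats.timestamp": {
        "gte": "now-1h"
      }
    }
  },
  "aggs": {
    "docs_over_time": {
      "date_histogram": {
        "field": "beats_stats.timestamp",
        "fixed_interval": "30s"
      }
    }
  }
}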
  • One more thing to try is different time ranges: instead of the default of 1 hour ago, try 15 minutes, 6 hours, etc. This way we can figure out whether it's a max-buckets issue.
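
For example, here is a trimmed-down variant of the query above over a wider window (a sketch only: it drops the per-beat terms/sum_bucket breakdown, which is fine if only one beat reports into this cluster, and widens fixed_interval along with the range so the bucket count stays low):

GET .monitoring-beats-6-*,.monitoring-beats-7-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "cluster_uuid": "Q4FBbFszTj6jnCWhBG0Pgw"
          }
        },
        {
          "range": {
            "beats_stats.timestamp": {
              "gte": "now-6h"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "check": {
      "date_histogram": {
        "field": "beats_stats.timestamp",
        "fixed_interval": "2m"
      },
      "aggs": {
        "total_events": {
          "max": {
            "field": "beats_stats.metrics.libbeat.output.events.total"
          }
        },
        "events_per_second": {
          "derivative": {
            "buckets_path": "total_events",
            "gap_policy": "skip",
            "unit": "1s"
          }
        }
      }
    }
  }
}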

  • This might also be because the cluster resources are under-provisioned. Have you tried adding nodes or increasing memory (JVM heap)?
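
Before scaling up, it is worth a quick look at how hot the nodes already are. A minimal check using the standard cat nodes API (nothing specific to your setup assumed):

# Per-node heap and CPU pressure at a glance.
GET _cat/nodes?v&h=name,heap.percent,heap.max,ram.percent,cpu,load_1m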
