APM queue is full

Kibana version: 7.5.2

Elasticsearch version: 7.5.2

APM Server version: 7.6.0

APM Agent language and version: Java 1.12.0

Browser version:

Original install method (e.g. download page, yum, deb, from source, etc.) and version: kubernetes (elastic operator and elastic-apm helm chart)

We are outputting directly to Elasticsearch.

Description of the problem including expected versus actual behavior. Please include screenshots (if relevant):


The APM pods' network output traffic (700 Mbit/s) is far higher than the input traffic (50 Mbit/s).
How is it possible for the queues to fill when the server is sending out more network data than it receives?
The max queue size on the Java agent is around 5120, and we have around 300 instances connecting to our APM Server.
There are no rejections in the write queue on the Elasticsearch side.

We are indexing an average of 35,000 events per second on the Elasticsearch side.

We have 1 apm server pod and 2 elasticsearch pods (resource counts are below)
here's the related config:
setup.template.settings:
  index.number_of_shards: 4
  index.number_of_routing_shards: 28
queue:
  mem:
    events: 5000000
    flush.min_events: 0
    flush.timeout: 1s
output.elasticsearch:
  hosts: ["elasticsearch-es-http.elastic-system.svc.cluster.local:9200"]
  worker: 30
  bulk_max_size: 20000

Resources for services:
apmServerResources (1 in count):
  limits:
    cpu: "15"
    memory: 120Gi
  requests:
    cpu: "15"
    memory: 120Gi

elasticSearchResources (2 in count):
  limits:
    cpu: "15"
    memory: 120Gi
  requests:
    cpu: "15"
    memory: 120Gi

Provide logs and/or server output (if relevant):
"response_code": 503, "error": "queue is full"

Hi @Rakesh_B,

The Java agent sends data in a compressed format to the APM Server, whereas the APM Server by default sends data uncompressed to Elasticsearch. The APM Server compression can be customized by configuring output.elasticsearch.compression_level; valid values are 0 to 9.
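For illustration, a hedged sketch of what enabling that setting could look like in the apm-server config (the host is taken from the config quoted earlier; the level of 5 is just an example trade-off between CPU and bandwidth, not a recommendation):

```yaml
output.elasticsearch:
  hosts: ["elasticsearch-es-http.elastic-system.svc.cluster.local:9200"]
  # 0 = uncompressed (default), 9 = maximum compression;
  # higher levels spend more APM Server CPU to save network bandwidth
  compression_level: 5
```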

Hi @simitt,
Thank you for your reply.

  • If I send data uncompressed to Elasticsearch, would it make indexing faster?
    OR
  • Should I send data uncompressed from the Java APM agent to make indexing faster?

Which of the above options is better for improving indexing performance?

My response was meant to clarify why measured network data might differ.

For performance tuning, I suggest taking a look at the tune data ingestion section of the APM Server documentation and the overhead and performance tuning section of the Java agent documentation.

Hello @Rakesh_B

I'm doing benchmarking test with Elastic APM, and had a problem with 1000 TPS traffic.
The time for loading Kibana APM pages is too long even with 1000 TPS traffic of 24 hours data.
(It took more than 20 seconds)

I posted a question on this forum; here is the link.

My question is: do you (or don't you) experience the same issue, i.e. that the Kibana APM pages are too slow?

Hi,
I definitely feel that APM graphs load more slowly in Kibana than Filebeat logs, but we have no problem loading metrics with thousands of TPS. Check your Elasticsearch search latency and data volumes.
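One hedged way to turn Elasticsearch's search counters into a latency figure: the node stats API (GET /_nodes/stats) exposes cumulative query_total and query_time_in_millis per node, and their ratio gives the average query-phase latency. The numbers below are made-up placeholders, not measurements from this thread:

```python
# Field names match the indices.search section of GET /_nodes/stats;
# the values are illustrative placeholders only.
search_stats = {
    "query_total": 120_000,             # query phases executed since node start
    "query_time_in_millis": 2_400_000,  # total time spent in the query phase
}

avg_query_ms = search_stats["query_time_in_millis"] / search_stats["query_total"]
print(f"average query-phase latency: {avg_query_ms:.1f} ms")  # 20.0 ms
```

Comparing this figure before and after a load test is a cheap way to see whether the cluster, rather than Kibana, is the bottleneck.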

Hi @Rakesh_B,

Thank you for your reply.

I checked es search latency, and the es query itself took 20 secs.

I think the big differences between your environment and mine are:

  • memory: 120Gi vs 12Gi
  • index template: index.number_of_routing_shards: 28 vs 1 (the default value)

I'll test whether these two points are related to my problem.

Could you please check if there are any other hints for my problem?
Here is my test environment:

setup.template.settings:
  index.number_of_shards: 5

Resources for services:
elasticSearchResources (10 in count):
  limits:
    cpu: "14"
    memory: 12Gi
  requests:
    cpu: "14"
    memory: 12Gi

Node H/W spec:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fast
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd

You have 10 nodes with 14 CPUs and 12 GB RAM each.
Yesterday I increased our capacity to 3 times what was mentioned above and distributed it across 6 data nodes. We started seeing a huge difference: we shipped 1 GB of APM data every 22 seconds. Search speed isn't that great because I gave only 20 GB of heap space, and I still saw "queue is full" errors, which means we either need more capacity or need to tune APM to discard unwanted data.

elasticSearchDataNodeCount: 6
elasticSearchMasterNodeCount: 3
elasticSearchResources:
  limits:
    cpu: "15"
    memory: 123Gi
  requests:
    cpu: "15"
    memory: 123Gi
elasticSearchJavaOpts: "-Xms20g -Xmx20g"

I was also wrong before: APM data loads much faster than Filebeat logs, and I can see the latest data within 1-2 minutes.
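On "tune APM to discard unwanted data": one hedged option, since the agents here run on Kubernetes, is to set the Java agent's sampling and span-cap options via environment variables on the application pods. The values below are illustrative only, not recommendations:

```yaml
# Hypothetical fragment of an application pod spec; adjust values to your
# tolerance for losing trace detail.
env:
  - name: ELASTIC_APM_TRANSACTION_SAMPLE_RATE
    value: "0.2"    # keep ~20% of transactions instead of all of them
  - name: ELASTIC_APM_TRANSACTION_MAX_SPANS
    value: "250"    # cap spans recorded per transaction (agent default is 500)
```

Lowering the sample rate reduces event volume at the source, which relieves both the APM Server queue and Elasticsearch indexing.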

Hello, @Rakesh_B

I ran a test with the same environment as yours, but it is still slow, and I got this screen.
(It's because Elasticsearch didn't respond within 30 secs.)

I'm just wondering whether you also see this screen when you check the load time?

Here is full yaml that I used.

apiVersion: apm.k8s.elastic.co/v1
kind: ApmServer
metadata:
  labels:
    com.netmarble/layer: tool
    com.netmarble/zone: test
  name: heimdall-storage-test
spec:
  config:
    logging:
      metrics.enabled: true
    max_proc: 2
    output.elasticsearch:
      bulk_max_size: 5120
      worker: 10
    queue.mem.events: 51200
    setup.template.settings:
      index.number_of_replicas: 1
      index.number_of_shards: 5
      index.refresh_interval: 10s
      number_of_routing_shards: 30
  count: 1
  elasticsearchRef:
    name: heimdall-storage-test
  http:
    service:
      spec:
        type: LoadBalancer
  podTemplate:
    metadata:
      labels:
        app: apmServer
        project: paas
    spec:
      containers:
      - env:
        - name: ES_JAVA_OPTS
          value: -Xms800m -Xmx800m
        name: apm-server
        resources:
          limits:
            cpu: 1
            memory: 1Gi
          requests:
            cpu: 1
            memory: 1Gi
  version: 7.6.1
---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  labels:
    com.netmarble/layer: tool
    com.netmarble/zone: test
  name: heimdall-storage-test
spec:
  http:
    service:
      spec:
        type: LoadBalancer
  nodeSets:
  - config:
      node.data: true
      node.ingest: true
      node.master: true
      node.store.allow_mmap: true
    count: 3
    name: node
    podTemplate:
      spec:
        containers:
        - env:
          - name: ES_JAVA_OPTS
            value: -Xms48g -Xmx48g
          name: elasticsearch
          resources:
            limits:
              cpu: 15
              memory: 120Gi
            requests:
              cpu: 15
              memory: 120Gi
        initContainers:
        - command:
          - sh
          - -c
          - sysctl -w vm.max_map_count=262144
          name: sysctl
          securityContext:
            privileged: true
        - command:
          - sh
          - -c
          - bin/elasticsearch-plugin install --batch repository-gcs
          name: install-gcs-plugins
        - command:
          - sh
          - -c
          - bin/elasticsearch-plugin install --batch repository-hdfs
          name: install-hdfs-plugins
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 200Gi
        storageClassName: standard
  secureSettings:
  - secretName: gcs-credentials
  version: 7.6.1

You are right, the APM UI is having issues; I'm facing a similar problem and have opened an issue here: APM UI Kibana Internal Server Error

The slowness problem is 100% reproducible for me. I tested more than 20 times with various configurations (number_of_shards, memory, CPU, codec, eager_global_ordinals, etc.) and the results are almost the same. I got small improvements, but it still takes more than 20 seconds to load a single page.
I also opened a SLOWNESS ISSUE 20 days ago and am still waiting for an answer.

I cannot understand why nobody else on this forum asks about it.