APM 7.6.2: Error 503: Queue is full but server is sleeping

I have a problem with my APM Server.

Kibana version: 7.6.2

Elasticsearch version: 7.6.2

APM Server version: 7.6.2

Original install method (e.g. download page, yum, deb, from source, etc.) and version: download page

I'm running a stress test with Locust, sending 600 requests per second to a cluster of 6 APM Server nodes.
Each request contains more than one event: transactions and spans.
I got them by intercepting some requests from an APM Agent.

APM Server receives those requests and sends the events to an ES cluster, directly to the 3 hot data nodes.
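For reference, a replay like this can be sketched in Python (the payload below is illustrative, not the actual captured request; a real test would replay the NDJSON body captured from the agent):

```python
import json

# A captured intake/v2/events body is NDJSON: one metadata object first,
# then one line per transaction/span/metric event (values here are made up).
events = [
    {"metadata": {"service": {"name": "my-service",
                              "agent": {"name": "java", "version": "1.15.0"}}}},
    {"transaction": {"id": "945254c567a5417e",
                     "trace_id": "0123456789abcdef0123456789abcdef",
                     "name": "GET /api", "type": "request", "duration": 32.5,
                     "span_count": {"started": 1}}},
    {"span": {"id": "0aaaaaaaaaaaaaaa", "transaction_id": "945254c567a5417e",
              "trace_id": "0123456789abcdef0123456789abcdef",
              "parent_id": "945254c567a5417e",
              "name": "SELECT FROM users", "type": "db",
              "start": 1.2, "duration": 3.4}},
]

# One JSON object per line, newline-terminated.
body = "\n".join(json.dumps(e) for e in events) + "\n"

# With python-requests (as the user-agent in the logs below suggests),
# the replay would POST this body to an APM Server, e.g.:
#   requests.post("http://apm-server:8200/intake/v2/events", data=body,
#                 headers={"Content-Type": "application/x-ndjson"})
print(body.count("\n"))  # -> 3, one line per event
```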

This is my apm-server.yml:

apm-server:
  host: "0.0.0.0:8200"
  rum:
    enabled: true
  kibana:
    enabled: true
    host: "monitoring-kibana-00"

queue:
  mem:
    events: 368640
    flush.min_events: 2048
    flush.timeout: 1s


#-------------------------- Elasticsearch output --------------------------
output.elasticsearch:
  hosts: ["monitoring-es-hot-00", "monitoring-es-hot-01", "monitoring-es-hot-02"]
  indices:
    - index: "apm-%{[observer.version]}-sourcemap"
      when.contains:
        processor.event: "sourcemap"
    - index: "apm-%{[observer.version]}-transaction-%{[service.name]}-%{+yyyy.MM.dd}"
      when.contains:
        processor.event: "transaction"
    - index: "apm-%{[observer.version]}-transaction-%{[service.name]}-%{+yyyy.MM.dd}"
      when.contains:
        processor.event: "span"
    - index: "apm-%{[observer.version]}-transaction-%{[service.name]}-%{+yyyy.MM.dd}"
      when.contains:
        processor.event: "error"
    - index: "apm-%{[observer.version]}-metric-%{[service.name]}-%{+yyyy.MM.dd}"
      when.contains:
        processor.event: "metric"
    - index: "apm-%{[observer.version]}-onboarding-%{[service.name]}-%{+yyyy.MM.dd}"
      when.contains:
        processor.event: "onboarding"
  bulk_max_size: 5120
  worker: 36

When I start the test, it runs well up to roughly 600 rps. After a while, I start getting 503 errors.
I see this in my logs:

2020-05-14T11:15:05.102-0400	ERROR	pipeline/output.go:121	Failed to publish events: temporary bulk send failure
2020-05-14T11:15:05.219-0400	ERROR	pipeline/output.go:121	Failed to publish events: temporary bulk send failure
2020-05-14T11:15:05.337-0400	ERROR	pipeline/output.go:121	Failed to publish events: temporary bulk send failure
2020-05-14T11:15:07.005-0400	ERROR	[request]	middleware/log_middleware.go:95	queue is full	{"request_id": "a1229fbd-cb1e-42a2-9733-b1909cfad1a2", "method": "POST", "URL": "/intake/v2/events", "content_length": 4242, "remote_address": "10.194.40.232", "user-agent": "python-requests/2.23.0", "response_code": 503, "error": "queue is full"}
2020-05-14T11:15:07.005-0400	ERROR	[request]	middleware/log_middleware.go:95	queue is full	{"request_id": "72ff2067-1132-4e2c-9582-1ca9d0d20249", "method": "POST", "URL": "/intake/v2/rum/events", "content_length": 3188, "remote_address": "10.194.16.223", "user-agent": "python-requests/2.23.0", "response_code": 503, "error": "queue is full"}
2020-05-14T11:15:07.020-0400	ERROR	[request]	middleware/log_middleware.go:95	queue is full	{"request_id": "be709fed-8d17-497a-813c-fc58c5d8152c", "method": "POST", "URL": "/intake/v2/events", "content_length": 2044, "remote_address": "10.194.49.55", "user-agent": "python-requests/2.23.0", "response_code": 503, "error": "queue is full"}

However, the APM servers and the ES servers look almost idle...

I tried different configurations for my APM Server (more workers, a larger queue.mem.events, and so on), but I cannot find what the problem is.

Can someone guide me to understand this, please?

Hi, can you please let us know whether any of the data you are sending to APM Server is indexed into Elasticsearch? You have configured a pretty large memory queue (queue.mem.events). The server only starts returning 503 once this queue fills up. My suspicion would be that the server can actually never write to Elasticsearch, and that there might be some connection issue between the APM Server and the ES instances.
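As a rough back-of-the-envelope check of how quickly a queue of that size can fill when the output stalls (the events-per-request figure is an assumption based on the replayed payload described later in the thread):

```python
# Rough estimate of how fast the memory queue fills if ES stops accepting writes.
queue_events = 368640        # queue.mem.events from apm-server.yml
rps_per_node = 600 / 6       # 600 rps spread over 6 APM Server nodes
events_per_request = 70      # assumption: ~20 transactions + ~50 spans per request

events_per_second = rps_per_node * events_per_request   # 7000 events/s per node
seconds_to_fill = queue_events / events_per_second

print(round(seconds_to_fill, 1))  # -> 52.7: the queue fills in under a minute
```

In other words, even a large queue only buys a short grace period; it cannot compensate for an output that is permanently slower than the input.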

Actually, ES does index a lot of documents, up until APM Server starts to fail...
I would like to share some ES error logs, but I haven't found anything relevant...

I ran another test at 300 rps and APM Server works fine, so the problem seems related to the rps.
That would be strange, because the servers look healthy...

I did find some errors in ES:

    [2020-05-14T14:33:12,595][WARN ][o.e.x.w.e.ExecutionService] [monitoring-es-hot-02] failed to execute watch [257ea514-138a-459b-811d-7e4cbe7909af]
    [2020-05-14T14:33:15,328][WARN ][o.e.x.w.e.ExecutionService] [monitoring-es-hot-02] failed to execute watch [bd197006-9253-44b5-bbce-444893175c0b]
    [2020-05-14T14:33:15,740][ERROR][o.e.a.b.TransportBulkAction] [monitoring-es-hot-02] failed to execute pipeline for a bulk request
    org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.ingest.IngestService$3@667a2794 on EsThreadPoolExecutor[name = monitoring-es-hot-02/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1fed8fa[Running, pool size = 4, active threads = 4, queued tasks = 211, completed tasks = 399150]]
    	at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:48) ~[elasticsearch-7.6.2.jar:7.6.2]
    	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825) ~[?:?]
    	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355) ~[?:?]
    	at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:84) [elasticsearch-7.6.2.jar:7.6.2]
    	at org.elasticsearch.ingest.IngestService.executeBulkRequest(IngestService.java:342) [elasticsearch-7.6.2.jar:7.6.2]
    	at org.elasticsearch.action.bulk.TransportBulkAction.processBulkIndexIngestRequest(TransportBulkAction.java:644) [elasticsearch-7.6.2.jar:7.6.2]
    	at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:196) [elasticsearch-7.6.2.jar:7.6.2]
    	at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:97) [elasticsearch-7.6.2.jar:7.6.2]
    	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:153) [elasticsearch-7.6.2.jar:7.6.2]

Any ideas?

The log line with failed to execute pipeline for a bulk request caught my attention. Does the APM Server log also show some errors other than 503 that might indicate that one of the ingest pipelines is causing issues? Which pipelines have you set up?

Yes, I see an error in ES related to ingestion. It sounds like a parsing problem, but it's strange, because the field "labels" is mapped as dynamic:

"labels" : {
"dynamic" : "true",
"properties" : {
  "X-Forwarded-For" : {
    "type" : "keyword"
  },
  "channel" : {
    "type" : "keyword"
  },
  "city_id" : {
    "type" : "keyword"
  },
  ...
}

This is the error:

    [2020-05-15T09:17:00,902][DEBUG][o.e.a.b.TransportShardBulkAction] [monitoring-es-hot-00] [apm-7.6.2-transaction-2020.05.14-000001][0] failed to execute bulk item (create) index {[apm-7.6.2-transaction][_doc][_tB6GHIBuiNKTuY8PM6s], source[n/a, actual length: [2kb], max length: 2kb]}
org.elasticsearch.index.mapper.MapperParsingException: failed to parse field [labels.query_param_origin_city] of type [scaled_float] in document with id '_tB6GHIBuiNKTuY8PM6s'. Preview of field's value: 'CIT_982'
  at org.elasticsearch.index.mapper.FieldMapper.parse(FieldMapper.java:306) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:488) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentParser.parseDynamicValue(DocumentParser.java:837) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:619) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentParser.innerParseObject(DocumentParser.java:427) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrNested(DocumentParser.java:395) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:485) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:505) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentParser.innerParseObject(DocumentParser.java:418) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrNested(DocumentParser.java:395) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentParser.internalParseDocument(DocumentParser.java:112) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:71) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:267) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:793) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:770) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:742) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:267) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:157) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:202) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:114) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:81) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:895) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:109) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.runWithPrimaryShardReference(TransportReplicationAction.java:374) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$doRun$0(TransportReplicationAction.java:297) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.shard.IndexShard.lambda$wrapPrimaryOperationPermitListener$24(IndexShard.java:2791) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.ActionListener$3.onResponse(ActionListener.java:113) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:285) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:237) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2765) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:836) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:293) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.support.replication.TransportReplicationAction.handlePrimaryRequest(TransportReplicationAction.java:256) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257) [x-pack-security-7.6.2.jar:7.6.2]
  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:315) [x-pack-security-7.6.2.jar:7.6.2]
  at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:762) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.6.2.jar:7.6.2]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
  at java.lang.Thread.run(Thread.java:830) [?:?]
Caused by: java.lang.NumberFormatException: For input string: "CIT_982"
  at jdk.internal.math.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2054) ~[?:?]
  at jdk.internal.math.FloatingDecimal.parseDouble(FloatingDecimal.java:110) ~[?:?]
  at java.lang.Double.parseDouble(Double.java:549) ~[?:?]
  at org.elasticsearch.common.xcontent.support.AbstractXContentParser.doubleValue(AbstractXContentParser.java:241) ~[elasticsearch-x-content-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.ScaledFloatFieldMapper.parse(ScaledFloatFieldMapper.java:474) ~[?:?]
  at org.elasticsearch.index.mapper.ScaledFloatFieldMapper.parseCreateField(ScaledFloatFieldMapper.java:396) ~[?:?]
  at org.elasticsearch.index.mapper.FieldMapper.parse(FieldMapper.java:284) ~[elasticsearch-7.6.2.jar:7.6.2]
  ... 45 more

Also, I wonder: how many times does the APM Server retry?
Shouldn't APM Server detect when an event has an invalid format (like in this case) and discard it?
Can I avoid retries completely?

The log shows that the error occurs because labels.query_param_origin_city is expected to be a scaled_float, but the value is CIT_982. This indicates that the first value seen for this field was a float, and since everything within labels is dynamically indexed, the mapping is auto-derived from that first value. All following values need to match this mapping type.
Are you adding the labels in the agent, or have you set up an ingest pipeline to extract the information and add it as labels? In case you are using pipelines, please have a look at how to handle failures in pipelines.
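For reference, an ingest pipeline can catch per-document failures with an on_failure block, so one bad document does not fail the whole bulk request. A minimal sketch of a body for `PUT _ingest/pipeline/my-labels-pipeline` (the pipeline name, processor, and field names are illustrative, not taken from this setup):

```json
{
  "description": "Copy a value into labels; keep documents that fail instead of rejecting them",
  "processors": [
    {
      "set": {
        "field": "labels.channel",
        "value": "{{some_source_field}}"
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "ingest_error",
        "value": "{{ _ingest.on_failure_message }}"
      }
    }
  ]
}
```

With on_failure defined, a document that trips a processor is still indexed (here with the failure message recorded in a field) rather than bouncing the bulk item back to the client.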

How do you test the throughput of APM? I want to run the same test.

APM has a template with a dynamic mapping; however, I created the first index with my own mapping, changing some things.
I think I hit a bug here, or maybe someone can help me understand what happens. This is an example:

1. Create a template:

PUT _template/mytemplate
{
  "order" : 1,
  "index_patterns" : ["mytemplate*"],
  "mappings" : {
    "dynamic_templates" : [
      {
        "labels" : {
          "path_match" : "labels.*",
          "match_mapping_type" : "string",
          "mapping" : {
            "type" : "keyword"
          }
        }
      },
      {
        "labels" : {
          "path_match" : "labels.*",
          "match_mapping_type" : "boolean",
          "mapping" : {
            "type" : "boolean"
          }
        }
      },
      {
        "labels" : {
          "path_match" : "labels.*",
          "match_mapping_type" : "*",
          "mapping" : {
            "type" : "scaled_float",
            "scaling_factor" : 1000000
          }
        }
      }
    ]
  }
}

2. Create an index with a similar custom mapping (almost the same in this case):

PUT mytemplate_index
{
  "mappings": {
    "_source": {
      "enabled": true
    },
    "dynamic_templates": [
      {
        "labels": {
          "path_match": "labels.*",
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"
          }
        }
      },
      {
        "labels": {
          "path_match": "labels.*",
          "match_mapping_type": "boolean",
          "mapping": {
            "type": "boolean"
          }
        }
      },
      {
        "labels": {
          "path_match": "labels.*",
          "match_mapping_type": "*",
          "mapping": {
            "type": "scaled_float",
            "scaling_factor": 1000000
          }
        }
      },
      {
        "myfield": {
          "path_match": "myfield",
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ]
  }
}

3. Get the index; pay attention to the mapping of labels:

GET mytemplate_index

Result:

{
  "mytemplate_index" : {
    "aliases" : { },
    "mappings" : {
      "dynamic_templates" : [
        {
          "labels" : {
            "path_match" : "labels.*",
            "mapping" : {
              "scaling_factor" : 1000000,
              "type" : "scaled_float"
            }
          }
        },
        {
          "myfield" : {
            "path_match" : "myfield",
            "match_mapping_type" : "string",
            "mapping" : {
              "type" : "keyword"
            }
          }
        }
      ]
    },
    "settings" : {
      "index" : {
        "creation_date" : "1589888615574",
        "number_of_shards" : "1",
        "number_of_replicas" : "1",
        "uuid" : "uHo4BCETQqmqvvV6sB5QGA",
        "version" : {
          "created" : "7060299"
        },
        "provided_name" : "myt_index"
      }
    }
  }
}

Something similar happens with the APM template. The first time, I had created the index manually. I suppose an error occurred there, when my custom mapping was merged with the APM template, so the final result was not what I expected.

Once I had fixed the mapping, I saw some errors related to the ingest service. This is what happened:

I was running a test with Locust, sending APM Server a request previously captured from a network trace. Those requests actually contain a lot of transactions (20+), spans (50+) and metrics. The index rate in ES was very high during the test (even higher than I expected), so I think the ingest problems were caused by the high index rate. The APM Server queue then filled up with events waiting to be retried, so APM Server failed too.

Finally, by decreasing the rps, I can run my test without any problem.

In summary, I think the errors were:

  • An invalid mapping, maybe due to an error during index creation (?)
  • A misunderstanding of the test and how many documents it was sending to ES.

Questions:

  • How many times does APM Server retry failed ingestion bulks to ES?
  • Shouldn't APM Server detect when an event has an invalid format (like my first case) and discard it?
  • Can I avoid retries completely?


I captured a request by intercepting traces sent from the APM Java Agent to APM Server.
Then I replayed that request with Locust many times.

Thank you for detailed write up @Emanuel_Velzi. It looks like you might run into an issue where dynamic templates with the same key get overwritten. Could you try to rename the keys to unique values, so they are not removed on template merging?
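For example (a sketch; the key names are illustrative), the three "labels" templates above could be renamed so each has a unique key and none is dropped during merging:

```json
{
  "dynamic_templates": [
    {
      "labels_string": {
        "path_match": "labels.*",
        "match_mapping_type": "string",
        "mapping": { "type": "keyword" }
      }
    },
    {
      "labels_boolean": {
        "path_match": "labels.*",
        "match_mapping_type": "boolean",
        "mapping": { "type": "boolean" }
      }
    },
    {
      "labels_numeric": {
        "path_match": "labels.*",
        "match_mapping_type": "*",
        "mapping": { "type": "scaled_float", "scaling_factor": 1000000 }
      }
    }
  ]
}
```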

You can use the output.elasticsearch.max_retries option to configure how many retry attempts the server should make. However, the default value is set to 3 and your description sounds like there is an endless retry going on. I will try to reproduce this behavior locally.
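In apm-server.yml, that option lives under the Elasticsearch output (a sketch, reusing the hosts from the config above):

```yaml
output.elasticsearch:
  hosts: ["monitoring-es-hot-00", "monitoring-es-hot-01", "monitoring-es-hot-02"]
  # Number of retries per batch after a publish failure; the default is 3.
  # A value less than 0 retries until the events are published.
  max_retries: 3
```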

Yes! But the template was created by APM Server... so maybe that's an issue.

Actually, in the first case, when the mapping was wrong, a lot of requests failed. The test was sending requests with an invalid format to ES all the time. I mean, it's expected that the errors accumulated until ingestion broke.
Maybe identifying when an event has an invalid format and discarding it could be useful...

Thanks for your help!

Yes! But the template was created by APM Server... so maybe that's an issue.

The issue I linked only occurs when mappings are merged. My thinking was that you might run into this because you additionally provided your own mapping, leading to the mappings being merged and only one dynamic template per unique key being kept. But I agree, this is an open APM Server bug.

Maybe identifying when an event has an invalid format and discarding it could be useful...

That's actually more or less what the server does: it interprets the response codes from Elasticsearch and accordingly either drops the event or retries it. I tried to reproduce the retries you mentioned locally, but in my setup invalid events were dropped as expected instead of being retried.

You mentioned above that you send pretty large requests in your tests to the APM Server. There is currently an open bug in libbeat (which is used by APM Server) where infinite retries occur when the batch size of the bulk request to Elasticsearch exceeds a limit. Could this be the issue you are running into? If so, you might want to change the configuration for queue.mem.* to send smaller batches to Elasticsearch.
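A smaller-batch configuration might look something like this (the values are illustrative, not tuned recommendations):

```yaml
queue.mem:
  events: 65536          # smaller in-memory queue
  flush.min_events: 512  # flush smaller batches more often
  flush.timeout: 1s

output.elasticsearch:
  bulk_max_size: 512     # cap the number of documents per bulk request
```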

I think it's actually the same in my case. I guess the problem really was that, in the first case, when the mapping was wrong, a lot of requests failed all the time... so the queue filled up quickly despite the invalid events being discarded.