APM 7.6.2: Error 503: Queue is full but server is sleeping

I have a problem with my APM Server.

Kibana version: 7.6.2

Elasticsearch version: 7.6.2

APM Server version: 7.6.2

Original install method (e.g. download page, yum, deb, from source, etc.) and version: download page

I'm running a stress test with Locust, sending 600 requests per second to a cluster of 6 APM Server nodes.
Each request contains more than one event: transactions and spans.
I got them by intercepting some requests from an APM Agent.

APM Server receives those requests and sends the events to an ES cluster, directly to the 3 hot data nodes.
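For reference, a replay like this can be sketched in Python (the payload below is illustrative, not the actual captured request; a real test would replay the NDJSON body captured from the agent):

```python
import json

# A captured intake/v2/events body is NDJSON: one metadata object first,
# then one line per transaction/span/metric event (values here are made up).
events = [
    {"metadata": {"service": {"name": "my-service",
                              "agent": {"name": "java", "version": "1.15.0"}}}},
    {"transaction": {"id": "945254c567a5417e",
                     "trace_id": "0123456789abcdef0123456789abcdef",
                     "name": "GET /api", "type": "request", "duration": 32.5,
                     "span_count": {"started": 1}}},
    {"span": {"id": "0aaaaaaaaaaaaaaa", "transaction_id": "945254c567a5417e",
              "trace_id": "0123456789abcdef0123456789abcdef",
              "parent_id": "945254c567a5417e",
              "name": "SELECT FROM users", "type": "db",
              "start": 1.2, "duration": 3.4}},
]

# One JSON object per line, newline-terminated.
body = "\n".join(json.dumps(e) for e in events) + "\n"

# With python-requests (as the user-agent in the logs below suggests),
# the replay would POST this body to an APM Server, e.g.:
#   requests.post("http://apm-server:8200/intake/v2/events", data=body,
#                 headers={"Content-Type": "application/x-ndjson"})
print(body.count("\n"))  # -> 3, one line per event
```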

This is my apm-server.yml:

apm-server:
  host: "0.0.0.0:8200"
  rum:
    enabled: true
  kibana:
    enabled: true
    host: "monitoring-kibana-00"

queue:
  mem:
    events: 368640
    flush.min_events: 2048
    flush.timeout: 1s


#-------------------------- Elasticsearch output --------------------------
output.elasticsearch:
  hosts: ["monitoring-es-hot-00", "monitoring-es-hot-01", "monitoring-es-hot-02"]
  indices:
    - index: "apm-%{[observer.version]}-sourcemap"
      when.contains:
        processor.event: "sourcemap"
    - index: "apm-%{[observer.version]}-transaction-%{[service.name]}-%{+yyyy.MM.dd}"
      when.contains:
        processor.event: "transaction"
    - index: "apm-%{[observer.version]}-transaction-%{[service.name]}-%{+yyyy.MM.dd}"
      when.contains:
        processor.event: "span"
    - index: "apm-%{[observer.version]}-transaction-%{[service.name]}-%{+yyyy.MM.dd}"
      when.contains:
        processor.event: "error"
    - index: "apm-%{[observer.version]}-metric-%{[service.name]}-%{+yyyy.MM.dd}"
      when.contains:
        processor.event: "metric"
    - index: "apm-%{[observer.version]}-onboarding-%{[service.name]}-%{+yyyy.MM.dd}"
      when.contains:
        processor.event: "onboarding"
  bulk_max_size: 5120
  worker: 36

When I start the test, it runs well up to roughly 600 rps. After a while, I start getting 503 errors.
I see this in my logs:

2020-05-14T11:15:05.102-0400	ERROR	pipeline/output.go:121	Failed to publish events: temporary bulk send failure
2020-05-14T11:15:05.219-0400	ERROR	pipeline/output.go:121	Failed to publish events: temporary bulk send failure
2020-05-14T11:15:05.337-0400	ERROR	pipeline/output.go:121	Failed to publish events: temporary bulk send failure
2020-05-14T11:15:07.005-0400	ERROR	[request]	middleware/log_middleware.go:95	queue is full	{"request_id": "a1229fbd-cb1e-42a2-9733-b1909cfad1a2", "method": "POST", "URL": "/intake/v2/events", "content_length": 4242, "remote_address": "10.194.40.232", "user-agent": "python-requests/2.23.0", "response_code": 503, "error": "queue is full"}
2020-05-14T11:15:07.005-0400	ERROR	[request]	middleware/log_middleware.go:95	queue is full	{"request_id": "72ff2067-1132-4e2c-9582-1ca9d0d20249", "method": "POST", "URL": "/intake/v2/rum/events", "content_length": 3188, "remote_address": "10.194.16.223", "user-agent": "python-requests/2.23.0", "response_code": 503, "error": "queue is full"}
2020-05-14T11:15:07.020-0400	ERROR	[request]	middleware/log_middleware.go:95	queue is full	{"request_id": "be709fed-8d17-497a-813c-fc58c5d8152c", "method": "POST", "URL": "/intake/v2/events", "content_length": 2044, "remote_address": "10.194.49.55", "user-agent": "python-requests/2.23.0", "response_code": 503, "error": "queue is full"}

However, the APM servers and the ES servers look almost idle...

I tried different configurations for my APM Server (more workers, a larger queue.mem.events, and so on), but I cannot find what the problem is.

Can someone guide me to understand this, please?

Hi, can you please let us know whether any of the data you are sending to APM Server is indexed into Elasticsearch? You have configured a pretty large memory queue (queue.mem.events). The server only starts returning 503 once this queue fills up. My suspicion would be that the server can actually never write to Elasticsearch, and that there might be some connection issue between the APM Server and the ES instances.
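As a rough back-of-the-envelope check of how quickly a queue of that size can fill when the output stalls (the events-per-request figure is an assumption based on the replayed payload described later in the thread):

```python
# Rough estimate of how fast the memory queue fills if ES stops accepting writes.
queue_events = 368640        # queue.mem.events from apm-server.yml
rps_per_node = 600 / 6       # 600 rps spread over 6 APM Server nodes
events_per_request = 70      # assumption: ~20 transactions + ~50 spans per request

events_per_second = rps_per_node * events_per_request   # 7000 events/s per node
seconds_to_fill = queue_events / events_per_second

print(round(seconds_to_fill, 1))  # -> 52.7: the queue fills in under a minute
```

In other words, even a large queue only buys a short grace period; it cannot compensate for an output that is permanently slower than the input.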

Actually, ES does index a lot of documents, up until APM Server starts to fail...
I would like to share some ES error logs, but I haven't found anything relevant...

I ran another test at 300 rps and APM Server works fine, so the problem seems related to the rps.
That would be strange, because the servers look healthy...

I did find some errors in ES:

    [2020-05-14T14:33:12,595][WARN ][o.e.x.w.e.ExecutionService] [monitoring-es-hot-02] failed to execute watch [257ea514-138a-459b-811d-7e4cbe7909af]
    [2020-05-14T14:33:15,328][WARN ][o.e.x.w.e.ExecutionService] [monitoring-es-hot-02] failed to execute watch [bd197006-9253-44b5-bbce-444893175c0b]
    [2020-05-14T14:33:15,740][ERROR][o.e.a.b.TransportBulkAction] [monitoring-es-hot-02] failed to execute pipeline for a bulk request
    org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.ingest.IngestService$3@667a2794 on EsThreadPoolExecutor[name = monitoring-es-hot-02/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1fed8fa[Running, pool size = 4, active threads = 4, queued tasks = 211, completed tasks = 399150]]
    	at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:48) ~[elasticsearch-7.6.2.jar:7.6.2]
    	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825) ~[?:?]
    	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355) ~[?:?]
    	at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:84) [elasticsearch-7.6.2.jar:7.6.2]
    	at org.elasticsearch.ingest.IngestService.executeBulkRequest(IngestService.java:342) [elasticsearch-7.6.2.jar:7.6.2]
    	at org.elasticsearch.action.bulk.TransportBulkAction.processBulkIndexIngestRequest(TransportBulkAction.java:644) [elasticsearch-7.6.2.jar:7.6.2]
    	at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:196) [elasticsearch-7.6.2.jar:7.6.2]
    	at org.elasticsearch.action.bulk.TransportBulkAction.doExecute(TransportBulkAction.java:97) [elasticsearch-7.6.2.jar:7.6.2]
    	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:153) [elasticsearch-7.6.2.jar:7.6.2]

Any ideas?

The log line with failed to execute pipeline for a bulk request caught my attention. Does the APM Server log also show some errors other than 503 that might indicate that one of the ingest pipelines is causing issues? Which pipelines have you set up?

Yes, I see an error in ES related to ingestion. It sounds like a parsing problem, but it's strange, because the field "labels" is mapped as dynamic:

"labels" : {
"dynamic" : "true",
"properties" : {
  "X-Forwarded-For" : {
    "type" : "keyword"
  },
  "channel" : {
    "type" : "keyword"
  },
  "city_id" : {
    "type" : "keyword"
  },
  ...
}

This is the error:

    [2020-05-15T09:17:00,902][DEBUG][o.e.a.b.TransportShardBulkAction] [monitoring-es-hot-00] [apm-7.6.2-transaction-2020.05.14-000001][0] failed to execute bulk item (create) index {[apm-7.6.2-transaction][_doc][_tB6GHIBuiNKTuY8PM6s], source[n/a, actual length: [2kb], max length: 2kb]}
org.elasticsearch.index.mapper.MapperParsingException: failed to parse field [labels.query_param_origin_city] of type [scaled_float] in document with id '_tB6GHIBuiNKTuY8PM6s'. Preview of field's value: 'CIT_982'
  at org.elasticsearch.index.mapper.FieldMapper.parse(FieldMapper.java:306) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:488) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentParser.parseDynamicValue(DocumentParser.java:837) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentParser.parseValue(DocumentParser.java:619) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentParser.innerParseObject(DocumentParser.java:427) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrNested(DocumentParser.java:395) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrField(DocumentParser.java:485) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentParser.parseObject(DocumentParser.java:505) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentParser.innerParseObject(DocumentParser.java:418) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentParser.parseObjectOrNested(DocumentParser.java:395) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentParser.internalParseDocument(DocumentParser.java:112) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:71) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:267) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.shard.IndexShard.prepareIndex(IndexShard.java:793) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:770) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:742) ~[elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:267) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:157) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:202) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:114) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:81) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:895) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:109) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.runWithPrimaryShardReference(TransportReplicationAction.java:374) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$doRun$0(TransportReplicationAction.java:297) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.shard.IndexShard.lambda$wrapPrimaryOperationPermitListener$24(IndexShard.java:2791) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.ActionListener$3.onResponse(ActionListener.java:113) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:285) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:237) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2765) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:836) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:293) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.action.support.replication.TransportReplicationAction.handlePrimaryRequest(TransportReplicationAction.java:256) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257) [x-pack-security-7.6.2.jar:7.6.2]
  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:315) [x-pack-security-7.6.2.jar:7.6.2]
  at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:762) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) [elasticsearch-7.6.2.jar:7.6.2]
  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.6.2.jar:7.6.2]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
  at java.lang.Thread.run(Thread.java:830) [?:?]
Caused by: java.lang.NumberFormatException: For input string: "CIT_982"
  at jdk.internal.math.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2054) ~[?:?]
  at jdk.internal.math.FloatingDecimal.parseDouble(FloatingDecimal.java:110) ~[?:?]
  at java.lang.Double.parseDouble(Double.java:549) ~[?:?]
  at org.elasticsearch.common.xcontent.support.AbstractXContentParser.doubleValue(AbstractXContentParser.java:241) ~[elasticsearch-x-content-7.6.2.jar:7.6.2]
  at org.elasticsearch.index.mapper.ScaledFloatFieldMapper.parse(ScaledFloatFieldMapper.java:474) ~[?:?]
  at org.elasticsearch.index.mapper.ScaledFloatFieldMapper.parseCreateField(ScaledFloatFieldMapper.java:396) ~[?:?]
  at org.elasticsearch.index.mapper.FieldMapper.parse(FieldMapper.java:284) ~[elasticsearch-7.6.2.jar:7.6.2]
  ... 45 more

Also, I wonder: how many times does the APM Server retry?
Shouldn't APM Server detect when an event has an invalid format (like in this case) and discard it?
Can I avoid retries completely?

The log shows that the error occurs because labels.query_param_origin_city is expected to be a scaled_float, but the value is CIT_982. This indicates that the first value seen for this field was a float, and since everything within labels is dynamically indexed, the mapping is auto-derived from that first value. All following values need to match this mapping type.
Are you adding the labels in the agent, or have you set up an ingest pipeline to extract the information and add it as labels? In case you are using pipelines, please have a look at how to handle failures in pipelines.
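For reference, an ingest pipeline can catch per-document failures with an on_failure block, so one bad document does not fail the whole bulk request. A minimal sketch of a body for `PUT _ingest/pipeline/my-labels-pipeline` (the pipeline name, processor, and field names are illustrative, not taken from this setup):

```json
{
  "description": "Copy a value into labels; keep documents that fail instead of rejecting them",
  "processors": [
    {
      "set": {
        "field": "labels.channel",
        "value": "{{some_source_field}}"
      }
    }
  ],
  "on_failure": [
    {
      "set": {
        "field": "ingest_error",
        "value": "{{ _ingest.on_failure_message }}"
      }
    }
  ]
}
```

With on_failure defined, a document that trips a processor is still indexed (here with the failure message recorded in a field) rather than bouncing the bulk item back to the client.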

How do you test the throughput of APM? I want to run the same test.

APM has a template with a dynamic mapping; however, I created the first index with my own mapping, changing some things.
I think I hit a bug here, or maybe someone can help me understand what happens. This is an example:

1. Create a template:

PUT _template/mytemplate
{
  "order" : 1,
  "index_patterns" : ["mytemplate*"],
  "mappings" : {
    "dynamic_templates" : [
      {
        "labels" : {
          "path_match" : "labels.*",
          "match_mapping_type" : "string",
          "mapping" : {
            "type" : "keyword"
          }
        }
      },
      {
        "labels" : {
          "path_match" : "labels.*",
          "match_mapping_type" : "boolean",
          "mapping" : {
            "type" : "boolean"
          }
        }
      },
      {
        "labels" : {
          "path_match" : "labels.*",
          "match_mapping_type" : "*",
          "mapping" : {
            "type" : "scaled_float",
            "scaling_factor" : 1000000
          }
        }
      }
    ]
  }
}

2. Create an index with a similar custom mapping (almost the same in this case):

PUT mytemplate_index
{
  "mappings": {
    "_source": {
      "enabled": true
    },
    "dynamic_templates": [
      {
        "labels": {
          "path_match": "labels.*",
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"
          }
        }
      },
      {
        "labels": {
          "path_match": "labels.*",
          "match_mapping_type": "boolean",
          "mapping": {
            "type": "boolean"
          }
        }
      },
      {
        "labels": {
          "path_match": "labels.*",
          "match_mapping_type": "*",
          "mapping": {
            "type": "scaled_float",
            "scaling_factor": 1000000
          }
        }
      },
      {
        "myfield": {
          "path_match": "myfield",
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ]
  }
}

3. Get the index; pay attention to the mapping of labels:

GET mytemplate_index

Result:

{
  "mytemplate_index" : {
    "aliases" : { },
    "mappings" : {
      "dynamic_templates" : [
        {
          "labels" : {
            "path_match" : "labels.*",
            "mapping" : {
              "scaling_factor" : 1000000,
              "type" : "scaled_float"
            }
          }
        },
        {
          "myfield" : {
            "path_match" : "myfield",
            "match_mapping_type" : "string",
            "mapping" : {
              "type" : "keyword"
            }
          }
        }
      ]
    },
    "settings" : {
      "index" : {
        "creation_date" : "1589888615574",
        "number_of_shards" : "1",
        "number_of_replicas" : "1",
        "uuid" : "uHo4BCETQqmqvvV6sB5QGA",
        "version" : {
          "created" : "7060299"
        },
        "provided_name" : "myt_index"
      }
    }
  }
}

Something similar happens with the APM template. The first time, I had created the index manually. I suppose an error occurred there, when my custom mapping was merged with the APM template, so the final result was not what I expected.

Once I had fixed the mapping, I saw some errors related to the ingest service. This is what happened:

I was running a test with Locust, sending APM Server a request previously captured from a network trace. Those requests actually contain a lot of transactions (20+), spans (50+) and metrics. The index rate in ES was very high during the test (even higher than I expected), so I think the ingest problems were caused by the high index rate. The APM Server queue then filled up with events waiting to be retried, so APM Server failed too.

Finally, by decreasing the rps, I can run my test without any problem.

In summary, I think the errors were:

  • An invalid mapping, maybe due to an error during index creation (?)
  • A misunderstanding of the test and how many documents it was sending to ES.

Questions:

  • How many times does APM Server retry failed ingestion bulks to ES?
  • Shouldn't APM Server detect when an event has an invalid format (like my first case) and discard it?
  • Can I avoid retries completely?


I captured a request by intercepting traces sent from the APM Java Agent to APM Server.
Then I replayed that request with Locust many times.

Thank you for detailed write up @Emanuel_Velzi. It looks like you might run into an issue where dynamic templates with the same key get overwritten. Could you try to rename the keys to unique values, so they are not removed on template merging?
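For example (a sketch; the key names are illustrative), the three "labels" templates above could be renamed so each has a unique key and none is dropped during merging:

```json
{
  "dynamic_templates": [
    {
      "labels_string": {
        "path_match": "labels.*",
        "match_mapping_type": "string",
        "mapping": { "type": "keyword" }
      }
    },
    {
      "labels_boolean": {
        "path_match": "labels.*",
        "match_mapping_type": "boolean",
        "mapping": { "type": "boolean" }
      }
    },
    {
      "labels_numeric": {
        "path_match": "labels.*",
        "match_mapping_type": "*",
        "mapping": { "type": "scaled_float", "scaling_factor": 1000000 }
      }
    }
  ]
}
```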

You can use the output.elasticsearch.max_retries option to configure how many retry attempts the server should make. However, the default value is set to 3 and your description sounds like there is an endless retry going on. I will try to reproduce this behavior locally.
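In apm-server.yml, that option lives under the Elasticsearch output (a sketch, reusing the hosts from the config above):

```yaml
output.elasticsearch:
  hosts: ["monitoring-es-hot-00", "monitoring-es-hot-01", "monitoring-es-hot-02"]
  # Number of retries per batch after a publish failure; the default is 3.
  # A value less than 0 retries until the events are published.
  max_retries: 3
```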

Yes! But the template was created by APM Server... so maybe that's an issue.

Actually, in the first case, when the mapping was wrong, a lot of requests failed. The test was sending requests with an invalid format to ES all the time. I mean, it's expected that the errors accumulated until ingestion broke.
Maybe identifying when an event has an invalid format and discarding it could be useful...

Thanks for your help!

Yes! But the template was created by APM Server... so maybe that's an issue.

The issue I linked only occurs when mappings are merged. My thinking was that you might run into this because you additionally provided your own mapping, leading to the mappings being merged and only one dynamic template per unique key being kept. But I agree, this is an open APM Server bug.

Maybe identifying when an event has an invalid format and discarding it could be useful...

That's actually more or less what the server does: it interprets the response codes from Elasticsearch and accordingly either drops the event or retries it. I tried to reproduce the retries you mentioned locally, but in my setup invalid events were dropped as expected instead of being retried.

You mentioned above that you send pretty large requests in your tests to the APM Server. There is currently an open bug in libbeat (which is used by APM Server) where infinite retries occur when the batch size of the bulk request to Elasticsearch exceeds a limit. Could this be the issue you are running into? If so, you might want to change the configuration for queue.mem.* to send smaller batches to Elasticsearch.
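A smaller-batch configuration might look something like this (the values are illustrative, not tuned recommendations):

```yaml
queue.mem:
  events: 65536          # smaller in-memory queue
  flush.min_events: 512  # flush smaller batches more often
  flush.timeout: 1s

output.elasticsearch:
  bulk_max_size: 512     # cap the number of documents per bulk request
```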

I think it's actually the same in my case. I guess the problem really was that, in the first case, when the mapping was wrong, a lot of requests failed all the time... so the queue filled up quickly despite the invalid events being discarded.