[ElasticAgent] failed to publish events: 429 Too Many Requests

Hi, I have several ELK clusters deployed on Elastic Cloud via the Azure integration. We decided to start ingesting Azure logs through Elastic Agent so we can configure alerts on the resources we need. In the sandbox environment I deployed an agent and configured the Azure Logs integration with an event hub plus diagnostic settings; logs are displayed successfully and there are no issues. When I tried to set up the same solution in a more heavily loaded environment, I ran into a 429 status code. So far I have not found any resource spikes in that environment, and we are using autoscaling. I also tried the rate_limit processor with limit=100/m, which allowed a small portion of the logs to reach Elasticsearch, but most of them still fail with a 429 error.
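For reference, the rate limiting mentioned above was done with the Beats rate_limit processor; the exact placement in the policy is from my setup, but the shape is roughly this sketch:

```yaml
# Sketch of the rate_limit processor described above (limit value from my test).
processors:
  - rate_limit:
      limit: "100/m"   # drop events beyond 100 per minute
```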
ELK and agent version: 8.13.2
Full error log below:

{"log.level":"error","@timestamp":"2024-09-11T11:38:27.919Z","message":"failed to publish events: 429 Too Many Requests: {\"error\":{\"root_cause\":[{\"type\":\"es_rejected_execution_exception\",\"reason\":\"rejected execution of coordinating operation [coordinating_and_primary_bytes=107243174, replica_bytes=0, all_bytes=107243174, coordinating_operation_bytes=6544577, max_coordinating_and_primary_bytes=107374182]\"}],\"type\":\"es_rejected_execution_exception\",\"reason\":\"rejected execution of coordinating operation [coordinating_and_primary_bytes=107243174, replica_bytes=0, all_bytes=107243174, coordinating_operation_bytes=6544577, max_coordinating_and_primary_bytes=107374182]\"},\"status\":429}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"azure-eventhub-default","type":"azure-eventhub"},"log":{"source":"azure-eventhub-default"},"log.origin":{"file.line":174,"file.name":"pipeline/client_worker.go","function":"github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*netClientWorker).publishBatch"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"publisher_pipeline_output","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-09-11T11:38:28.657Z","message":"failed to perform any bulk index operations: 429 Too Many Requests: {\"error\":{\"root_cause\":[{\"type\":\"es_rejected_execution_exception\",\"reason\":\"rejected execution of coordinating operation [coordinating_and_primary_bytes=107067756, replica_bytes=0, all_bytes=107067756, coordinating_operation_bytes=6544577, max_coordinating_and_primary_bytes=107374182]\"}],\"type\":\"es_rejected_execution_exception\",\"reason\":\"rejected execution of coordinating operation [coordinating_and_primary_bytes=107067756, replica_bytes=0, all_bytes=107067756, coordinating_operation_bytes=6544577, max_coordinating_and_primary_bytes=107374182]\"},\"status\":429}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"azure-eventhub-default","type":"azure-eventhub"},"log":{"source":"azure-eventhub-default"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"elasticsearch","log.origin":{"file.line":262,"file.name":"elasticsearch/client.go","function":"github.com/elastic/beats/v7/libbeat/outputs/elasticsearch.(*Client).publishEvents"},"ecs.version":"1.6.0"}
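For anyone hitting the same thing: as far as I understand it, this es_rejected_execution_exception comes from Elasticsearch's indexing-pressure protection, which rejects coordinating/bulk operations once in-flight indexing bytes exceed indexing_pressure.memory.limit, 10% of the JVM heap by default. The max_coordinating_and_primary_bytes values in these logs line up with 10% of a 1 GiB heap (and, in the later tests, 2 GiB and 7.5 GiB heaps):

```python
# Assumption: indexing_pressure.memory.limit defaults to 10% of the JVM heap.
GIB = 1024 ** 3

def indexing_pressure_limit(heap_gib: float) -> int:
    """Bytes of in-flight indexing allowed before 429 rejections start."""
    return int(heap_gib * GIB * 0.10)

print(indexing_pressure_limit(1))    # 107374182  -> matches the logs above
print(indexing_pressure_limit(2))    # 214748364  -> the 2 GB coordinating-node test
print(indexing_pressure_limit(7.5))  # 805306368  -> the test without coordinating nodes
```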

I tested with and without a coordinating node; here is the output:

  1. With a coordinating node (2 GB or 4 GB):
{"log.level":"error","@timestamp":"2024-09-12T11:33:55.924Z","message":"failed to publish events: 429 Too Many Requests: {\"error\":{\"root_cause\":[{\"type\":\"es_rejected_execution_exception\",\"reason\":\"rejected execution of coordinating operation [coordinating_and_primary_bytes=214019326, replica_bytes=0, all_bytes=214019326, coordinating_operation_bytes=5949426, max_coordinating_and_primary_bytes=214748364]\"}],\"type\":\"es_rejected_execution_exception\",\"reason\":\"rejected execution of coordinating operation [coordinating_and_primary_bytes=214019326, replica_bytes=0, all_bytes=214019326, coordinating_operation_bytes=5949426, max_coordinating_and_primary_bytes=214748364]\"},\"status\":429}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"azure-eventhub-default","type":"azure-eventhub"},"log":{"source":"azure-eventhub-default"},"log.logger":"publisher_pipeline_output","log.origin":{"file.line":174,"file.name":"pipeline/client_worker.go","function":"github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*netClientWorker).publishBatch"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
  2. Without coordinating nodes (works better, but data-reject errors still occur):
{"log.level":"error","@timestamp":"2024-09-12T11:54:17.578Z","message":"failed to publish events: 429 Too Many Requests: {\"error\":{\"root_cause\":[{\"type\":\"es_rejected_execution_exception\",\"reason\":\"rejected execution of coordinating operation [coordinating_and_primary_bytes=799758117, replica_bytes=594367, all_bytes=800352484, coordinating_operation_bytes=6404471, max_coordinating_and_primary_bytes=805306368]\"}],\"type\":\"es_rejected_execution_exception\",\"reason\":\"rejected execution of coordinating operation [coordinating_and_primary_bytes=799758117, replica_bytes=594367, all_bytes=800352484, coordinating_operation_bytes=6404471, max_coordinating_and_primary_bytes=805306368]\"},\"status\":429}","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"azure-eventhub-default","type":"azure-eventhub"},"log":{"source":"azure-eventhub-default"},"log.logger":"publisher_pipeline_output","log.origin":{"file.line":174,"file.name":"pipeline/client_worker.go","function":"github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*netClientWorker).publishBatch"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
or
{"log.level":"error","@timestamp":"2024-09-12T12:21:20.214Z","message":"failed to publish events: temporary bulk send failure","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"azure-eventhub-default","type":"azure-eventhub"},"log":{"source":"azure-eventhub-default"},"log.logger":"publisher_pipeline_output","log.origin":{"file.line":174,"file.name":"pipeline/client_worker.go","function":"github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*netClientWorker).publishBatch"},"service.name":"filebeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
or
{"log.level":"error","@timestamp":"2024-09-12T12:19:08.074Z","message":"failed to perform any bulk index operations: Post \"https://${EC_DEPLOYMENT_ID}:443/_bulk?filter_path=errors%2Citems.%2A.error%2Citems.%2A.status\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"azure-eventhub-default","type":"azure-eventhub"},"log":{"source":"azure-eventhub-default"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"elasticsearch","log.origin":{"file.line":262,"file.name":"elasticsearch/client.go","function":"github.com/elastic/beats/v7/libbeat/outputs/elasticsearch.(*Client).publishEvents"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-09-12T12:19:09.106Z","message":"failed to publish events: Post \"https://${EC_DEPLOYMENT_ID}:443/_bulk?filter_path=errors%2Citems.%2A.error%2Citems.%2A.status\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)","component":{"binary":"filebeat","dataset":"elastic_agent.filebeat","id":"azure-eventhub-default","type":"azure-eventhub"},"log":{"source":"azure-eventhub-default"},"service.name":"filebeat","ecs.version":"1.6.0","log.logger":"publisher_pipeline_output","log.origin":{"file.line":174,"file.name":"pipeline/client_worker.go","function":"github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*netClientWorker).publishBatch"},"ecs.version":"1.6.0"}
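The Client.Timeout errors look like the agent giving up while the cluster is busy. I am not certain every Beats output setting is honored in the Fleet advanced-YAML box, but assuming the elasticsearch output `timeout` setting (default 90s) is passed through, the request timeout could be raised so slow bulk responses are not cancelled:

```yaml
# Assumption: the Beats elasticsearch output `timeout` setting is passed
# through from the Fleet advanced YAML configuration.
timeout: 180   # wait up to 3 minutes for a bulk response before cancelling
```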

Which version of Elasticsearch are you using? What is the size in terms of nodes and resources of your cluster? How many indices are you actively indexing into?

Hi @Christian_Dahlqvist, I'm using Elasticsearch 8.13.2.

  • Autoscaling enabled
  1. Elasticsearch - 2 nodes
    Hot data:
    Current: 525 GB storage | 15 GB RAM | 1.9 vCPU
    Max: 1.03 TB storage | 30 GB RAM | 3.9 vCPU
    Total (size x zone): 1.03 TB storage | 30 GB RAM | 3.8 vCPU

Warm data:
Current: 400 GB storage | 2 GB RAM | Up to 2.1 vCPU
Max: 800 GB storage | 4 GB RAM | Up to 2.1 vCPU

Cold data:
Current: 0 MB storage | 0 MB RAM | Up to 0 vCPU
Max: 400 GB storage | 2 GB RAM | Up to 2.1 vCPU

Frozen data:
Current: 0 MB storage | 0 MB RAM | Up to 0 vCPU
Max: 6.25 TB storage | 4 GB RAM | Up to 2.1 vCPU

There are ~740 indices, but I'm not sure all of them are actively being indexed into. Could you advise how I can check which indices are actively indexing?
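One way I ended up checking which indices are actively indexing (a sketch, not an official recipe): take two snapshots of GET _all/_stats?filter_path=indices.*.primaries.indexing.index_total about a minute apart and diff index_total per index. The diffing logic, with made-up sample snapshots standing in for the real API responses:

```python
# Sketch: find actively-indexing indices by diffing two `GET _all/_stats`
# snapshots taken ~1 minute apart. The snapshots below are made-up stand-ins
# for real filter_path=indices.*.primaries.indexing.index_total responses.
def active_indices(before: dict, after: dict) -> dict:
    """Return {index_name: docs_indexed_between_snapshots} for nonzero deltas."""
    deltas = {}
    for name, stats in after["indices"].items():
        now = stats["primaries"]["indexing"]["index_total"]
        prev = (before["indices"].get(name, {}).get("primaries", {})
                .get("indexing", {}).get("index_total", 0))
        if now - prev > 0:
            deltas[name] = now - prev
    return deltas

snap1 = {"indices": {
    "logs-azure.eventhub-default": {"primaries": {"indexing": {"index_total": 100}}},
    "logs-system.syslog-default":  {"primaries": {"indexing": {"index_total": 50}}},
}}
snap2 = {"indices": {
    "logs-azure.eventhub-default": {"primaries": {"indexing": {"index_total": 950}}},
    "logs-system.syslog-default":  {"primaries": {"indexing": {"index_total": 50}}},
}}
print(active_indices(snap1, snap2))  # {'logs-azure.eventhub-default': 850}
```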

For the Azure Logs integration, I see that it uses a data stream and creates one index.

I tried to tune the output config according to this doc: Elasticsearch output settings | Fleet and Elastic Agent Guide [8.15] | Elastic.
It seems to help a bit, but I still observe the errors above from time to time.

My output config:

bulk_max_size: 4096
worker: 1
queue.mem.events: 8192
queue.mem.flush.min_events: 4096
queue.mem.flush.timeout: 5s
compression_level: 1
connection_idle_timeout: 15s
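Since the rejections are based on total in-flight bytes on the Elasticsearch side (roughly 10% of heap), my current thinking is to keep each bulk request small and let the agent's queue absorb event-hub bursts instead of pushing them at Elasticsearch. Not a verified fix, just the direction I'm experimenting in; all values below are guesses to tune against your own event sizes:

```yaml
# Experimental values (assumptions, not recommendations): smaller bulks so one
# request consumes less of the ~100 MB indexing-pressure budget; a deeper
# memory queue to buffer bursts.
bulk_max_size: 1600
worker: 2
queue.mem.events: 12800
queue.mem.flush.min_events: 1600
queue.mem.flush.timeout: 10s
compression_level: 1
connection_idle_timeout: 15s
```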