APM Failed to publish events: temporary bulk send failure / Queue is full 503 error

Hi Juan,

I remember that when we started with APM on version 7.1, I think I had a similar issue, but it was primarily related to a huge number of requests and only one server. Then we added additional APM systems and removed unneeded APM systems, and it has been working with the same or even higher throughput for the last 2-3 months on 7.2 and 7.3.

So yes, I am pretty sure that it started with version 7.4.

I know that this temporary bulk send failure is a common issue that happens on Beats, and because of that it puzzles me even more. I have pretty extensive knowledge of designing and sizing systems, and I see no system metric that would lead me to the conclusion that something is undersized on the ELK side (we have Logstash on two machines handling a much bigger load on the same ELK, and I don't see bulk send failures there).

On the other hand, maybe I am doing something wrong, or maybe APM is not ready for such a load (6 APM servers x 1000-1500 events/s) and I need to add additional APM servers and/or Elasticsearch nodes (processing only).

20 years of experience have taught me that with horizontal scaling I might solve the problem from the user's perspective, but the underlying problem will still be there. And 1500 events/s is not such a huge load.

Load on the servers is never higher than 2 (8-core servers, CPU E5-2690 v4, IBM SVC Tier 1 storage, 40+ GB RAM per node).

Is there something I am missing? For example, some additional statistics or monitoring that would tell me why this problem (queue full) is happening in the first place, and where on the Elasticsearch node the problem is?
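One place that might be worth checking, as a sketch only: Elasticsearch exposes per-node thread pool counters through the node stats API (`GET _nodes/stats/thread_pool`). In 7.x, bulk requests run on the `write` pool, and a growing `rejected` counter there is one possible reason for bulk requests being bounced. The Python below only parses a response that has already been fetched. The helper name `write_pool_pressure`, the node names, and the sample numbers are made up for illustration and are not taken from this cluster.

```python
from typing import Any, Dict, List, Tuple


def write_pool_pressure(node_stats: Dict[str, Any]) -> List[Tuple[str, int, int]]:
    """Return (node_name, queue, rejected) for every node whose write
    thread pool has rejected work, sorted by rejections (highest first).

    `node_stats` is the parsed JSON body of GET _nodes/stats/thread_pool.
    """
    report = []
    for node_id, node in node_stats.get("nodes", {}).items():
        pool = node.get("thread_pool", {}).get("write", {})
        queue = pool.get("queue", 0)
        rejected = pool.get("rejected", 0)
        if rejected > 0:
            # Fall back to the node id if the response carries no name.
            report.append((node.get("name", node_id), queue, rejected))
    return sorted(report, key=lambda row: row[2], reverse=True)


# Made-up sample response, trimmed to the fields used above.
sample = {
    "nodes": {
        "a1": {"name": "es-data-1", "thread_pool": {"write": {"queue": 0, "rejected": 0}}},
        "b2": {"name": "es-data-2", "thread_pool": {"write": {"queue": 200, "rejected": 1375}}},
        "c3": {"name": "es-data-3", "thread_pool": {"write": {"queue": 12, "rejected": 40}}},
    }
}

for name, queue, rejected in write_pool_pressure(sample):
    print(f"{name}: queue={queue} rejected={rejected}")
```

If one node stands out, that would point at uneven shard allocation or a hot node rather than overall undersizing. If no node shows rejections at all, the bottleneck is more likely on the APM Server side.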

Also, I have been reading posts here today, and I see a new post from today, similar to mine, with more or less the same problem.

I also have the problem that I need to restart APM: the process is up, but it does not recover.

Maybe, as a short-term solution, add some restart procedure for when the queue is full.
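A minimal sketch of such a stopgap, assuming the APM Server log is a plain text file and the service is managed by systemd under the name `apm-server` (both are assumptions, and the function names, window, and threshold below are hypothetical). It restarts only when the tail of the log is dominated by the two error messages from the title of this thread.

```python
import subprocess
from collections import deque
from typing import Iterable

# Error texts from the thread title, matched case-insensitively.
ERROR_MARKERS = ("temporary bulk send failure", "queue is full")


def should_restart(lines: Iterable[str], window: int = 200, threshold: int = 50) -> bool:
    """Return True if at least `threshold` of the last `window` lines
    contain one of the error markers."""
    recent = deque(lines, maxlen=window)  # keeps only the last `window` lines
    hits = sum(
        1 for line in recent
        if any(marker in line.lower() for marker in ERROR_MARKERS)
    )
    return hits >= threshold


def restart_if_stuck(log_path: str, service: str = "apm-server") -> bool:
    """Check the log tail and restart the (assumed) systemd service if stuck."""
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        stuck = should_restart(fh)
    if stuck:
        # Assumes systemd; swap in whatever service manager is actually used.
        subprocess.run(["systemctl", "restart", service], check=True)
    return stuck
```

Run from cron or a timer, `restart_if_stuck` would need the real log path passed in. Because it rereads the log tail on every run, old errors can trigger repeated restarts, so a real version would want a cooldown or would track how far it has already read. It hides the symptom and does not explain why the queue fills up.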

example of state when it dies:

Thanks in advance
tomislav