I'm looking for suggestions on how best to prevent Logstash from dropping messages in the Elasticsearch output plugin.
We run a fairly large ES (1.7) cluster fed exclusively by LS (1.5.4) -- on average we process 30k messages/s. When the ES cluster is under stress (assigning shards, relocating shards, or simply absorbing a large burst of incoming data), I regularly see 429 (too many requests) responses. In most cases the default LS retry strategy and settings function well, and all messages are eventually processed. However, in more extreme cases the LS logs indicate that too many attempts have been made to send an event (max_retries) and it is dropped.
In this "dropping" scenario, I'd like the internal LS pipeline to be saturated, so processing of new messages effectively halts. So, my questions are:
- When the retry queue reaches capacity (retry_max_items), does this block the ES output plugin from accepting new messages? If so, would settings like the following (sketched in the config after this list) be a good approach to ensure, as much as possible, that all messages are processed?
  - retry_max_interval -> high interval (say 60 seconds); I don't want to stress the server with repeated retries
  - retry_max_items -> low max items (say 100); I want to throttle ASAP
  - max_retries -> high max retries (say 100); I really want the data to go through
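For reference, this is roughly what I have in mind for the output block (host names and protocol here are placeholders; the three retry settings are the ones discussed above):

```
output {
  elasticsearch {
    host               => ["es-node-1", "es-node-2"]  # placeholder hosts
    protocol           => "http"
    retry_max_interval => 60     # back off up to 60s between retry flushes
    retry_max_items    => 100    # small retry queue, so the pipeline blocks sooner
    max_retries        => 100    # keep retrying rather than dropping the event
  }
}
```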
- Is there a better approach? My intent is effectively to back off LS message processing and give ES a break while it recovers.
I know there are a lot of factors involved here, so I'll happily elaborate.