I have a ten-node cluster running entirely on Elastic Stack 5.3 (except Filebeat, which is v5.2.2). I need some tuning recommendations for getting my indexing rate up (>3k/sec for primary shards).
About my config:
Four nodes are Logstash servers; three are master nodes, five are data nodes, two are coordinating nodes, and one is an ingest node, with the following stats:
My full config can be seen at the bottom of this post.
The problem is that this cluster isn't indexing as fast as data is being pushed to it. When I look at which days (based on the log timestamps) are currently being indexed, I see data from over two weeks ago that was only indexed within the last 15 minutes:
When I look in the Logstash logs, I see Elasticsearch responding with error code 429 (bulk rejections), but with this many servers I feel like I shouldn't be hitting this:
[2017-04-14T08:33:33,108][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$7@607660d8 on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@988bbe9[Running, pool size = 32, active threads = 32, queued tasks = 200, completed tasks = 12638848]]"})
[2017-04-14T08:33:33,108][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$7@15d8c417 on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@988bbe9[Running, pool size = 32, active threads = 32, queued tasks = 200, completed tasks = 12638849]]"})
[2017-04-14T08:33:33,108][ERROR][logstash.outputs.elasticsearch] Retrying individual actions
[2017-04-14T08:33:33,108][ERROR][logstash.outputs.elasticsearch] Action
[2017-04-14T08:33:33,108][ERROR][logstash.outputs.elasticsearch] Action
.
.
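In case it helps with diagnosis, a per-node view of the bulk rejections can be pulled from the _cat thread pool API; the host/port below are placeholders for one of my coordinating nodes:

```
# show bulk thread pool pressure per node -- a growing "rejected" count means that node is pushing back
curl -s 'http://localhost:9200/_cat/thread_pool/bulk?v&h=node_name,active,queue,rejected'
```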
If I drop the number of replicas to zero (the usual bulk-indexing tweak, hoping to bring what Monitoring shows back to near-real-time indexing), I don't see any improvement in the primary-shard indexing rate. However, when I re-enable replicas, Monitoring shows my cluster indexing at >12k events/sec:
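For reference, the replica toggle I'm describing is just the index settings API, e.g. something like the following (logstash-* is a placeholder for my actual index pattern):

```
# drop replicas while bulk indexing the backlog (example index pattern)
curl -XPUT 'http://localhost:9200/logstash-*/_settings' -H 'Content-Type: application/json' \
  -d '{ "index": { "number_of_replicas": 0 } }'

# restore replicas once the backlog is caught up
curl -XPUT 'http://localhost:9200/logstash-*/_settings' -H 'Content-Type: application/json' \
  -d '{ "index": { "number_of_replicas": 1 } }'
```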
My questions:
Is there a way to remove all the queued data in Logstash? I imagine the data from previous days is being stored in Logstash queues. This data is no longer valuable and I'd like to simply remove it. The fact that Logstash is using 22GB of memory on just one box makes me think these old events are being held somewhere. Am I able to delete them and just start fresh?
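For concreteness, this is the kind of thing I have in mind (paths are just examples; I'm assuming the default in-memory queue is dropped on restart, and that a persistent queue would need its files removed from wherever path.queue points):

```
# stop Logstash first; with the default in-memory queue this alone discards buffered events
sudo systemctl stop logstash

# only if the persistent queue is enabled: remove the queue files (example path.queue location)
sudo rm -rf /var/lib/logstash/queue/*

sudo systemctl start logstash
```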
What can I do to increase my indexing rate? I'd like to tune what I already have, since this should be plenty of hardware to index >3k events/sec.
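The only Logstash-side knobs I'm aware of are the pipeline workers and batch size (the -w and -b flags, or pipeline.workers / pipeline.batch.size in logstash.yml), e.g. something like the placeholder below, but I'm not sure that helps if Elasticsearch itself is rejecting the bulk requests:

```
# placeholder values, not my production settings; when running as a service these
# would go in logstash.yml as pipeline.workers and pipeline.batch.size
/usr/share/logstash/bin/logstash --path.settings /etc/logstash -w 8 -b 1000
```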
Any overall advice?