HELP... How to increase bulk index rate performance - 15-node Elasticsearch cluster (error 429)

Need Help!

How do I increase the bulk index rate/throughput to my cluster? Elasticsearch keeps throttling the load, returning 429 errors that show up in the Logstash debug logs. What settings do I need to tweak in either Elasticsearch or Logstash to get better throughput? I am using 2 beefy Logstash nodes reading archived Windows Event Logs (EVTX exported to JSON files) and archived Bro logs (text) from disk. I also have Filebeat sending Bro logs from 2 live sensors into the cluster. The best I have been able to get is 23,000 primary-shard index operations per second. I currently have 5 primary shards with 1 replica per index, and I am indexing into daily indices based on the event time of the logs.
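
For reference, the daily indices pick up their shard settings from an index template along these lines (the template name, index pattern, and host here are just placeholders for this post):

curl -XPUT 'http://localhost:9200/_template/daily-logs' -H 'Content-Type: application/json' -d'
{
  "template": "logs-*",
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}'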

I have tried a lot of things. I tweaked the Logstash configs and the Elasticsearch settings in the .yml files with no change. The only positive change I got was when I added the following under the elasticsearch output plugin:
output {
  elasticsearch {
    flush_size => 10000
    pool_max => 5000
    pool_max_per_route => 2500
    retry_initial_interval => 10
    timeout => 120
  }
}

Here are some details about my cluster:
Elastic Stack 5.0; 15 nodes: 12 data nodes, 2 coordinating nodes, and 1 active master (3 master-eligible nodes).
Hardware: on average, each data node has 50 GB of RAM, 3-8 TB of SSD, and 11-16 CPU cores. I have 2 Logstash nodes, each with 16 CPU cores, 64 GB of RAM, and 800 TB of SSD.

Hi @jeriel20,

the Definitive Guide has a few tips to increase indexing performance.

Daniel

Thanks. Unfortunately, I have already read the Definitive Guide and implemented settings based on its recommendations. What I did tweak today was http.pipelining.max_events and the bulk thread pool size. I also dialed the bulk requests back to 15 MB (down from 20 MB) per bulk request. Hopefully these changes increase performance. Thanks
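
For anyone following along, those two are static node settings, so they went into elasticsearch.yml on each data node and need a restart to take effect. The values below are only illustrative, not necessarily what I ended up with:

# elasticsearch.yml (static node settings, restart required)
http.pipelining.max_events: 10000   # HTTP pipelining queue length
thread_pool.bulk.queue_size: 200    # bulk requests queued per node before ES answers with 429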

Hi @jeriel20,

great that you've already checked the Definitive Guide. As you describe it, Logstash does not seem to be your problem because it is Elasticsearch that's sending the HTTP 429s. While you should definitely vary the bulk size to find a sweet spot, you should also check which resource you're actually bottlenecked on. You can use the usual suspects here (iostat, top) and look in Marvel for ES-specific metrics.
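
One quick way to see the back-pressure directly is the thread pool cat API; if the rejected column keeps growing on the data nodes, their bulk queues are full. The host below is just a placeholder:

curl 'http://localhost:9200/_cat/thread_pool/bulk?v&h=node_name,name,active,queue,rejected,completed'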

A few (random) things to watch out for:

  • Check the refresh interval of your indices and set it as high as you can afford (there is a short sketch of this and the next two points right after this list).
  • Also check your translog settings, but be aware that this can have an impact on the safety of your data.
  • Check your mappings. Maybe you don't need the _all field, for example?
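
A minimal sketch of what that could look like, assuming a daily index called logs-2016.11.15 and a template named logs-mappings (the names, host, and values are only examples; the async translog trades some durability for speed):

# Relax refresh and translog settings on an existing index (dynamic index settings)
curl -XPUT 'http://localhost:9200/logs-2016.11.15/_settings' -H 'Content-Type: application/json' -d'
{
  "index.refresh_interval": "30s",
  "index.translog.durability": "async"
}'

# Disable the _all field for future daily indices via a template
curl -XPUT 'http://localhost:9200/_template/logs-mappings' -H 'Content-Type: application/json' -d'
{
  "template": "logs-*",
  "mappings": {
    "_default_": {
      "_all": { "enabled": false }
    }
  }
}'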

Process-wise, I'd try to find out where the bottleneck is, think about the required changes, and then apply them one at a time, reverting any change that doesn't help.

I hope that helps you improve your indexing performance.

Daniel

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.