Response code: 429

pjanzen · December 8, 2016, 7:01pm

Hi All,

I am getting the error below, and I understand from another thread that I am overloading Elasticsearch. I am looking for what to dial down to get rid of this problem while still getting the most performance.

I have 9 LS nodes.

My logstash info is:
Version: 5.0.2 (so is elasticsearch)
logstash.yml:
node.name: logstash2
path.data: /var/lib/logstash
pipeline.workers: 16
pipeline.output.workers: 16
pipeline.batch.size: 400
pipeline.batch.delay: 5
path.config: /etc/logstash/conf.d
http.host: "0.0.0.0"
http.port: 9600-9700
log.level: info
path.logs: /opt/logstash/logs

jvm.options:
-Xms12g
-Xmx12g
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+DisableExplicitGC
-Djava.awt.headless=true
-Dfile.encoding=UTF-8
-XX:+HeapDumpOnOutOfMemoryError

There are similar topics, but none this specific. I hope you can help.

[2016-12-08T19:55:23,743][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$6@3519a421 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@10b9f5a4[Running, pool size = 6, active threads = 6, queued tasks = 54, completed tasks = 1047668]]"})

Thanks,
Paul.

theuntergeek · December 8, 2016, 7:26pm

9 Logstash nodes? How many events per second for each? In aggregate? How big is your Elasticsearch cluster?

A 429 is not the end of the world, since Logstash will keep retrying data that got rejected. But a steady stream of them tends to mean that Elasticsearch can't keep up with the amount of data you're sending. Dialing down in such a case means dropping some data (just not sending it, filtering it out, etc), which may not be what you intend or want.

You may actually want to keep all of that data. If so, solutions include expanding your Elasticsearch cluster.
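To illustrate why a 429 is recoverable, here is a minimal Python sketch of the retry-on-rejection idea. It is not Logstash's actual implementation; `send_with_retry`, `Rejected`, and the delay values are hypothetical names and numbers chosen for illustration only.

```python
import time


class Rejected(Exception):
    """Hypothetical error raised when the server answers 429 (queue full)."""


def send_with_retry(send, batch, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Call send(batch); on a rejection, wait and try again with a doubled delay.

    `send` is any callable that raises Rejected on a 429. The batch is never
    dropped: it is either delivered or the last Rejected is re-raised.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return send(batch)
        except Rejected:
            if attempt == max_attempts:
                raise  # give up only after the final attempt
            sleep(delay)  # back off so the bulk queue can drain
            delay *= 2
```

The point of the sketch is that rejected batches stay with the sender and are resent later, so occasional 429s cost latency rather than data.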

pjanzen · December 8, 2016, 7:33pm

Those 9 LS nodes are legacy: they were converted from kugh Graylog, which needed 9. In the Elasticsearch cluster I have 3 master nodes, 6 data-only nodes (32 GB memory, 16 GB for ES, and 8 CPUs each) and 3 coordinator nodes for heavy searches. I can expand the ES data tier with 3 more nodes, which I can take from Logstash.

So the logstash config I currently have is not so weird then?

theuntergeek · December 8, 2016, 7:48pm

Not at all, if the need is there. I still don't know what kind of documents you're indexing, at what rate, and how many different mappings (or a high count of fields), or any other indicators which would help me with a ballpark guess at how busy your cluster might be. If your Logstash nodes aren't overburdened, then more ES data nodes can be a good thing for your ingest performance.

Also, I would look at setting

hosts => [ "host1", "host2", "host3" ]

in your Elasticsearch output, where hosts 1, 2, and 3 are the coordinator nodes, so that the indexing requests are load balanced (round-robin).

pjanzen · December 8, 2016, 7:56pm

Hi Aaron,

We are indexing exim, dovecot, opendj, apache and cloudmark logs.
For most of them I also do a GeoIP lookup and some CIDR filtering.

Here is the complete conf https://gist.github.com/pjanzen/5ce31ee992a3f28cd7434d2a529fe6d9

I whipped up a small script that reads the LS API, and it told me this:

1506498 processed: 64672 msg last 10s (6467 avg p/s)
1532100 processed: 25602 msg last 10s (2560 avg p/s)
1573331 processed: 41231 msg last 10s (4123 avg p/s)
1640125 processed: 66794 msg last 10s (6679 avg p/s)
1710533 processed: 70408 msg last 10s (7040 avg p/s)
1766988 processed: 56455 msg last 10s (5645 avg p/s)
1807539 processed: 40551 msg last 10s (4055 avg p/s)
1827546 processed: 20007 msg last 10s (2000 avg p/s)
1888750 processed: 61204 msg last 10s (6120 avg p/s)
1921156 processed: 32406 msg last 10s (3240 avg p/s)
1943158 processed: 22002 msg last 10s (2200 avg p/s)
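
For anyone who wants to reproduce output like this, here is a minimal Python sketch of the arithmetic only. It assumes you already have cumulative event counters sampled every 10 seconds (for example from the Logstash monitoring API); `summarize` is a hypothetical helper, not the script used above.

```python
def summarize(totals, interval_s=10):
    """Turn cumulative event counters into one report line per interval."""
    lines = []
    for prev, cur in zip(totals, totals[1:]):
        delta = cur - prev  # events processed during this interval
        lines.append(
            f"{cur} processed: {delta} msg last {interval_s}s "
            f"({delta // interval_s} avg p/s)"
        )
    return lines


# First three counter values from the output above.
print("\n".join(summarize([1506498, 1532100, 1573331])))
```

With those three samples it prints the second and third lines of the listing above.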

theuntergeek · December 8, 2016, 8:01pm

The math is not precise when doing this, but it appears that you are doing around 50,000 events per second. That's actually a pretty decent number. Are your ES nodes on spinning disks?

At any rate, spreading that load across a few more data nodes will help prevent 429s. It could also be that you get burst activity, which would also increase the likelihood of 429s. But if you take 3 nodes from Logstash away, will the others still be able to keep up? That becomes an issue. I don't know more of your filter pipeline, but we may be able to help you optimize it.
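
The ballpark figure can be checked with a few lines of Python. This is only a sketch: it assumes the posted samples come from a single Logstash node and that all 9 nodes handle a similar load, neither of which is confirmed in this thread.

```python
# Per-second averages copied from the script output posted above.
per_second = [6467, 2560, 4123, 6679, 7040, 5645, 4055, 2000, 6120, 3240, 2200]
NODES = 9  # assumption: every node sees roughly the same load

mean_rate = sum(per_second) / len(per_second)
aggregate = mean_rate * NODES

print(f"mean per node: {mean_rate:.0f} events/s")
print(f"aggregate over {NODES} nodes: {aggregate:.0f} events/s")
```

Under those assumptions the aggregate comes out at roughly 41,000 events per second, with single-node bursts of about 7,000 per second.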

pjanzen · December 8, 2016, 8:12pm

The complete cluster is on VMware with NetApp SAN storage under it (I don't know the details). If push comes to shove I can spin up more nodes without any issue. My complete filter is in the gist link I posted.

Talks about licenses and support are underway and should conclude soon. Until then I am relying on IRC, Discuss and Google.

Thanks for the help so far,
Regards,
Paul.

theuntergeek · December 8, 2016, 8:24pm

Those are interesting Redis configuration blocks. Have you seen https://www.elastic.co/blog/just_enough_redis_for_logstash ? The huge batch count and thread count may be limiting performance. The aforementioned blog post may reveal some tweaks you can make.

Awesome use of the new dissect filter, by the way. If I did have to point to something suspicious, it would be any use of GREEDYDATA in grok. That's a very expensive regex to use, and you have it in there a few times. You might want to find if there's any other way to extract the data there.

    geoip {
      source => imp_ip
      target => "geoip"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}"  ]
    }

You already get the [geoip][location] field by default, which is an array of [longitude, latitude]. If you need to have it be named coordinates you can always use a mutate filter to rename [geoip][location] to [geoip][coordinates]. That saves a few steps for you. You won't even have to convert it to a float.

pjanzen · December 8, 2016, 9:23pm

Hi Aaron,

Thanks for the feedback. I have changed the redis input and removed the batch size, and I see an immediate increase in performance.

I am quite pleased with dissect and had great fun with it. I agree about GREEDYDATA; I need to remove it ASAP. It is priority 2 on my task list.

About the GeoIP stuff, how would you write it? (read: I do not understand what you mean and need an example)

Thanks for the help so far.
Paul.

theuntergeek · December 9, 2016, 12:50am

So, with regard to GeoIP, the plugin automatically creates a field called location that is recognized as a geo_point. The add_field lines:

      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}"  ]

...are effectively creating a redundant coordinates field. Chances are good that you have Kibana dashboards which are using coordinates, so you should work to preserve that standard, or slowly switch to using location. To convert/rename the existing location field to a coordinates field, you would use:

    mutate {
      rename => { "[geoip][location]" => "[geoip][coordinates]" }
    }

system · January 6, 2017, 12:50am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
