Logstash 2.2.0 throughput slower than 2.1.1

I wonder if anyone has noticed a similar problem. I tested upgrading from 2.1.1 to 2.2.0 and immediately got these warnings in the logs:

{:timestamp=>"2016-02-12T01:14:13.718000+0000", :message=>"retrying failed action with response code: 429", :level=>:warn}

Everything else is equal, e.g. the same set of flags and configs is being used. Marvel shows my indexing rate dropped from 55k/s to 40k/s.

Here is my test cluster spec.

31 ES nodes, 64GB RAM (24G heap)
33 dedicated LS nodes, 64GB RAM (2G heap for LS)
ES v2.2.0
LS v2.1.1 (tested upgrade to 2.2.0)
K v4.4.0

The bulk size for the elasticsearch output is 2000. I've tested the default (500), 1000, 1500, 2000, 2500, etc. There isn't much difference when using LS 2.2.0; the indexing rate actually got worse at 500, 1000, and 1500, and stayed the same at 2000 and 2500.
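For reference, the relevant piece of the elasticsearch output looks roughly like this (host names are placeholders, only the first two of the 31 nodes are shown, and flush_size is the value being varied):

output {
  elasticsearch {
    hosts => ["es-host-01:9200", "es-host-02:9200"]   # ... plus the other 29 ES nodes
    flush_size => 2000                                 # the bulk size under test (500-2500)
  }
}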

NOTE: the ONLY change was going from LS 2.1.1 to LS 2.2.0. Everything else remained the same.

Tin

Tuning options for 2.2 differ from those of previous Logstash releases. See this blog post section, which includes:

There are three configurable options in the pipeline, --pipeline-workers, --pipeline-batch-size, and --pipeline-batch-delay. The --pipeline-workers or -w parameter determines how many threads to run for filter and output processing. If you find that events are backing up, or that the CPU is not saturated, consider increasing the value of this parameter to make better use of available processing power. Good results can even be found increasing this number past the number of available processors as these threads may spend significant time in an I/O wait state when writing to external systems. Legal values for this parameter are positive integers.
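For example (illustrative paths and values), these options map onto the Logstash command line like this:

bin/logstash -f /etc/logstash/conf.d/ --pipeline-workers 24 --pipeline-batch-size 250
# --pipeline-workers (-w): threads running the filter and output stages
# --pipeline-batch-size (-b): events each worker pulls from the queue per batch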

Also, with 2.2 the retry logic has been improved.

Elasticsearch Output Enhancements

Bundled with Logstash 2.2 is the newest version of the Elasticsearch output, designed to work optimally with the NG pipeline, while remaining backward compatible with older 2.x Logstash releases. This version has a vastly simplified and more correct algorithm around retry handling. In this new release each batch going through the NG pipeline blocks until all operations that do not produce unrecoverable errors succeed. This means that network connectivity errors, 429s, and 503s are retried indefinitely now. Previously, the plugin retried these a limited number of times, but we’ve decided to err on the side of data preservation here. Additionally, 409 conflict handling is now only retryable via the retry_on_conflict parameter.
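For completeness, conflict retries are now opted into on the output itself; a minimal sketch (not taken from anyone's config in this thread) would look like:

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    action => "update"            # 409 conflicts only arise for update-style actions
    retry_on_conflict => 3        # how many times to retry a version conflict
  }
}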

Which protocol are you using to connect to Elasticsearch? We tend to discourage large batch sizes, as the retries (if they happen) cause larger delays with larger batches. A 429 also indicates that your cluster is receiving more requests than it can keep up with. Are you round-robining your output to sufficiently spread the indexing load?

I am using the elasticsearch http output. Yes, I am round-robining my output to all hosts in the ES cluster (I listed all 31 hosts in the hosts => [] section).

As for batch size, I've tested from not specifying it (which should be the default of 500) all the way to 2500 records per batch. The indexing rate is actually worse at the default, and I still get the 429 errors.

Yep, I saw the changes in flags. For --pipeline-workers, I've tested from not setting it (taking the default of half my cores; I have 12 cores with HT) all the way up to 24. I get the best results at 24.

I also experimented with --pipeline-batch-size, both at the default of 125 and at 250. I get a better indexing rate at 250.

I've also tested various combinations of the above, and the current best rate with 2.2.0 is 24 workers, a batch size of 250, and a bulk size of 2000. This gives me around 40K/s on 2.2.0, compared to 55K/s on 2.1.1.

I'd appreciate any other tuning steps I can try.

I guess I can test 2.1.2 and see if there is any difference.

Tin

@tinle, have you tried going beyond 24 --pipeline-workers? What about beyond 250 for --pipeline-batch-size? I'm presuming you upgraded both Logstash and ES to 2.2 at the same time? Could the difference in indexing performance be attributed there, including the increase in 429 responses?

I've been using ES 2.2 for a while now with LS 2.1.1. It was only yesterday that I tried upgrading LS to 2.2.0 and saw the slowdown in indexing rate.

Let me test with --pipeline-workers larger than 24, and also with larger --pipeline-batch-size.

Tin

I also recommend upgrading to 2.2.1, as there were some memory leaks addressed therein.

OK. I've upgraded to Logstash 2.2.1 and was going to re-run the tests, but I ran into an ES bug. I've opened a ticket with Elastic support on it, and once I've collected all the data support wants, I'll restart the testing.

Tin

@tinle thanks for the feedback, it is much appreciated. I am concerned to hear that you're experiencing a performance regression on 2.2. In our internal tests we have only seen performance improvements. What does your pipeline look like? Can you share your config?

As far as tuning options go, have you tried adjusting your heap size? Running 24 workers * 250 events per batch yields a pretty large in-flight count of ~6k events. You might see some benefit from upping your LS_HEAP_SIZE. Can you profile your GCs with VisualVM or similar?
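For example (a sketch; the right number depends on your event sizes and GC behavior), the heap is raised via the environment before starting Logstash, since 24 workers x 250 events per batch means roughly 6000 events held in memory at once:

export LS_HEAP_SIZE="4g"    # example value, larger than the 2g currently in use
bin/logstash -f /etc/logstash/conf.d/ -w 24 -b 250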

Also, one thing to note with a batch size of 2500: you need to set this in both the elasticsearch output AND the pipeline. I'll open a ticket to fix this bug in the ES output. Currently, if your pipeline batch size is <=500 this isn't an issue, but it is for batch sizes above 500. I opened the following issue: https://github.com/logstash-plugins/logstash-output-elasticsearch/issues/371 . That being said, in our tests we saw performance degrade around the 250 event mark just as you have, but your mileage may vary of course.
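In other words (values here are just an example), a pipeline batch above 500 needs a matching flush_size on the output:

# started with: bin/logstash -w 24 -b 2500 ...
output {
  elasticsearch {
    hosts => ["es-host-01:9200"]   # placeholder
    flush_size => 2500             # set to match the pipeline batch size when it exceeds 500
  }
}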

Additionally, compare system resource utilization: is 2.2 slower while using more CPU, or slower while using less CPU? You may (believe it or not) get even more performance with > 24 workers. If you're I/O bound, adding more workers will better saturate those cores. Don't forget to up your heap size as well!

One last thought: don't explicitly set the number of output workers in 2.2.0. We now automatically scale them to the number of pipeline workers. You can override this in the config file, but you shouldn't have to anymore.

Hi @andrewvc, thank you for the tips. In my LS config I use the default workers for the elasticsearch output, i.e. I do not specify anything. The only thing I change is the -w workers for LS.

Looks like the ES bug I ran into has been pinpointed and will be fixed in the ES 5.0 release, so I have restarted the testing. I have updated my cluster from Logstash v2.1.1 to v2.1.3. I have also updated the ES config to what Elastic support suggested. So I am going to let the cluster run for 12hrs, long enough to get consistent data. I will test what happens tomorrow when I update to Logstash v2.2.2.

Tin

I hit the same problem too, after upgrading to LS 2.2.2-1 from LS 2.1. With the same configuration as in LS 2.1, the throughput dropped badly. I tried increasing pipeline-workers from 4 to 8, but it was still worse than LS 2.1.

I did some testing. LS 2.2.2's throughput can't reach the same level as LS 2.1 when using filebeat as the input, but the performance increased for the file input.
I am using filebeat 1.1.1 on Linux. Is this a bug in the LS 2.2.2 beats plugin?
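For context, the two input configurations being compared look roughly like this (port and path are placeholders):

input {
  beats { port => 5044 }                    # filebeat 1.1.1 shipping to this listener
}
# versus
input {
  file { path => "/var/log/app/*.log" }     # reading the same logs from local files
}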

Update: it seems that LS 2.2.x uses more file handles, and the elasticsearch output plugin does not distribute connections evenly across hosts.

I see the problem in LS 2.2.x and later. Short recap of my test cluster:

31 ES hosts
33 LS hosts

The LS config is the same for LS 2.1.3 and LS 2.2.x, using the elasticsearch output plugin. All ES hosts are listed in the hosts => [ ".." ... ] stanza.

Using LS 2.1.3, I see the open FH count for the first two hosts at 4K each; the other 29 hosts are at 3K. This stays constant over 24hrs.

I upgraded one LS to 2.2.2, and instantly the open FH count for the first two hosts jumped to 6K each; the rest of the ES hosts stayed constant at 3K.

What happened before (my ES cluster crashed) is that when I upgraded all 33 LS hosts to 2.2.2, the first two ES hosts ran out of FHs (the limit is 64K).

That seems to imply that LS 2.2.x uses more FHs while running, and that the first two hosts listed in its elasticsearch output get the brunt of the connections.
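For anyone wanting to check this on their own nodes, one illustrative way to count open FHs for the ES process (assuming a Linux host where pgrep can find the Elasticsearch java process) is:

ls /proc/$(pgrep -f Elasticsearch | head -1)/fd | wc -l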

I've sent this and additional data to the support ticket.

Hmmm, it is the case that by default it will spawn the same number of output workers as pipeline workers, which necessarily uses more connections. It should be the same number of connections as before if you manually specify the number of output workers in your config. Have you tried that?
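A minimal sketch of what that override looks like (hosts are placeholders):

output {
  elasticsearch {
    hosts => ["es-host-01:9200", "es-host-02:9200"]
    workers => 2    # pin the output workers instead of letting them follow the pipeline worker count
  }
}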

Next week we will begin work on a refactor that uses connections much more efficiently, but I don't have a firm timeline for a release.

I never set workers in my output section.

Interesting... with LS 2.2.2, setting workers to 1 helped reduce the number of open FHs used on my ES nodes. However, the first two ES nodes listed in the hosts => [] section still get much, much more connections than the rest. This means I have to be careful and keep the output workers low to prevent running out of FHs on those two nodes.

Note: I always list all of the ES nodes in my cluster in the LS config, and do not use sniffing.

Experimenting with the number of workers in the output section, I find that up to 2 is OK; more than that and I start getting 429 errors in my LS log.

It still seems that LS 2.2.2 causes the index rate to be lower than with LS 2.1.3 (based on Marvel monitoring). I am going to experiment with LS 2.2.2 and the workers setting in the output section for a bit to see if I can find an optimum setting.

Tin

Keeping everything else the same (see my configs above), the only changes are upgrading from LS 2.1.3 to LS 2.2.2 plus the following:

Increased the LS 2.2.2 max heap from 2G to 3G, reduced flush_size to 2000, and experimented with various workers => values.

The max I can use is 3 workers in my output before I start getting 429 errors, so I am using 2 at the moment. My indexing throughput with LS 2.2.2 is much worse than with LS 2.1.3: approximately a 20% drop in indexing throughput as shown by Marvel.

This is pretty bad. I am reverting back to LS 2.1.3 for now.

Well, you need to change /etc/sysconfig/logstash,

and under LS_OPTS=""

add -w <number of CPUs> and -b <batch size>.

Also change your heap size as well.

In my case I have 3 VMs with 32 vCPUs and 16 GB RAM each, and I process close to 20K per second:

LS_OPTS="-w 32 -b 1000"
LS_HEAP_SIZE="8g"

Depending on your message size, choose your batch size and heap size. You may get a warning like this:

CAUTION: Recommended inflight events max exceeded! Logstash will run with up to 48000 events in memory in your current configuration. If your message sizes are large this may cause instability with the default heap size. Please consider setting a non-standard heap size, changing the batch size (currently 1500), or changing the number of pipeline workers (currently 32)", :level=>:warn}
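(That 48000 figure is simply pipeline workers times batch size: 32 x 1500 = 48000 events potentially in flight.)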

Also, you can use something like htop to check what is going on with the CPUs and memory.

Gabe

@Gabriel_Rosca see the entire thread above ^^ :slightly_smiling:

I have -w and -b set already. The bug is in the elasticsearch output; see @andrewvc's comment above about how the output by default spawns the same number of output workers as pipeline workers.

I see that Elastic closed one bug, but it looks like a related bug, where the first two hosts listed use up more file handles, is still there.

I am seeing the FH bug in the 2.1.x through 2.2.x series (all versions inclusive).