Logstash 1.5.0 Performance Regression

Setup: Redis -> 6 Logstash instances (json codec, date filter) -> Elasticsearch

We have 6 Logstash instances pulling from Redis with a batch size of 1,000; the documents are very small (<1 KB). We don't really do any processing on them, other than parsing the date into the @timestamp field. Finally, we output them to Elasticsearch with a flush size of 10,000.

There is one period of the day where we get about 4.5 million of these documents in our Redis queue and Logstash will chew through them. On 1.4.2 we would get 60-80k/second into Elasticsearch on average. On 1.5.0 we get only 40-60k/second into Elasticsearch. Same configuration; the only change is the Logstash version bump.

Any idea what could be going on here?

I can't suggest anything on the LS side, but you should really bring the bulk size to ES down to 3-4k.

Hi,

Before we start looking in Logstash for the slowdown, I'd first like to make sure it isn't related to the template change for the elasticsearch output between 1.4.2 and 1.5.0.

If you are using the provided templates, it would be helpful if you could verify whether the template was updated between the Logstash versions and, if that's the case, I'd be curious to see if it had an impact on performance.

Also, could you share your configuration?

Thanks,
Colin

For the template, are you referring to the Elasticsearch index template for Logstash? If so, I have tweaked it and added some things like doc_values (specifically on float fields), so I am not using the default. It did not change from 1.4.2.
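For reference, a custom template can also be pinned from the elasticsearch output itself so that an upgrade can't quietly replace it. This is a minimal sketch only; the file path is hypothetical, and the template/template_overwrite options should be double-checked against your plugin version's docs:

output {
  elasticsearch {
    host => "[Elasticsearch Host]"
    protocol => "http"
    # load the customized index template from disk (hypothetical path)
    template => "/etc/logstash/templates/qmstiming-template.json"
    # overwrite the installed "logstash" template with the file above
    template_overwrite => true
  }
}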

I realize that the batch and flush settings may seem a bit high, but keep in mind that each of these documents is around 30-50 bytes, and we found this to give the best performance with 1.4.2.

Logstash Config

input {
  redis {
    host => "[IP]"
    threads => 10
    port => 6379
    data_type => "list"
    batch_count => 1000
    key => "qmstiming"
    type => "qmstiming"
    codec => "json"
  }
}

filter {
  date {
    match => ["time", "ISO8601"]
    timezone => "Etc/UTC"
  }
}

output {
  elasticsearch {
    host => "[Elasticsearch Host]"
    protocol => "http"
    flush_size => 10000
  }
}

I think Warkolm might be on to something here. I saw 72k-100k events/second when I reduced my Elasticsearch flush size to 4,000. This is good news, as it is faster than what we were seeing with 1.4.2.
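For reference, that is presumably just the flush_size value in the output block shared above; a sketch of the changed output:

output {
  elasticsearch {
    host => "[Elasticsearch Host]"
    protocol => "http"
    # dropped from 10000 to the suggested ~4k range
    flush_size => 4000
  }
}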

Ah! good news!

Note that the elasticsearch output plugin has had many improvements since 1.4.2, and it would be interesting to see if a very large bulk size has a negative impact on the plugin itself. On the other hand, we don't know the effect of this reduction in bulk size, with your config, on 1.4.2. It would be very interesting to test that and see whether you get the same improvement or not.

Colin

I can test tomorrow with 1.4.2 and an ES output flush size of 4,000. We only see this type of data spike once per day, so I will report back tomorrow morning with the results.

Don't stop at 4k; try reducing it further and see how it goes.

This is pretty interesting. I have 4 ES nodes and they are all pretty similarly specced: 64GB RAM, RAID 0 2TB HDDs, 16 cores.

Today 1.4.2 performed better than 1.5.0 across the whole range of 1.5.0 ES flush sizes. It appears that the 10k and the default 1k ES flush sizes are by far the worst on 1.5.0. 1.4.2 with the 10k ES flush size performed very well.

Version - ES flush size - Node1 / Node2 / Node3 / Node4 (events/sec)
1.4.2 - 10k flush - 20-25k / 20-25k / 15-20k / 15-20k

1.5.0 - 10k flush - 10-15k / 10-15k / 8k / 8k
1.5.0 - 4k flush - 15-20k / 15-20k / 10-15k / 10-15k
1.5.0 - 2k flush - 15-20k / 15-20k / 10-15k / 10-15k
1.5.0 - 1k flush - 10-15k / 10-15k / 8-12k / 8-12k

It appears that tweaking the ES flush size is part of the issue I'm having with 1.5.0 but it seems like there is something else going on here as well.

Drew,

So we are tackling 2 different issues here.

1- possible performance regression between 1.4.2 and 1.5.0
2- bulk size optimization (in 1.5.0)

For (2) there is no magic number; the optimal bulk size depends on your particular setup, and the best way to find it is through empirical tests/benchmarks. As @warkolm suggested, in our experience a 10k bulk size is possibly not optimal, but again, this really depends on your particular setup (document sizes, indexing/sharding strategy, cluster setup, etc.).
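To illustrate, the bulk-related knobs to sweep in such a benchmark live on the output itself. A sketch only; idle_flush_time is shown with a typical default, and both options should be verified against your plugin version's docs:

output {
  elasticsearch {
    host => "[Elasticsearch Host]"
    protocol => "http"
    # candidate bulk sizes to benchmark: 1000, 2000, 4000, 10000
    flush_size => 4000
    # max seconds to wait before flushing a partially filled bulk
    idle_flush_time => 1
  }
}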

For (1), were you able to compare against 1.4.2 + 4k bulk? You also previously mentioned reaching 72k-100k events/s in 1.5.0 using a 4k bulk, which is quite different from your last observations?

I decided to get scientific about it and followed a new procedure rather than looking at the BigDesk numbers, which include other Logstash instances parsing other types and which I can't seem to get an accurate total average from.

I stopped the Logstash instances and noted the time, let Redis build up to 2 million entries in the queue, started the instances, noted the time of the first big jump in index count, then noted the time at which the queue cleared.

From that I took the total number of seconds from the first big jump in index count to the queue clearing, took the total number of inserts from the time Logstash was stopped to the time the queue cleared, and divided inserts by seconds to get events/sec.

In conclusion...1.5.0 is only slightly slower than 1.4.2 in this test.
1.4.2 - 33,688 events/sec
1.5.0 - 31,244 events/sec

I'm going to mark this as essentially closed as the difference is pretty minor (if it's even real). I will continue to monitor 1.5.0 but I'm not sure there is anything wrong with it at this point.

Sorry for the run-around, and I really can't wait for Logstash 2 with the health API :slight_smile:
