We have 6 Logstash instances pulling from Redis with a batch size of 1,000 for very small documents (<1 KB each). We don't really do any processing on them beyond putting each event's date into the @timestamp field. Finally, we output them to Elasticsearch with a bulk size of 10,000.
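For context, a minimal sketch of what such a pipeline might look like, assuming a Redis list input and the date filter; the hosts, list key, and the `event_date` field/format are placeholders, not the actual config:

```
input {
  redis {
    host        => "redis.example.com"   # placeholder host
    data_type   => "list"
    key         => "logstash"            # placeholder list key
    batch_count => 1000                  # pull 1,000 documents per Redis round trip
  }
}

filter {
  date {
    match  => [ "event_date", "ISO8601" ]  # hypothetical source field and format
    target => "@timestamp"
  }
}

output {
  elasticsearch {
    host       => "es.example.com"       # placeholder host
    protocol   => "http"
    flush_size => 10000                  # bulk size of 10,000 as described above
  }
}
```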
There is one period of the day where we get about 4.5 million of these documents in our Redis queue and Logstash chews through them. On 1.4.2 we would average 60-80k/second into Elasticsearch; on 1.5.0 we get only 40-60k/second. Same configuration; the only change is the Logstash version bump.
Before we start looking in Logstash for the slowdown, I'd first like to rule out the template change for the elasticsearch output between 1.4.2 and 1.5.0.
If you are using the provided template, it would be helpful if you could verify whether it changed between the Logstash versions; if it did, I'd be curious to see whether that had an impact on performance.
For the template, are you referring to the Elasticsearch index template for Logstash? If so, I have tweaked it and added some things, like enabling doc values and explicitly mapping float fields, so I am not using the default. It did not change between 1.4.2 and 1.5.0.
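Purely as an illustration (not the actual template from this setup), a fragment along those lines for Elasticsearch 1.x, enabling doc values on an explicitly mapped float field; `logstash-*` is the usual default pattern and `response_time` is a made-up field name:

```json
{
  "template": "logstash-*",
  "mappings": {
    "_default_": {
      "properties": {
        "response_time": {
          "type": "float",
          "doc_values": true
        }
      }
    }
  }
}
```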
I realize that these settings may seem a bit high, but keep in mind that each of these documents is only around 30-50 bytes, and we found this gave the best performance with 1.4.2.
I think Warkolm might be on to something here. I'm seeing 72k-100k events/second after reducing my Elasticsearch flush size to 4,000. This is good news, as it's faster than what we were seeing with 1.4.2.
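For reference, that change is just the `flush_size` setting on the elasticsearch output (a sketch only, with the other options omitted):

```
output {
  elasticsearch {
    # ... same host/protocol settings as before ...
    flush_size => 4000   # reduced from 10000
  }
}
```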
Note that the elasticsearch output plugin has had many improvements since 1.4.2, and it would be interesting to see whether a very large bulk size has a negative impact on the plugin itself. On the other hand, we don't know the effect of this reduced bulk size, with your config, on 1.4.2. It would be very interesting to test that and see whether you get the same improvement or not.
I can test tomorrow with 1.4.2 and an ES output flush size of 4,000. We only see this type of data spike once per day, so I will report back tomorrow morning with the results.
This is pretty interesting. I have 4 ES nodes and they are all pretty similarly specced: 64 GB RAM, 16 cores, and 2 TB HDDs in RAID 0.
Today 1.4.2 performed better than 1.5.0 across every ES flush size I tried on 1.5.0. The 10k and the default 1k flush sizes are by far the worst on 1.5.0, while 1.4.2 with a 10k flush size performed very well.
Indexing rate per ES node (events/sec):

| Version | ES flush size | Node 1 | Node 2 | Node 3 | Node 4 |
|---------|---------------|--------|--------|--------|--------|
| 1.4.2   | 10k           | 20-25k | 20-25k | 15-20k | 15-20k |
| 1.5.0   | 10k           | 10-15k | 10-15k | 8k     | 8k     |
| 1.5.0   | 4k            | 15-20k | 15-20k | 10-15k | 10-15k |
| 1.5.0   | 2k            | 15-20k | 15-20k | 10-15k | 10-15k |
| 1.5.0   | 1k            | 10-15k | 10-15k | 8-12k  | 8-12k  |
It appears that tweaking the ES flush size is part of the issue I'm having with 1.5.0, but it seems like there is something else going on here as well.

I see two separate things at play here:

1- a possible performance regression between 1.4.2 and 1.5.0
2- bulk size optimization (in 1.5.0)
For (2), there is no magic number: the optimal bulk size depends on your particular setup, and the best way to find it is through empirical tests/benchmarks. As @warkolm suggested, in our experience a 10k bulk size is probably not optimal, but again, this really depends on your particular setup (document sizes, indexing/sharding strategy, cluster setup, etc.).
For (1), were you able to compare against 1.4.2 with a 4k bulk size? You also previously mentioned reaching 72k-100k events/s on 1.5.0 with a 4k bulk size, which is quite different from your latest observations.
I decided to get scientific about it and followed a new procedure, rather than relying on BigDesk numbers, which include other Logstash instances parsing other document types and which I can't get an accurate overall average from.
I stopped the Logstash instances and noted the time, let Redis build up to 2 million documents in the queue, started the instances, noted the time of the first big jump in index count, then noted the time at which the queue cleared. That gave me the total number of seconds from the first big jump in index count to the queue clearing, and the total number of inserts from the time Logstash was stopped to when the queue cleared.
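(To make the arithmetic concrete with purely illustrative numbers: if roughly 2,000,000 documents were indexed over about 60 seconds, that works out to 2,000,000 / 60 ≈ 33k events/sec.)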
In conclusion...1.5.0 is only slightly slower than 1.4.2 in this test.
1.4.2 - 33,688 events/sec
1.5.0 - 31,244 events/sec
I'm going to mark this as essentially closed as the difference is pretty minor (if it's even real). I will continue to monitor 1.5.0 but I'm not sure there is anything wrong with it at this point.
Sorry for the run-around, and I really can't wait for Logstash 2 with the health API.