Very slow processing of CDN logs

I've configured logstash to parse CDN-uploaded logs and feed them into Graphite. Things work, but very slowly -- a log-file with 278K lines takes 6.5 minutes to process.

The development instance of logstash copies production's settings, but has none of the production's log-sources active. When it was working on the sample log, I fired up top(1) and saw the following:

  PID USER      PR  NI    VIRT    RES  SHR S  %CPU %MEM    TIME+ COMMAND
16289 logstash  39  19 3537696 322376 1416 S 175.5  2.6 18:24.52 java
25976 _graphi+  20   0  143296  19644 1272 R  13.6  0.2  2:55.38 carbon-cache

Our logstash version is 2.3.4 -- the latest available for Ubuntu at this time. It is using the following JRuby:

jruby 1.7.23 (1.9.3p551) 2015-11-24 f496dd5 on OpenJDK 64-Bit Server VM 1.7.0_131-b00 +jit [linux-amd64]

Is logstash simply slow -- a price one has to pay for its versatility -- or is this result too slow and we are doing something stupid? My suspects are:

  • Multiple "hairy" match-lines -- the CDN's log-entries can come in slightly different formats, so we list four patterns (an illustrative sketch of such a multi-pattern grok block follows the output snippet below)

  • Conditional outputs -- because we want to have only one logstash-daemon running, we set tags in the inputs:

      input {
              unix {
                  path => "/tmp/logstash-{{ logsource.name }}.sock"
                  tags => ["{{ logsource.name }}", "{{ logsource.type }}"]
                  type => "cdn"
              }
              ...

and then check for the tag in the filter and output sections for each log-source definition:

output {
	if "{{ logsource.name }}" in [tags] {
		graphite {
			host => "service-graphite"
			port => 2003
...

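To illustrate the first point, the multi-pattern block looks roughly like this (a simplified sketch with made-up field names and only two patterns, not our actual four match-lines):

    filter {
      if "{{ logsource.name }}" in [tags] {
        grok {
          # Patterns are tried in order and grok stops at the first match,
          # so the most common log format should come first. Anchoring each
          # pattern with ^ keeps a failed pattern from re-scanning the line.
          match => {
            "message" => [
              "^%{TIMESTAMP_ISO8601:ts} %{IPORHOST:client_ip} %{WORD:verb} %{URIPATHPARAM:request} %{NUMBER:status} %{NUMBER:bytes}",
              "^%{TIMESTAMP_ISO8601:ts} %{IPORHOST:client_ip} %{IPORHOST:edge_ip} %{WORD:verb} %{URIPATHPARAM:request} %{NUMBER:status} %{NUMBER:bytes}"
            ]
          }
        }
      }
    }
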
If the processing throughput I quoted (only 700 lines/second) is too low, how do I identify the bottleneck and widen it? Thanks!

Our logstash version is 2.3.4 -- the latest available for Ubuntu at this time.

What do you mean? Logstash isn't part of the official Ubuntu distribution; Elastic publishes Ubuntu-compatible Debian packages for each Logstash release.
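
For reference, 2.x is installed from Elastic's own apt repository; from memory the setup looked roughly like this (verify the URLs against Elastic's current installation docs before using them):

    wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
    echo "deb https://packages.elastic.co/logstash/2.3/debian stable main" | \
        sudo tee /etc/apt/sources.list.d/logstash-2.3.x.list
    sudo apt-get update && sudo apt-get install logstash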

700 events/s isn't too bad if you have a lot of filters, depending on what kind of machine you have. Are you saturating the CPUs? If not, you can try increasing the number of pipeline workers. Some filters (like dns) add latency more than they add load, so overallocating the pipeline workers makes great sense.
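
On 2.x the workers and batch size are command-line flags rather than config-file settings, so on a package install you would typically set them along these lines (the numbers here are only a starting point to experiment with):

    # /etc/default/logstash (Debian/Ubuntu package install)
    # -w = number of pipeline workers, -b = events per worker per batch
    LS_OPTS="-w 4 -b 250"

The same -w and -b flags can be passed directly to bin/logstash when testing by hand.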

It is a VM running on a very new host server (Skylake CPUs, I believe). Simply uncompressing the same log-file takes under 2 seconds. 700 lines/second seems ridiculously slow. Consider: 15 years ago, by Moore's law, computers were 1024 times slower than today -- would 0.7 lines/second have been "not too bad" back in 2002?

Removing the GeoIP filter seems to help, but we want the information it provides, so that's not an option. Same for the UserAgent filter... Oh well, maybe upgrading Logstash will help -- I hear the GeoIP filter is parallelized now.
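
In the meantime I'm experimenting with trimming those filters down to just the fields we graph; something like this (the field names and the fields list are illustrative, not our exact config):

    filter {
      geoip {
        source => "client_ip"                                # illustrative source field
        fields => ["country_code2", "city_name", "location"] # only what we actually graph
      }
      useragent {
        source => "agent"                                    # illustrative source field
        target => "ua"
      }
    }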

It is a VM running on a very new host server (Skylake CPUs, I believe). Simply uncompressing the same log-file takes under 2 seconds. 700 lines/second seems ridiculously slow. Consider: 15 years ago, by Moore's law, computers were 1024 times slower than today -- would 0.7 lines/second have been "not too bad" back in 2002?

The CPUs don't matter if you don't saturate them. Please answer my questions instead of ranting.

One of the two CPUs seems saturated. According to the documentation, the default number of pipeline workers is the number of processors, so it must have been two in our case. I explicitly set it to 3 and saw no change -- only one processor is working (90-95% total load out of 200%).

The throughput has also dropped to about 355 lines per second, which we attribute to a second geoip filter now being applied to each log-line (one for the client's IP, one for the CDN's). That's with null as the sole output. When feeding the local Elasticsearch (running in the same VM), it gets much worse -- Logstash's CPU utilization drops to about 70-80% and the remaining 120-130% is taken up by the ES process.
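
For anyone wanting to reproduce the null-output measurement, a self-contained run looks roughly like this (the file name, install path, and worker count are illustrative):

    # filters-only.conf is the production filter section with
    #   input  { stdin {} }
    #   output { null {} }
    zcat cdn-sample.log.gz | /opt/logstash/bin/logstash -f filters-only.conf -w 3 -b 250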

Would upgrading LS, jruby, and/or Java offer a considerable improvement? Thanks!

I think Logstash 5.x is supposed to be faster than 2.x, at least after some tuning, so it's worth a shot. Upgrading Java isn't likely to help. Assuming you have ample memory so that Logstash isn't garbage-collecting itself to death, the main parameters to tune are the batch size and the number of pipeline workers.
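
In 5.x those knobs live in logstash.yml; a starting point to tune from might look like this (the numbers are illustrative, not recommendations):

    # /etc/logstash/logstash.yml (Logstash 5.x)
    pipeline.workers: 4       # defaults to the number of CPU cores
    pipeline.batch.size: 250  # events each worker collects before filtering; larger batches need more heap
    pipeline.batch.delay: 5   # ms to wait before flushing an undersized batch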

If you saturate the CPUs and still don't reach the throughput you want, it sounds like you have an underpowered machine, at least when running both Logstash and ES. As a rough benchmark, processing 230 events/s imposes a load of about 50-60% on my machine (also on Logstash 2.3).
