Why am I getting such poor performance?

monk · November 6, 2018, 3:12pm

I've got a large (23M record) CSV file I want to upload once to ES. I've made a few small tweaks to my LS config (10 workers, batch size 5000, Xms and Xmx 5 GB), but it still uploads only about 1000 records/s, compared to the Python ES bulk upload, which achieves 10k records/s (with a single process and, I assume, a single thread). Is there an obvious setting I'm missing?

I'd rather not have to write or maintain any code myself, but unless there's a quick and simple way to improve LS performance, maybe it's the better route? It will only be used for occasional one-off batch uploads.

Christian_Dahlqvist · November 6, 2018, 4:26pm

What does your config look like? Which filters are you using?

monk · November 6, 2018, 4:29pm

Just the CSV filter, given the separator and columns. Everything else in the config looks standard (a single input file, and ES output). Should I paste anything in particular?

monk · November 7, 2018, 5:44pm

As I write the Python code to use multiprocessing and deal with error scenarios, I'm increasingly hoping I will be able to rely on Logstash, if only I can configure it easily to perform at a similar speed.

Christian_Dahlqvist · November 7, 2018, 5:48pm

That sounds slow. How large are your events? How many columns? Is your python script running on the same hardware? What is the specification of the machine Logstash is running on?

monk · November 7, 2018, 6:03pm

The events are small. There are 11 columns, 10 of which are strings (250ish chars total) and one of which is an ip_range. The Python script is running on the same machine as ES (as is Logstash). I've tried across three machines: a 2016 MacBook Pro and two cloud virtual machines (with 1 CPU/15 GB RAM and 16 CPU/240 GB RAM respectively). In all cases, LS gets ~1k records/s. Python bulk gets between 10k and 20k.

Part of the problem might be the default logging (and the overhead of sending the console output over ssh to my machine so I can watch it and conduct my simplistic timing), but that seems unlikely to account for all of the difference.
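One way to test the console-output hypothesis (a sketch only, not something reported as tried in this thread) is to swap the default stdout output for the `dots` codec, which prints a single character per event instead of the whole event, or to drop the stdout output entirely. The `hosts` and `index` values below are the ones from the config posted later in the thread; everything else is an assumption.

```
output {
  elasticsearch {
    # Values taken from the config posted later in this thread.
    hosts => "http://localhost:9200"
    index => "prefix"
  }
  # Assumption: replaces a plain `stdout {}`. The dots codec emits one "."
  # per event, so progress stays visible with far less console/ssh traffic.
  stdout { codec => dots }
}
```

Comparing throughput with and without the stdout block would show how much of the gap, if any, comes from printing events.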

Christian_Dahlqvist · November 7, 2018, 6:06pm

I have seen reports that the dissect filter might be faster than the csv filter. Might be worth trying it to see if it makes any difference.

monk · November 7, 2018, 6:08pm

Ah yes thanks, just noticed https://github.com/logstash-plugins/logstash-filter-csv/issues/46. I'll try that out.

Christian_Dahlqvist · November 7, 2018, 6:09pm

It may also be worthwhile trying with a smaller batch size. Bigger is not always better and can add pressure to your cluster.

monk · November 7, 2018, 11:35pm

The dissect filter was very slightly faster, and reducing batch size (to 500) slightly more still. But it's still ~10x slower than using Python. If anyone on your team has any further ideas, that'd be swell. Otherwise, onward with Python!

Christian_Dahlqvist · November 8, 2018, 2:56am

I am surprised the difference is so big. It would probably help if you could share your config and how you are running Logstash.

monk · November 8, 2018, 4:05pm

Sure. The config is simple. This is with the CSV filter; the dissect version is basically the same:

input {
  file {
    path => "/tmp/2018-10-28-pfxs3.txt"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    separator => "	"
    columns => ["key_prefix", "match_prefix", "string_feature1", "string_feature2", "string_feature3", "string_feature4", "string_feature5", "string_feature6", "numeric_feature1", "string_feature7", "string_feature8"]
  }
}
output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "prefix"
    template => "prefix-template.json"
    template_overwrite => true
  }
  stdout {}
}

I just run logstash -f prefixes.conf. The logstash.yml is basically the default. And prefix-template.json just makes key_prefix an ip_range. Anything else it would be useful to know?
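The dissect version was not posted. A minimal sketch of what it might look like, assuming the same eleven tab-separated columns and the default `message` source field, follows. The whitespace between fields in the mapping is meant to be a literal tab character, and the `convert_datatype` entry is an assumption that `numeric_feature1` holds integers.

```
filter {
  dissect {
    # Field delimiters below are literal tab characters, matching the input file.
    mapping => {
      "message" => "%{key_prefix}	%{match_prefix}	%{string_feature1}	%{string_feature2}	%{string_feature3}	%{string_feature4}	%{string_feature5}	%{string_feature6}	%{numeric_feature1}	%{string_feature7}	%{string_feature8}"
    }
    # dissect yields strings; convert the one numeric column (assumed integer).
    convert_datatype => { "numeric_feature1" => "int" }
  }
}
```

Unlike the csv filter, dissect does not handle quoted fields, so this only fits data where no column contains an embedded tab.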

monk · November 20, 2018, 4:35pm

Any thoughts?

Christian_Dahlqvist · November 20, 2018, 4:39pm

No, I do not see anything that stands out.

system · December 18, 2018, 4:39pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
