Why am I getting such poor performance?

(Aditya Prasad) #1

I've got a large (23M record) CSV file I want to upload once to ES. I've made a few small tweaks to my LS config (10 workers, batch size 5000, Xms and Xmx 5 GB), but it still uploads only about 1000 records/s, compared to the Python ES bulk upload, which achieves 10k records/s (with a single process and, I assume, a single thread). Is there an obvious setting I'm missing?

I'd rather not have to write or maintain any code myself, but unless there's a quick and simple way to improve LS performance, maybe it's the better route? It will only be used for occasional one-off batch uploads.

(Christian Dahlqvist) #2

What does your config look like? Which filters are you using?

(Aditya Prasad) #3

Just the CSV filter, given the separator and columns. Everything else in the config looks standard (a single input file, and ES output). Should I paste anything in particular?

(Aditya Prasad) #4

As I write the Python code to use multiprocessing and deal with error scenarios, I'm increasingly hoping I will be able to rely on Logstash, if only I can configure it easily to perform at a similar speed.
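For reference, a minimal sketch of what such a Python loader might look like. It assumes the official `elasticsearch` Python client is installed and that the target index already has a suitable mapping. The function names `tsv_actions` and `upload`, the `columns` argument, and the defaults for `hosts`, `threads` and `chunk_size` are all hypothetical choices for illustration, not the code actually used in this thread. It also uses the client's thread-based `helpers.parallel_bulk` rather than multiprocessing.

```python
import csv
from typing import Dict, Iterable, Iterator, List


def tsv_actions(lines: Iterable[str], columns: List[str],
                index: str = "prefix") -> Iterator[Dict]:
    """Yield one bulk 'index' action per well-formed tab-separated line."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(columns):
            # Skip malformed rows; a real loader would probably log them.
            continue
        yield {"_index": index, "_source": dict(zip(columns, row))}


def upload(path: str, columns: List[str],
           hosts: str = "http://localhost:9200",
           threads: int = 4, chunk_size: int = 500) -> List[Dict]:
    """Stream the file into Elasticsearch and return the failed items."""
    # Imported here so the parsing helper above works without the client.
    from elasticsearch import Elasticsearch, helpers

    client = Elasticsearch(hosts)
    failures = []
    with open(path, newline="") as fh:
        results = helpers.parallel_bulk(
            client,
            tsv_actions(fh, columns),
            thread_count=threads,
            chunk_size=chunk_size,
            raise_on_error=False,  # collect failures instead of raising
        )
        for ok, item in results:
            if not ok:
                failures.append(item)
    return failures
```

Calling `upload("/path/to/file.tsv", ["col_a", "col_b"])` with the real column names would return the list of rejected documents, which is the part that needs the most care in a one-off batch load.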

(Christian Dahlqvist) #5

That sounds slow. How large are your events? How many columns? Is your Python script running on the same hardware? What is the specification of the machine Logstash is running on?

(Aditya Prasad) #6

The events are small. There are 11 columns, 10 of which are strings (250ish chars total) and one of which is an ip_range. The Python script is running on the same machine as ES (as is Logstash). I've tried across three machines: a 2016 MacBook Pro and two cloud virtual machines (with 1 CPU/15 GB RAM and 16 CPUs/240 GB RAM respectively). In all cases, LS gets ~1k records/s. Python bulk gets between 10k and 20k.

Part of the problem might be the default logging (and the overhead of sending the console output over ssh to my machine so I can watch it and conduct my simplistic timing), but that seems unlikely to account for all of the difference.

(Christian Dahlqvist) #7

I have seen reports that the dissect filter might be faster than the csv filter. Might be worth trying it to see if it makes any difference.

(Aditya Prasad) #8

Ah yes thanks, just noticed https://github.com/logstash-plugins/logstash-filter-csv/issues/46. I'll try that out.

(Christian Dahlqvist) #9

It may also be worthwhile trying with a smaller batch size. Bigger is not always better and can add pressure to your cluster.

(Aditya Prasad) #10

The dissect filter was very slightly faster, and reducing batch size (to 500) slightly more still. But it's still ~10x slower than using Python. If anyone on your team has any further ideas, that'd be swell. Otherwise, onward with Python!

(Christian Dahlqvist) #11

I am surprised the difference is so big. It would probably help if you could share your config and how you are running Logstash.

(Aditya Prasad) #12

Sure. The config is simple. This is with the CSV filter; the dissect version is basically the same:

input {
  file {
    path => "/tmp/2018-10-28-pfxs3.txt"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    # The separator is a literal tab character between the quotes.
    separator => "	"
    columns => ["key_prefix", "match_prefix", "string_feature1", "string_feature2", "string_feature3", "string_feature4", "string_feature5", "string_feature6", "numeric_feature1", "string_feature7", "string_feature8"]
  }
}
output {
  elasticsearch {
    hosts => "http://localhost:9200"
    index => "prefix"
    template => "prefix-template.json"
    template_overwrite => true
  }
  # Prints every event to the console as well.
  stdout {}
}

I just run logstash -f prefixes.conf. The logstash.yml is basically the default. And prefix-template.json just makes key_prefix an ip_range. Anything else it would be useful to know?

(Aditya Prasad) #13

Any thoughts?

(Christian Dahlqvist) #14

No, I do not see anything that stands out.