I've got a large (23M record) CSV file I want to upload once to ES. I've made a few small tweaks to my LS config (10 workers, batch size 5000, Xms and Xmx both 5 GB), but it still uploads only about 1,000 records/s, compared to the Python ES bulk upload, which achieves 10k records/s (with a single process and, I assume, a single thread). Is there an obvious setting I'm missing?
I'd rather not have to write or maintain any code myself, but unless there's a quick and simple way to improve LS performance, maybe writing the Python myself is the better route? It will only be used for occasional one-off batch uploads.
Just the CSV filter, given the separator and columns. Everything else in the config looks standard (a single input file, and ES output). Should I paste anything in particular?
As I write the Python code to use multiprocessing and deal with error scenarios, I'm increasingly hoping I will be able to rely on Logstash, if only I can configure it easily to perform at a similar speed.
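For reference, here is a minimal sketch of the kind of Python upload I mean. It is not my exact script: the index name `prefixes`, the host URL, and the function names are placeholders I made up for illustration. It uses `parallel_bulk` from the official `elasticsearch` client, which parallelises with threads rather than multiprocessing, and it only counts failed actions instead of handling them properly.

```python
import csv
from typing import Dict, Iterable, Iterator, List


def csv_actions(lines: Iterable[str], columns: List[str], index: str,
                delimiter: str = ",") -> Iterator[Dict]:
    """Turn CSV lines into bulk actions, one per row (no header row assumed)."""
    reader = csv.DictReader(lines, fieldnames=columns, delimiter=delimiter)
    for row in reader:
        yield {"_index": index, "_source": row}


def upload(path: str, columns: List[str], index: str = "prefixes",
           host: str = "http://localhost:9200",
           threads: int = 4, chunk_size: int = 5000) -> int:
    """Stream a CSV file into Elasticsearch; return the number of failed actions."""
    # Imported here so the action generator above works without the client installed.
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import parallel_bulk

    client = Elasticsearch(host)
    failed = 0
    with open(path, newline="") as f:
        results = parallel_bulk(client, csv_actions(f, columns, index),
                                thread_count=threads, chunk_size=chunk_size,
                                raise_on_error=False)
        # parallel_bulk is lazy: nothing is sent until the results are consumed.
        for ok, _item in results:
            if not ok:
                failed += 1
    return failed
```

The thread count and chunk size here are guesses rather than tuned values; the point is only that the client-side code stays small.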
That sounds slow. How large are your events? How many columns? Is your Python script running on the same hardware? What is the specification of the machine Logstash is running on?
The events are small. There are 11 columns, 10 of which are strings (250ish chars total) and one of which is an ip_range. The Python script is running on the same machine as ES (as is Logstash). I've tried across three machines: a 2016 MacBook Pro and two cloud virtual machines (with 1 CPU/15 GB RAM and 16 CPUs/240 GB RAM respectively). In all cases, LS gets ~1k records/s. Python bulk gets between 10k and 20k.
Part of the problem might be the default logging (and the overhead of sending the console output over ssh to my machine so I can watch it and conduct my simplistic timing), but that seems unlikely to account for all of the difference.
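To take the console out of the timing, one option is to poll the index's document count and compute the rate from that. A rough sketch, assuming the index is called `prefixes` and ES is on `localhost:9200` (both placeholders). It uses the `_count` endpoint, which only sees documents that have been refreshed, so short intervals will be noisy.

```python
import json
import time
import urllib.request


def doc_count(base_url: str, index: str) -> int:
    """Ask Elasticsearch how many documents the index currently holds."""
    with urllib.request.urlopen(f"{base_url}/{index}/_count") as resp:
        return json.load(resp)["count"]


def rate(count0: int, t0: float, count1: int, t1: float) -> float:
    """Documents indexed per second between two (count, time) samples."""
    return (count1 - count0) / (t1 - t0)


def watch(base_url: str = "http://localhost:9200", index: str = "prefixes",
          interval: float = 10.0, samples: int = 6) -> None:
    """Print the indexing rate once per interval, independent of Logstash's console output."""
    prev_count, prev_time = doc_count(base_url, index), time.monotonic()
    for _ in range(samples):
        time.sleep(interval)
        count, now = doc_count(base_url, index), time.monotonic()
        print(f"{rate(prev_count, prev_time, count, now):.0f} docs/s")
        prev_count, prev_time = count, now
```

Running `watch()` in a second terminal while Logstash is going would give a number that does not depend on logging or the ssh session.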
The dissect filter was very slightly faster, and reducing batch size (to 500) slightly more still. But it's still ~10x slower than using Python. If anyone on your team has any further ideas, that'd be swell. Otherwise, onward with Python!
I just run logstash -f prefixes.conf. The logstash.yml is basically the default. And prefix-template.json just makes key_prefix an ip_range. Anything else it would be useful to know?