I want to use ELK for analyzing big data: 24 files every day, each around 10GB, with about 104 columns and roughly 60M lines per CSV. I'd like to load those files into Elasticsearch via Logstash in under an hour. Is that possible? My hardware is an HP G8 server with 64GB RAM, a 32-core CPU, and a regular HDD (not SSD).
Is it possible to achieve that through configuration tuning alone?
So you have 24 files, each with 60M lines. That is 1.44 billion lines and 240GB per day. To do that in an hour you'd need around 400,000 lines per second. At that rate you would also need an extremely beefy Elasticsearch cluster to handle the corresponding indexing load. I don't think it would be possible on one machine.
Is there a reason you can't tail the files and import them throughout the day? Spreading that many lines over 24 hours instead of 1 hour works out to about 16,666 events per second, which is a lot more reasonable.
It would be easy enough to test your maximum throughput if you already have access to the machine. Dump one day's worth of log files into a directory on the machine, install Logstash, and edit the config. I would use the file input, no filters, and the null output.
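A minimal benchmark pipeline along those lines might look something like this (the path is a placeholder for wherever you dump the files):

```
input {
  file {
    path => "/data/benchmark/*.csv"   # hypothetical directory holding one day's files
    start_position => "beginning"     # read existing files from the top instead of tailing new lines
    sincedb_path => "/dev/null"       # don't remember offsets, so the test can be rerun from scratch
  }
}

# deliberately no filter block -- we only want raw read throughput

output {
  null {}   # discard every event; isolates read speed from any downstream cost
}
```

Run it with something like `bin/logstash -f benchmark.conf` and time how long it takes to drain the directory.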
Then start everything up and see how long it takes to go through all the files. Use a smaller subset of files if you don't want to wait as long. Keep in mind that adding filters (such as the csv or kv filter to split up each line) will slow things down, and adding your real inputs and outputs will slow it down further, so this only gives you a theoretical maximum.
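To see what parsing costs on top of that, a sketch of a filter stage for a wide CSV could be as simple as the following (column names are made up, and `autodetect_column_names` assumes each file starts with a header row):

```
filter {
  csv {
    separator => ","
    # either list all ~104 columns explicitly...
    # columns => ["timestamp", "host", "status", ...]
    # ...or, if the files have a header row, let the filter pick the names up from it:
    autodetect_column_names => true
  }
}
```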
Once you have this benchmark you can try tweaking some of the settings to see if you can improve on it, and add any filters you plan to use to see how they affect throughput.
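When you get to the tuning stage, the usual knobs are the pipeline worker and batch settings; exact names and defaults vary by Logstash version, so treat these values as a starting point rather than a recommendation:

```
# logstash.yml -- illustrative values only; adjust based on what your benchmark shows
pipeline.workers: 32       # defaults to the number of CPU cores; drives filter/output parallelism
pipeline.batch.size: 1000  # events per worker batch; larger batches usually help bulk outputs like Elasticsearch
pipeline.batch.delay: 50   # ms to wait before flushing an undersized batch
```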