I'm trying to import several gigabytes of CSV files (100+ million rows, 5 columns each), but throughput is very low (~1 MB/s). I'm not sure yet what the issue is, but maybe someone here has some leads.
What throughput could I expect on an 8 GB RAM, 8-core i7 box with an SSD for the stack and an external USB 3.0 HDD (known to read at 200 MB/s+) from which the data is imported, running the default Security Onion stack in Evaluation mode? Are there any known throughput issues with importing CSV files using the csv filter? Likewise for the date filter (used to extract timestamp values from the CSV)?
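For reference, a minimal pipeline of the kind being described would look something like the following. This is a sketch: the column names are hypothetical, since the actual 5-column layout and timestamp format aren't shown in the question.

```
filter {
  csv {
    separator => ","
    # hypothetical column names -- the original 5-column layout isn't shown
    columns => ["timestamp", "src_ip", "dst_ip", "port", "bytes"]
  }
  date {
    # parse the extracted timestamp column into @timestamp;
    # the ISO8601 pattern is an assumption about the input format
    match => ["timestamp", "ISO8601"]
  }
}
```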
On one of my servers I see a csv filter processing about 7,000 rows per second with a single worker thread; that should scale with the number of worker threads (and thus CPUs). A simple date filter is cheaper than a 5-column csv filter.
The csv filter appears to be quite expensive. A stripped-down regex that does not handle quoted fields gets about 3 times the throughput.
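If the fields never contain quoted separators, a cheaper filter can replace csv entirely. One option along these lines is the dissect filter, which splits on literal delimiters with no quote handling at all. Again a sketch with hypothetical field names:

```
filter {
  dissect {
    # split on commas only; no quoting support, so this breaks
    # if a field ever contains an embedded comma
    mapping => { "message" => "%{timestamp},%{src_ip},%{dst_ip},%{port},%{bytes}" }
  }
}
```

Because dissect does plain delimiter matching rather than per-field regex or quote-aware parsing, it tends to be noticeably cheaper than csv on well-behaved input.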