Performance issues while importing CSV files into Elasticsearch

I'm trying to import a few gigabytes of CSV files (over 100 million rows with 5 columns each), but the throughput is very low (~1 MB/s). I'm not sure yet what the issue is, but maybe someone here has some leads.

What throughput could I expect on an 8 GB RAM, i7 (octa-core) box with an SSD for the stack and an external HDD from which the data is imported (USB 3.0, known to read at 200 MB/s+), using the default Security Onion stack (Evaluation mode)? Are there any known throughput issues when importing CSV files with the csv filter? Likewise for the date filter (used to extract timestamp values from the CSV)?
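For context, the pipeline I'm testing looks roughly like the sketch below; the path, column names, date pattern, and index name are placeholders rather than the real values:

    input {
      file {
        path => "/path/to/usb/*.csv"          # external HDD mount point (placeholder)
        start_position => "beginning"
        sincedb_path => "/dev/null"           # re-read the files on every test run
      }
    }
    filter {
      csv {
        separator => ","
        columns => ["ts", "col2", "col3", "col4", "col5"]   # placeholder column names
      }
      date {
        match => ["ts", "ISO8601"]            # placeholder timestamp format
      }
    }
    output {
      elasticsearch {
        hosts => ["localhost:9200"]
        index => "csv-import"                 # placeholder index name
      }
    }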

When Logstash starts, it will log something like

[logstash.agent           ] Successfully started Logstash API endpoint {:port=>9600}

You can query that API endpoint. For example, this will tell you the time spent in each part of the pipeline:

curl -XGET 'localhost:9600/_node/stats/pipelines?pretty'
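If you have jq handy, something like this pulls out the per-filter timings. It assumes the usual layout of the response for a pipeline named main; the exact field names can vary a little between Logstash versions:

    curl -s 'localhost:9600/_node/stats/pipelines' |
      jq '.pipelines.main.plugins.filters[]
          | {name, millis: .events.duration_in_millis, out: .events.out}'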

On one of my servers I see a csv filter processing about 7,000 rows per second with a single worker thread, and that should scale with the number of CPUs. A simple date filter is cheaper than a 5-column csv.
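For example, the number of worker threads (which defaults to the number of CPU cores) can be set explicitly; the values here are just illustrative, sized for the 8-core box described above:

    # In logstash.yml (or via -w / --pipeline.workers on the command line):
    pipeline.workers: 8          # one filter/output worker per core
    pipeline.batch.size: 250     # illustrative; larger batches can help the elasticsearch output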

csv appears to be quite expensive. A stripped-down regex that does not handle quoted fields gets about 3 times the throughput:

    ruby { code => '
        # Split the message on commas with a bare regex. Quoted and empty
        # fields are not handled, and fields are named column0, column1, ...
        m = event.get("message").scan(/([^,]+)(,|$)/)
        m.each_index { |i|
            event.set("column#{i}", m[i][0])
        }
    ' }

See this blog post as well (but note that you now have to use nested field notation rather than dot notation, so [documents][rate_1m]).
