Slow data loading to Elasticsearch

Hi,

I am new to Elasticsearch. I am trying to build a reporting application with Elasticsearch as the data store. I receive input files, index them into ES via Logstash, and then search and filter in ES to produce the report output. The issue I'm facing is that data loading is very slow: about 10k docs per minute, whereas some tutorials I saw show 20k Apache docs per second. Obviously I'm missing something here, so please help me catch up and make the processing faster.

My system has 8 GB of RAM and a 4-core processor. The JVM heap is configured to 2 GB for both Logstash and ES. The input file contains 30 million docs (pipe-separated) and is 3.5 GB in size. Data is loaded into the default 5 shards. Logstash uses 4 workers (I found this via the metrics). The Logstash config is shown below.

    input {
      file {
        path => "D:/aaa/sample.txt"
        type => "test"
        start_position => "beginning"
        sincedb_path => "D:/aaa/bbb/null"
      }
    }
    filter {
      csv {
        separator => "|"
        columns => ["1","2","3","4","5","6","7","8","9"]
      }
    }
    output {
      elasticsearch {
        action => "index"
        hosts => [ "localhost:9200" ]
        index => "sampleindex"
      }
      stdout { codec => dots }
    }

What would be an ideal way to improve performance? I'd like to load the above file with 30 million records in a few minutes (10 to 15).

Thanks in Advance,
Gowtham
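
For scale, the figures in the post above imply the following rates. This is a small Python sketch of the arithmetic, using only the numbers quoted there (30 million docs, about 10k docs per minute, a 10 to 15 minute target); the function names are illustrative.

```python
# Back-of-the-envelope throughput arithmetic using the figures quoted in the post.

TOTAL_DOCS = 30_000_000          # "30 Mil docs"
CURRENT_DOCS_PER_MIN = 10_000    # observed Logstash rate


def load_time_hours(total_docs, docs_per_min):
    """Hours needed to load total_docs at a steady docs_per_min rate."""
    return total_docs / docs_per_min / 60


def required_docs_per_sec(total_docs, target_minutes):
    """Steady rate (docs/s) needed to finish total_docs within target_minutes."""
    return total_docs / (target_minutes * 60)


current_hours = load_time_hours(TOTAL_DOCS, CURRENT_DOCS_PER_MIN)
rate_for_15_min = required_docs_per_sec(TOTAL_DOCS, 15)
rate_for_10_min = required_docs_per_sec(TOTAL_DOCS, 10)

print(f"Full load at the current rate: {current_hours:.0f} hours")
print(f"Needed for a 15-minute load: {rate_for_15_min:,.0f} docs/s")
print(f"Needed for a 10-minute load: {rate_for_10_min:,.0f} docs/s")
```

At 10k docs per minute (about 167 docs/s) the full file would take roughly 50 hours, so the stated target amounts to a 200x to 300x speedup.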

Lots of potential things you could do, but before trying anything else remove the:

stdout {codec => dots}

and retry. stdout has always been a bottleneck for me.

The speed improved from 10,000 to 12,000 docs per minute, so removing stdout did not have much of an impact. Is there anything else major that I'm missing here?

Which version of Elasticsearch and Logstash are you using? What is the average size of a record? What is resource utilisation looking like on the node during indexing, specifically around CPU usage and disk IO?

Both Logstash and Elasticsearch are version 5.4.1. A sample record from the file is below:

A|ABC Enterprise Customer|10|10010|111000123456780001001353|ACDFTR|000|TT|2017-02-28

Memory usage is around 7 GB and CPU utilization is below 35% most of the time.

Is this a VM or a bare-metal server?

It's a laptop running Windows 7.

Can you show the graph covering disk IO in greater detail?

I'm not able to get more detail from the graph, but it is always at its peak except for a sudden drop, as shown in the image above.

Does that indicate that performance is limited by disk I/O?

Do you mean that the disk I/O rate of the machine is slow, which in turn affects ES performance? If so, note that I am able to load a tad faster with Microsoft's SSIS ETL tool on the very same machine (1 million records in 3.5 minutes).
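
To put the two rates side by side, here is a short Python sketch using only the figures quoted in this thread (1 million records in 3.5 minutes for SSIS, 12,000 docs per minute for Logstash after removing stdout). The two tools write to different stores, so this is a rough comparison rather than a like-for-like benchmark.

```python
# Rate comparison using the figures quoted in the thread.

SSIS_DOCS = 1_000_000            # records loaded by SSIS
SSIS_MINUTES = 3.5               # time SSIS took for those records
LOGSTASH_DOCS_PER_MIN = 12_000   # Logstash rate after removing stdout
TOTAL_DOCS = 30_000_000          # size of the full input file

ssis_docs_per_min = SSIS_DOCS / SSIS_MINUTES
speedup = ssis_docs_per_min / LOGSTASH_DOCS_PER_MIN
ssis_full_load_min = TOTAL_DOCS / ssis_docs_per_min

print(f"SSIS rate: {ssis_docs_per_min:,.0f} docs/min")
print(f"SSIS vs Logstash: {speedup:.1f}x")
print(f"Full file at the SSIS rate: {ssis_full_load_min:.0f} minutes")
```

The SSIS rate works out to roughly 24 times the Logstash rate, yet even at that rate the full 30 million records would take about 105 minutes.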

Does SSIS ETL also index into the same Elasticsearch instance?

No, it loads data into a SQL Server table.

Guys, I am struggling to find a solution. Could someone please help me figure out the cause of Logstash's low performance?

Is the SQL Server database also on your laptop? Have you increased the refresh interval on the index you are indexing into? You can also try to increase the pipeline batch-size, e.g. to 1000, to see if this makes a difference.
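
As a sketch of the refresh-interval suggestion: the index settings API accepts an `index.refresh_interval` value, and the snippet below builds (but does not send) such a request with Python's standard library. The host and index name come from the Logstash config earlier in the thread; the `30s` value and the function name `build_refresh_request` are assumptions for illustration, not values recommended in the thread.

```python
import json
import urllib.request


def build_refresh_request(host, index, interval):
    """Build a PUT request that sets index.refresh_interval on one index."""
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Host and index are taken from the Logstash output block in the thread;
# "30s" is an assumed example value (the Elasticsearch default is "1s").
req = build_refresh_request("localhost:9200", "sampleindex", "30s")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To apply it against a running node:
#   urllib.request.urlopen(req)
```

The batch size is a Logstash setting rather than an Elasticsearch one: in Logstash 5.x it is `pipeline.batch.size` in `logstash.yml` (default 125), or the `-b` command-line flag.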
