Indexing 570 million rows


(Joao Otavio Sakai Genari) #1

Hello, my name is João Sakai and I'm in the middle of the greatest challenge of my life:

"Using Logstash, I have to index a CSV with 46 columns and 570 million rows into Elasticsearch as soon as possible."

Environment Configurations:

**Logstash Configuration**


Amazon Instance Type: m3.medium

Logstash config:

Index Template Config:


Note:
As you can see, I have already applied the optimizations "number_of_replicas: 0" and "refresh_interval: -1".
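For reference, those two settings would typically sit in the index template's settings block, roughly like this (a minimal sketch — the template name and index pattern are placeholders, not the poster's actual template):

```json
{
  "template": "my-csv-index-*",
  "settings": {
    "number_of_replicas": 0,
    "refresh_interval": "-1"
  }
}
```

Note that `refresh_interval: -1` disables automatic refreshes entirely, so documents won't be searchable until the setting is restored (e.g. to `"1s"`) after the bulk load, and replicas should likewise be re-enabled afterwards.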


**Elasticsearch Configuration**

Amazon Instance Type: m3.2xlarge

Cluster Config:


Note:
I'm following the Indexing performance guide: https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html


My results were really disappointing! :cry:

After running the indexing process for 1 hour, the total of indexed documents was only 2 million, which leads me to think there is something wrong with the Logstash configuration, the Elasticsearch configuration, or something else.

Is there some configuration wrong? Should I change the EC2 instance configurations?

Could someone give me some insights on how to index a large bulk of data?


(Ed) #2

How many data nodes do you have spawned?

A quick test for Logstash would be to send the output to /dev/null, just to see the CSV get read, parsed, and output without disk speed as a factor.
Replace your elasticsearch output section with:

output {
  file {
    path => "/dev/null"
  }
}

For example, I index about 10K documents per second on 9 data nodes, each with 24 CPUs, a 30GB heap, and an EMC SAN. But my documents are also pretty complex.


(Ed) #3

Oh, and I would install Marvel to see your cluster's performance and see where it is bottlenecking.


(Anh) #4

You should remove the stdout { codec => json } output. It should be enabled only while you are debugging; having stdout slows processing down a lot.

In addition to @eperry's suggestion, you can use the Logstash metrics filter plugin https://www.elastic.co/guide/en/logstash/current/plugins-filters-metrics.html to measure the rate of messages flowing through Logstash.
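A minimal sketch of such a metrics setup, following the plugin's documented pattern (the meter name, tag, and output format here are illustrative, not from this thread):

```
filter {
  metrics {
    meter => "events"      # count every event passing through the pipeline
    add_tag => "metric"    # tag the periodic metric events the filter emits
  }
}

output {
  # only the generated metric events carry the "metric" tag
  if "metric" in [tags] {
    stdout {
      # print the 1-minute event rate computed by the metrics filter
      codec => line { format => "rate_1m: %{[events][rate_1m]}" }
    }
  }
}
```

This lets you watch throughput without writing every document to stdout.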

I think that flush_size => 100 is a bit low, given that the default is already 125.
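For bulk loading, the elasticsearch output section might look something like this (a sketch only — the hosts, index name, and values are illustrative starting points to tune against your own measurements, not the poster's config):

```
output {
  elasticsearch {
    hosts => ["localhost:9200"]   # placeholder; point at your cluster
    index => "my-csv-index"       # placeholder index name
    flush_size => 5000            # batch more documents per bulk request
    workers => 4                  # parallel output workers
  }
}
```

Larger bulk batches generally reduce per-request overhead, but past a point they just increase memory pressure, so it's worth testing a few sizes.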

The size of your CSV file also affects read speed.

