Indexing 570 million rows

(Joao Otavio Sakai Genari) #1

Hello, my name is João Sakai and I'm in the middle of the greatest challenge of my life:

"Using Logstash, I have to index a CSV with 46 columns and 570 million rows into Elasticsearch as soon as possible."

Environment Configurations:

**Logstash Configuration**

Amazon Instance Type: m3.medium

Logstash config:

Index Template Config:

As you can see, I have already applied the optimizations "number of replicas: 0" and "refresh interval: -1".
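For reference, those two settings in an index template body look something like this (a sketch; the template name and index pattern here are placeholders, and your actual template will also carry your 46-column mappings):

```json
{
  "template": "csv-*",
  "settings": {
    "number_of_replicas": 0,
    "refresh_interval": "-1"
  }
}
```

With `refresh_interval: -1`, remember to re-enable refreshes (and bump replicas back up) once the bulk load finishes.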

**Elasticsearch Configuration**

Amazon Instance Type: m3.2xlarge

Cluster Config:

I'm following the Indexing performance guide:

My results were really disappointing! :cry:

Running the indexing process for 1 hour, the total of indexed documents was only 2 million (roughly 550 documents per second), which leads me to think that there is something wrong with the Logstash configuration, the Elasticsearch configuration, or something else.

Is some configuration wrong? Should I change the EC2 instance types?

Could someone give me some insights on how to index a large bulk of data?

(Ed) #2

How many data nodes do you have spawned?

A quick test for Logstash would be to send the output to /dev/null, just to see the CSV get read, parsed, and output without disk or network speed in the way.
Replace your elasticsearch output section with:

path => "/dev/null"
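Spelled out, the output block for that test would be something like this sketch (the rest of the pipeline stays unchanged):

```conf
output {
  # Discard events: this measures pure read + parse throughput,
  # with no Elasticsearch or real disk write in the path.
  file {
    path => "/dev/null"
  }
}
```

If throughput is still low with this output, the bottleneck is on the Logstash side (input/filter), not Elasticsearch.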

For example, I index about 10K documents per second on 9 data nodes, each with 24 CPUs, a 30GB heap, and an EMC SAN, but my documents are also pretty complex.

(Ed) #3

Oh, and I would install Marvel to see your cluster's performance and where it is bottlenecking.

(Anh) #4

You should remove the stdout { codec => json } output. It should be enabled only when you are debugging; having stdout slows processing down a lot.

In addition to @eperry's suggestion, you can use the Logstash metrics plugin to measure message throughput through LS.
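A minimal sketch of the metrics filter (the meter name "events" and the tag are illustrative):

```conf
filter {
  metrics {
    meter => "events"     # count every event passing through
    add_tag => "metric"   # tag the emitted metric events
  }
}
output {
  # Print only the periodic metric events, not the CSV data itself.
  if "metric" in [tags] {
    stdout {
      codec => line { format => "1m rate: %{[events][rate_1m]}" }
    }
  }
}
```

This periodically prints a 1-minute moving average of events per second, so you can see where throughput drops as you change settings.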

I think that flush_size => 100 is a bit low. The default is already 125.
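For what it's worth, a larger bulk size is set on the elasticsearch output like this (a sketch assuming the Logstash 2.x elasticsearch output; the host, index name, and the value 5000 are placeholders, and the sweet spot is worth benchmarking for your documents):

```conf
output {
  elasticsearch {
    hosts => ["localhost:9200"]  # placeholder; use your cluster address
    index => "csv-index"         # placeholder index name
    flush_size => 5000           # batch more documents per bulk request
  }
}
```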

The size of your CSV file also affects read speed.
