Indexing 570 million rows


(Joao Otavio Sakai Genari) #1

Hello, my name is João Sakai and I'm in the middle of the greatest challenge of my life:

"Using Logstash, I have to index a CSV with 46 columns and 570 million rows into Elasticsearch as soon as possible."

Environment Configurations:

**Logstash Configuration**


Amazon Instance Type: m3.medium

Logstash config:

Index Template Config:


Note:
As you can see, I have already applied the optimizations "number_of_replicas: 0" and "refresh_interval: -1".
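For reference, those two settings would typically sit in the index template's settings block, roughly like this (a minimal sketch — the template name and index pattern are placeholders, not the poster's actual template):

```json
{
  "template": "my-csv-index-*",
  "settings": {
    "number_of_replicas": 0,
    "refresh_interval": "-1"
  }
}
```

Note that `refresh_interval: -1` disables automatic refreshes entirely, so documents won't be searchable until the setting is restored (e.g. to `"1s"`) after the bulk load, and replicas should likewise be re-enabled afterwards.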


**Elasticsearch Configuration**

Amazon Instance Type: m3.2xlarge

Cluster Config:


Note:
I'm following the Indexing performance guide: https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html


My results were really disappointing! :cry:

After running the indexing process for 1 hour, the total of indexed documents was only 2 million, which leads me to think there is something wrong with the Logstash configuration, the Elasticsearch configuration, or something else.

Is there some configuration wrong? Should I change the EC2 instance configurations?

Could someone give me some insights on how to index a large bulk of data?


(Ed) #2

How many data nodes do you have spawned?

A quick test for Logstash would be to send the output to /dev/null, just to see the CSV get read, parsed, and output without disk speed as a factor.
Replace your elasticsearch output section with:

output {
  file {
    path => "/dev/null"
  }
}

For example, I index about 10K documents per second on 9 data nodes, each with 24 CPUs, a 30GB heap, and an EMC SAN. But my documents are also pretty complex.


(Ed) #3

Oh, and I would install Marvel to see your cluster's performance and see where it is bottlenecking.


(Anh) #4

You should remove the stdout { codec => json } output. It should be enabled only while you are debugging; having stdout slows processing down a lot.

In addition to @eperry's suggestion, you can use the Logstash metrics filter plugin https://www.elastic.co/guide/en/logstash/current/plugins-filters-metrics.html to measure the rate of messages flowing through Logstash.
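A minimal sketch of such a metrics setup, following the plugin's documented pattern (the meter name, tag, and output format here are illustrative, not from this thread):

```
filter {
  metrics {
    meter => "events"      # count every event passing through the pipeline
    add_tag => "metric"    # tag the periodic metric events the filter emits
  }
}

output {
  # only the generated metric events carry the "metric" tag
  if "metric" in [tags] {
    stdout {
      # print the 1-minute event rate computed by the metrics filter
      codec => line { format => "rate_1m: %{[events][rate_1m]}" }
    }
  }
}
```

This lets you watch throughput without writing every document to stdout.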

I think that flush_size => 100 is a bit low, given that the default is already 125.
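For bulk loading, the elasticsearch output section might look something like this (a sketch only — the hosts, index name, and values are illustrative starting points to tune against your own measurements, not the poster's config):

```
output {
  elasticsearch {
    hosts => ["localhost:9200"]   # placeholder; point at your cluster
    index => "my-csv-index"       # placeholder index name
    flush_size => 5000            # batch more documents per bulk request
    workers => 4                  # parallel output workers
  }
}
```

Larger bulk batches generally reduce per-request overhead, but past a point they just increase memory pressure, so it's worth testing a few sizes.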

The size of your CSV file also affects read speed.

