Best method - Importing 50 x 10GB CSV files into Elasticsearch on GCE


(Jordan Schwartz) #1

I am looking to use ES to index about 400 million records, broken into 50 files with about 360 columns in each file.
Once indexed, the data will remain static; I am just looking for the best approach for the initial load.

The data is in CSV format. I signed up for Google Compute Engine and spun up 3 ES instances.

I attempted to use Logstash locally on my MacBook and send the files to the remote ES server, but I am only getting about 400 documents per second.

There has to be a better approach to loading this much data.

Any suggestions?


(Mark Walkom) #2

Logstash will do a lot more than that; try running it closer to the ES server.


(None) #3

360 columns suggests "large" documents. What's the average size of a doc?

Are you using the bulk API?

  • You can probably disable replicas on the index until indexing is done.
  • Maybe set the refresh interval higher than the default for the duration of the bulk inserts (see the sketch after this list).
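
Something like this should do both (untested; the host and the index name indextest-data1 are placeholders for whatever yours actually are):

curl -XPUT 'http://localhost:9200/indextest-data1/_settings' -d '{
  "index": {
    "number_of_replicas": 0,
    "refresh_interval": "30s"
  }
}'

Just remember to set number_of_replicas back (and refresh_interval back to "1s") once the load is done.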

The rest may depend on your hardware. I'll have to check some of the settings I used on mine.


(Jordan Schwartz) #4

I did try running it straight from the GCE VM and was actually getting slower indexing rates (~300-400/sec).
I am using out-of-the-box configurations, so I'm assuming I should still see better results.

@javadevmtl I like the idea of disabling replicas; I'm going to try that. Here is my Logstash config. See anything wrong?

input {
  file {
    path => "/elasticsearch/data.csv"
    start_position => "beginning"
    type => "data"
  }
}

filter {
  csv {
    separator => "|"
  }
}

output {
  elasticsearch {
    action => "index"
    host => "localhost"
    port => "9200"
    index => "indextest-data1"
    workers => 2
    protocol => "http"
    cluster => "elasticsearch-cluster"  # only used by the node/transport protocols; ignored with http
  }

  # stdout { codec => json }
}
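
One more thing I am considering, if I read the docs right: raising the bulk batch size with flush_size on the elasticsearch output. A rough sketch (the numbers are just a guess for my setup, not tested):

output {
  elasticsearch {
    host => "localhost"
    port => "9200"
    protocol => "http"
    index => "indextest-data1"
    workers => 4        # more parallel output workers
    flush_size => 5000  # send larger bulk batches per request
  }
}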

Thanks for the help so far.


(None) #5

Do you only have one ES node? I would suggest load balancing your Logstash requests across the ES nodes you have. The Java client will do that, and I think you can set the Logstash config to use the native API instead of HTTP.
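
For example, with the node protocol the Logstash output joins the cluster as a client node and should spread bulk requests across the data nodes. A rough sketch, assuming the cluster name from your config:

output {
  elasticsearch {
    protocol => "node"                  # native API; joins the cluster as a non-data client node
    cluster  => "elasticsearch-cluster"
    index    => "indextest-data1"
  }
}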

Also, on the ES cluster try setting:

"indices.store.throttle.max_bytes_per_sec": "200mb" (defaults to 20mb; 200mb is good for my hardware)

And on your index settings try:

"index.refresh_interval": "30s" (defaults to 1s)
"index.translog.flush_threshold_size": "1000mb" (defaults to 512mb; this again depends on your hardware)


(Jordan Schwartz) #6

I switched to Amazon EC2.
I have 2 ES nodes in a load-balanced environment.
Still getting a slow rate, but I will try the config settings now.

