Best method - Importing 50 x 10GB CSV files into Elasticsearch on GCE


(Jordan Schwartz) #1

I am looking to use ES to index about 400 million records, broken into 50 files with about 360 columns in each file.
Once indexed, the data will remain static; I am just looking for the best approach for the initial load.

The data is in CSV format. I signed up for Google Compute Engine and spun up 3 ES instances.

I attempted to use Logstash locally on my MacBook and send the files to the remote ES server, but I am only getting about 400 documents per second.

There has to be a better approach to loading this much data.

Any suggestions?


(Mark Walkom) #2

Logstash will do a lot more than that; try running it closer to the ES server.


(None) #3

360 columns suggests "large" documents. What's the average size of a doc?

Are you using the bulk API?

  • You can probably disable replicas on the index until indexing is done.
  • Maybe set the refresh interval higher than the default for the duration of the bulk inserts (see the sketch after this list).
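
Something like this should do both (untested; the host and the index name indextest-data1 are placeholders for whatever yours actually are):

curl -XPUT 'http://localhost:9200/indextest-data1/_settings' -d '{
  "index": {
    "number_of_replicas": 0,
    "refresh_interval": "30s"
  }
}'

Just remember to set number_of_replicas back (and refresh_interval back to "1s") once the load is done.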

The rest may depend on your hardware. I'll have to check some of the settings I used on mine.


(Jordan Schwartz) #4

I did try running it straight from the GCE VM and was actually getting slower indexing rates (~300-400/sec).
I am using out-of-the-box configurations, so I'm assuming I should still see better results.

@javadevmtl I like the idea of disabling replicas; I'm going to try that. Here is my Logstash config. See anything wrong?

input {
  file {
    path => "/elasticsearch/data.csv"
    start_position => "beginning"
    type => "data"
  }
}

filter {
  csv {
    separator => "|"
  }
}

output {
  elasticsearch {
    action => "index"
    host => "localhost"
    port => "9200"
    index => "indextest-data1"
    workers => 2
    protocol => "http"
    cluster => "elasticsearch-cluster"  # only used by the node/transport protocols; ignored with http
  }

  # stdout { codec => json }
}
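
One more thing I am considering, if I read the docs right: raising the bulk batch size with flush_size on the elasticsearch output. A rough sketch (the numbers are just a guess for my setup, not tested):

output {
  elasticsearch {
    host => "localhost"
    port => "9200"
    protocol => "http"
    index => "indextest-data1"
    workers => 4        # more parallel output workers
    flush_size => 5000  # send larger bulk batches per request
  }
}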

Thanks for the help so far.


(None) #5

Do you only have one ES node? I would suggest load balancing your Logstash requests across the ES nodes you have. The Java client will do that, and I think you can set the Logstash config to use the native API instead of HTTP.
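
For example, with the node protocol the Logstash output joins the cluster as a client node and should spread bulk requests across the data nodes. A rough sketch, assuming the cluster name from your config:

output {
  elasticsearch {
    protocol => "node"                  # native API; joins the cluster as a non-data client node
    cluster  => "elasticsearch-cluster"
    index    => "indextest-data1"
  }
}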

Also, on the ES cluster try setting:

"indices.store.throttle.max_bytes_per_sec": "200mb" (defaults to 20mb; 200mb is good for my hardware)

And on your index settings try:

"index.refresh_interval": "30s" (defaults to 1s)
"index.translog.flush_threshold_size": "1000mb" (defaults to 512mb; this again depends on your hardware)


(Jordan Schwartz) #6

I switched to Amazon EC2.
I have 2 ES nodes in a load-balanced environment.
Still getting a slow rate, but I will try the config settings now.

