I am looking to use ES to index about 400 million records broken into 50 files with about 360 columns in each file.
Once indexed, the data will remain static. I am just looking for the best approach to load up this data initially.
The data is in CSV format. I signed up for Google Compute Engine and spun up 3 ES instances.
I attempted to run Logstash locally on my MacBook and send the files to the remote ES cluster, but I am only getting about 400 documents per second.
There has to be a better approach to loading this much data.
I did try running it straight from the GCE VM, and I was actually getting slower indexing rates (~300-400 docs/sec).
I am using out-of-the-box configurations, but I'm assuming I should still see better results than this.
@javadevmtl I like the idea of disabling replicas; I'm going to try that. Here is my Logstash config. Do you see anything wrong?
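Roughly, the pipeline looks like the sketch below (the paths, column names, host, and index name are placeholders; the real config lists all ~360 columns):

```
input {
  file {
    path => "/data/records/*.csv"        # placeholder path to the 50 CSV files
    start_position => "beginning"
    sincedb_path => "/dev/null"          # always re-read the files from the start
  }
}

filter {
  csv {
    separator => ","
    columns => ["col1", "col2", "col3"]  # placeholders; the real list has ~360 columns
  }
}

output {
  elasticsearch {
    host => "x.x.x.x"                    # placeholder: a remote ES node on GCE
    protocol => "http"
    index => "records"                   # placeholder index name
  }
}
```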
Do you only have one ES node? I would try to load balance your Logstash requests across the ES node(s) you have. The Java client will do that, and I think you can set the Logstash config to use the native API instead of HTTP.
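Something like this in the output section, for example. The option names here are from the Logstash 1.x elasticsearch output (protocol/cluster/host) and may differ in your Logstash version; the cluster name, host, and sizes are placeholders:

```
output {
  elasticsearch {
    protocol   => "transport"       # native Java client instead of HTTP
    cluster    => "elasticsearch"   # placeholder: your ES cluster name
    host       => "x.x.x.x"         # placeholder: any reachable node
    index      => "records"         # placeholder index name
    flush_size => 5000              # placeholder bulk size per request
    workers    => 4                 # placeholder: parallel output workers
  }
}
```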
Also, on the ES cluster try setting...
"indices.store.throttle.max_bytes_per_sec": "200mb" -- for my hardware 200mb is good (defaults to 20mb).
And in your index settings try setting...
"index.refresh_interval": "30s" (defaults to 1s)
"index.translog.flush_threshold_size": "1000mb" <-- this again depends on your hardware (defaults to 512mb).