I have 50 million records to import into ES.
What sort of configuration would I need for this: how many nodes, clusters, etc.? I am using the NEST API for C#.
How can I monitor how optimally the import is being done?
It really depends on a lot of things: the size of the documents, the hardware, etc. You should just try it. I'd start with 1 machine for that.
You can read the log for things like merges being throttled.
You can check the stack traces for things you don't expect.
Mostly, though, you should be testing different batch sizes for the bulk API and trying to find a good one for your data.
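To make the batch-size testing concrete, here is a minimal sketch in Python (not NEST) that splits documents into batches, builds the newline-delimited JSON body that the _bulk endpoint expects, and reports documents per second for a given batch size. The helper names (`batches`, `bulk_body`, `measure`) and the `send` callable are hypothetical; `send` stands in for whatever actually POSTs the body to your cluster. The action line assumes a version where `_index` alone is enough; older versions may also need a `_type`.

```python
import json
import time
from itertools import islice


def batches(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def bulk_body(index, chunk):
    """Build a _bulk request body: one action line, then one source line, per document."""
    lines = []
    for doc in chunk:
        # Assumption: `_index` alone is sufficient; older versions may also need `_type`.
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


def measure(docs, index, size, send):
    """Send all docs in batches of `size` via `send`; return documents per second."""
    start = time.perf_counter()
    count = 0
    for chunk in batches(docs, size):
        send(bulk_body(index, chunk))  # `send` is a hypothetical HTTP POST to /_bulk
        count += len(chunk)
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")
```

Running `measure` over the same sample of documents with a few different sizes (say 500, 1000, 5000) and comparing the rates is one way to pick a batch size for your data.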
I've had lots of success tuning the refresh_interval on the index up to 30s, or to -1 (infinity, i.e. refresh disabled), when doing bulk imports. That speeds them up a ton. See this for how to set it. If you set it to -1 you have to manually call _refresh after the process is done.
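As a rough illustration of the refresh_interval approach, the sketch below only builds the REST requests (method, path, body) you would issue before and after the import; it does not send them. The function names and the index name are hypothetical, and restoring to "1s" assumes you want the default interval back afterwards.

```python
import json


def refresh_settings(index, interval):
    """Return (method, path, body) for updating refresh_interval on `index`."""
    body = json.dumps({"index": {"refresh_interval": interval}})
    return "PUT", f"/{index}/_settings", body


def bulk_import_plan(index, restore_to="1s"):
    """Return the requests to issue before and after the bulk import, in order."""
    # Before: disable automatic refresh for the duration of the import.
    before = [refresh_settings(index, "-1")]
    # After: restore the interval, then refresh once so the new documents are searchable.
    after = [
        refresh_settings(index, restore_to),
        ("POST", f"/{index}/_refresh", ""),
    ]
    return before, after


before, after = bulk_import_plan("records")  # "records" is a placeholder index name
for method, path, body in before + after:
    print(method, path, body)
```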
Hi, check these settings:
index.translog.flush_threshold_ops
index.translog.flush_threshold_size
Also, while configuring the nodes, do not forget to set "bootstrap.mlockall: true" :)