Best way to bulk insert?

linlma · July 8, 2015, 12:38am

Hello Elastic experts,

What is the best way to insert a large number of records into an index?

Thanks in advance,
Lin

dadoonet · July 8, 2015, 12:59am

Use the bulk API, indeed.

If it's the first big initialization of your index, you could set the number of replicas to 0 and restore it to 1 when the injection is over.

My 2 cents.

linlma · July 8, 2015, 1:15am

Thanks David,

Are there any bulk insert samples? I am new to this concept and would like to follow what an expert like you recommends.

BTW, what is the benefit of "you could set number of replicas to 0 and restore it to 1 when the injection is over"?

Junheng_Gong · July 8, 2015, 2:45am

Use bulk requests and send them in parallel, increase the server-side bulk thread pool and queue size, and disable refresh and replicas during the load.
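
A minimal sketch of the "bulk plus parallel" idea in Python, using only the standard library. The names `chunked` and `send_in_parallel`, the chunk size, and the worker count are illustrative assumptions, and `send_chunk` is a hypothetical stand-in for whatever function actually sends one batch to the cluster as a single bulk request.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def send_in_parallel(docs, send_chunk, chunk_size=1000, workers=4):
    """Split docs into batches and hand each batch to send_chunk on a thread pool.

    send_chunk is a hypothetical callable: it should send one batch as a
    single bulk request and return whatever result the caller wants to keep.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the batches.
        return list(pool.map(send_chunk, chunked(docs, chunk_size)))


if __name__ == "__main__":
    # Stand-in sender that only counts documents instead of calling a cluster.
    print(send_in_parallel(range(2500), len, chunk_size=1000))
```

Good values for the batch size and the number of workers depend on document size and cluster capacity, so they are worth measuring rather than guessing.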

linlma · July 8, 2015, 3:01am

Thanks, Junheng_Gong, for the details.

I am a new user of Elasticsearch. Are there any reference samples for the tactics you mentioned?

regards,
Lin

Srinath_C · July 8, 2015, 3:38am

Hi @linlma,

Here is what worked for me:
1. Use the Java bulk import API
2. Use async replication
3. Increase index.refresh_interval to something like 10s

These factors greatly improved our import functionality.
You will find the links to these easily in the Elasticsearch documentation.

Regards,
Srinath.

linlma · July 8, 2015, 4:13am

Thanks Srinath_C,

Is only Java supported? I mostly use Python.

regards,
Lin

Christian_Dahlqvist · July 8, 2015, 5:20am

All language clients should support bulk operations, including the official Python client, so you do not need to use Java. Here is also an example of how it is used.
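
For illustration, here is a rough sketch of what a bulk request looks like on the wire, written with the Python standard library only. The bulk endpoint takes newline-delimited JSON: an action line followed by a source line for each document. The function names, the `id` field, and the index name `records` are assumptions made up for this example; older Elasticsearch versions also expect a `_type` in the action line. The official Python client ships helpers that wrap this format, so in practice you would likely use those instead.

```python
import json
import urllib.request


def build_bulk_body(index, docs, id_field="id"):
    """Return an NDJSON bulk body: one action line, then one source line, per doc."""
    lines = []
    for doc in docs:
        action = {"index": {"_index": index}}
        if id_field in doc:
            # Reuse the document's own id so re-running the load overwrites
            # existing documents instead of duplicating them.
            action["index"]["_id"] = doc[id_field]
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


def send_bulk(base_url, body):
    """POST the body to <base_url>/_bulk and return the parsed JSON response."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only prints the payload; call send_bulk() against a real cluster URL to index it.
    print(build_bulk_body("records", [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]))
```

The response to a bulk request reports success or failure per item, so it is worth checking its `errors` flag rather than assuming every document was indexed.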

linlma · July 8, 2015, 5:35am

Thanks Christian,

This is exactly what I am looking for. Do you know if there is an API to set up the async replication model and increase index.refresh_interval to something like 10s? I am not sure whether this kind of configuration can be done from the client side.

regards,
Lin

Christian_Dahlqvist · July 11, 2015, 4:01am

A list of the index settings that can be changed through the APIs can be found here. This shows how you can change the refresh interval for bulk indexing and should be supported through all language clients. If you are performing e.g. a bulk load and want to maximise performance, an option might also be to turn the number of replicas down to 0 during indexing and then increase it once the loading has completed. This reduces the amount of network traffic and can speed up loading. The drawback is naturally that you only have a single copy of the data during the indexing which makes it less resilient.
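
As a sketch of how those settings changes might look from Python, using only the standard library: the index settings endpoint accepts a JSON body, so the same calls can be made before and after the load. The function names, the default values passed to `restore_settings`, and the URL and index name in the usage comment are assumptions for illustration. Setting `refresh_interval` to `"-1"` disables refresh entirely; a larger interval such as `"10s"` is a milder alternative.

```python
import json
import urllib.request


def bulk_load_settings():
    """Settings to apply before a large load: no replicas, refresh disabled."""
    return {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}


def restore_settings(replicas=1, refresh_interval="1s"):
    """Settings to apply once the load has completed (values are assumptions)."""
    return {
        "index": {
            "number_of_replicas": replicas,
            "refresh_interval": refresh_interval,
        }
    }


def put_index_settings(base_url, index, settings):
    """PUT the settings to <base_url>/<index>/_settings and return the response."""
    req = urllib.request.Request(
        f"{base_url.rstrip('/')}/{index}/_settings",
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Hypothetical usage against a local cluster:
#   put_index_settings("http://localhost:9200", "records", bulk_load_settings())
#   ... run the bulk load ...
#   put_index_settings("http://localhost:9200", "records", restore_settings())
```

Restoring the settings in a `finally` block around the load would keep the index from being left without replicas if the load fails partway.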

linlma · July 11, 2015, 4:40am

Thanks for the advice, Christian!

linlma · July 11, 2015, 4:42am

BTW, Christian, it seems async replication is deprecated? Thanks.
