Best way to bulk insert?


#1

Hello Elastic experts,

I am wondering what the best way is to insert a large number of records into an index.

thanks in advance,
Lin


(David Pilato) #2

Use bulk indeed.

If it's the first big initialization of your index, you could set number of replicas to 0 and restore it to 1 when the injection is over.

My 2 cents.


#3

Thanks David,

Are there any bulk insert samples? I am new to this concept and would like to follow what an expert like you recommends.

BTW, what is the benefit of "you could set number of replicas to 0 and restore it to 1 when the injection is over"?


(Junheng Gong) #4

Use bulk requests and send them in parallel, and increase the server-side bulk thread pool and queue size. Disable refresh and replicas during the load.
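
For readers who want a concrete picture of "bulk in parallel", here is a minimal sketch using only the Python standard library. It splits records into batches, builds the newline-delimited JSON body that the `_bulk` endpoint expects, and hands the batches to a thread pool. The function names, the batch size of 500, the 4 workers, and the `send` callback (whatever function actually POSTs a body to `/_bulk`) are all assumptions made for illustration, not something from this thread.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def chunked(records, size):
    """Yield successive lists of at most `size` records."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def to_bulk_body(batch, index_name):
    """Build a newline-delimited JSON body for the _bulk endpoint."""
    lines = []
    for record in batch:
        lines.append(json.dumps({"index": {"_index": index_name}}))  # action line
        lines.append(json.dumps(record))  # document line
    return "\n".join(lines) + "\n"  # the body must end with a newline


def send_in_parallel(records, index_name, send, batch_size=500, workers=4):
    """Send batches concurrently.

    `send` is a caller-supplied (hypothetical) function that posts one
    body to /_bulk and returns whatever the caller wants to collect.
    """
    bodies = (to_bulk_body(b, index_name) for b in chunked(records, batch_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Results come back in the same order as the batches.
        return list(pool.map(send, bodies))
```

The right batch size and worker count depend on document size and cluster capacity, so treat the defaults above as starting points to tune.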


#5

Thanks Junheng_Gong for the details,

I am a new Elasticsearch user. Are there any reference samples for the tactics you mentioned?

regards,
Lin


(Srinath C) #6

Hi @linlma,

Here is what worked for me:
1. Use the Java bulk import API
2. Use async replication
3. Increase index.refresh_interval to something like 10s

These changes helped our imports a great deal.
You can easily find links to them in the Elasticsearch documentation.

Regards,
Srinath.


#7

Thanks Srinath_C,

Is only Java supported? I mainly use Python. :smile:

regards,
Lin


(Christian Dahlqvist) #8

All language clients should support bulk operations, including the official Python client, so you do not need to use Java. Here is also an example of how it is used.
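
As a rough illustration of what this can look like with the official Python client, here is a minimal sketch built around `elasticsearch.helpers.bulk`. The index name, the URL `http://localhost:9200`, and the function names `generate_actions` and `bulk_load` are assumptions for the example; check the client documentation for the version you have installed.

```python
def generate_actions(records, index_name):
    """Turn plain dicts into the action dicts that helpers.bulk consumes."""
    for record in records:
        # No "_id" is given, so Elasticsearch generates one per document.
        yield {"_index": index_name, "_source": record}


def bulk_load(records, index_name, url="http://localhost:9200"):
    """Index all records with the bulk helper (URL is an assumed default)."""
    # Imported here so generate_actions works without the client installed.
    from elasticsearch import Elasticsearch, helpers

    client = Elasticsearch(url)
    # helpers.bulk groups the actions into _bulk requests and returns a
    # tuple: (number of successful actions, list of errors).
    success, errors = helpers.bulk(client, generate_actions(records, index_name))
    return success, errors
```

Calling `bulk_load([{"title": "a"}, {"title": "b"}], "demo")` against a running cluster should index two documents. The helper accepts any iterable of actions, so a generator keeps memory use low for large inputs.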


#9

Thanks Christian,

This is exactly what I am looking for. Do you know if there is an API to set up the async replication model and increase index.refresh_interval to something like 10s? I am not sure whether this kind of configuration can be done from the client side.

regards,
Lin


(Christian Dahlqvist) #11

A list of the index settings that can be changed through the APIs can be found here. This shows how you can change the refresh interval for bulk indexing and should be supported through all language clients. If you are performing e.g. a bulk load and want to maximise performance, an option might also be to turn the number of replicas down to 0 during indexing and then increase it once the loading has completed. This reduces the amount of network traffic and can speed up loading. The drawback is naturally that you only have a single copy of the data during the indexing which makes it less resilient.
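
To make the settings change concrete, here is a minimal sketch that builds the `PUT /<index>/_settings` requests with the Python standard library: one body for the loading phase and one to restore normal operation afterwards. The function names and the base URL are assumptions for illustration. The `10s` refresh interval and the 0 and 1 replica counts come from the suggestions in this thread, and `1s` is the Elasticsearch default refresh interval.

```python
import json
import urllib.request


def loading_settings(refresh_interval="10s"):
    """Settings for the bulk load: slower refresh, no replicas.

    Passing "-1" as refresh_interval disables refresh entirely.
    """
    return {"index": {"refresh_interval": refresh_interval, "number_of_replicas": 0}}


def restored_settings(replicas=1, refresh_interval="1s"):
    """Settings to apply once the load has finished."""
    return {"index": {"refresh_interval": refresh_interval, "number_of_replicas": replicas}}


def build_settings_request(base_url, index_name, settings):
    """Build (but do not send) a PUT request to the index settings endpoint."""
    return urllib.request.Request(
        f"{base_url}/{index_name}/_settings",
        data=json.dumps(settings).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def apply_settings(base_url, index_name, settings):
    """Send the request and return the parsed JSON response."""
    request = build_settings_request(base_url, index_name, settings)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A typical flow would be `apply_settings(url, "demo", loading_settings())`, then the bulk load, then `apply_settings(url, "demo", restored_settings())`. Restoring the replica count triggers replica allocation, so expect some extra network and disk activity at that point.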


#12

Thanks for the advice, Christian!


#13

BTW, Christian, it seems async replication is deprecated. Is that right? Thanks.

