What is the best way for huge bulk file indexing?


(ko526so) #1

I frequently have to index huge volumes of data for research purposes.
One of my recent tasks was indexing 60,000,000 docs. Fortunately, the
docs are very small, so the total size of the bulk index file for 60 M
docs is only 11 GB.

With Solr I used the following command to avoid memory errors and get
high performance, and it worked well.

curl http://localhost:8080/example/update -F stream.file=/tmp/artists.xml

Is there a similar command for ES?

Thanks always.


(Shay Banon) #2

You need to chunk it yourself into bulk indexing requests.
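
A minimal sketch of what such chunking could look like, in Python. It assumes the 11 GB file is already in bulk format (one action line followed by one source line per doc, no blank lines) and that a node listens on http://localhost:9200; the function names, the URL, the chunk size, and the Content-Type header (needed by recent ES versions) are illustrative assumptions, not something stated in this thread.

```python
import itertools
import urllib.request
from typing import Iterable, Iterator


def iter_bulk_chunks(lines: Iterable[str], docs_per_chunk: int = 500) -> Iterator[str]:
    """Yield request bodies holding up to docs_per_chunk action/source pairs.

    Assumes every action is followed by exactly one source line
    (delete actions, which have no source line, would break the pairing).
    """
    it = iter(lines)
    while True:
        # Two lines per doc: the action line and the source line.
        batch = list(itertools.islice(it, 2 * docs_per_chunk))
        if not batch:
            return
        # Every line of a bulk body, including the last, must end with a newline.
        yield "".join(l if l.endswith("\n") else l + "\n" for l in batch)


def post_bulk(body: str, url: str = "http://localhost:9200/_bulk") -> bytes:
    """POST one chunk to the bulk endpoint and return the raw response."""
    req = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def index_file(path: str, docs_per_chunk: int = 500) -> None:
    """Stream the file line by line so it never has to fit in memory."""
    with open(path, encoding="utf-8") as f:
        for body in iter_bulk_chunks(f, docs_per_chunk):
            post_bulk(body)
```

Calling `index_file("/tmp/artists.bulk")` (a hypothetical path) would send the file 500 docs at a time. The sketch returns the bulk response unparsed; real code should inspect it, since the response reports results per item.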

On Tue, Dec 6, 2011 at 3:12 PM, ko526so kono.kim@gmail.com wrote:

Is there any similar command with ES like the above?


(Colin Surprenant) #3

For this I wrote a multithreaded writer which reads a file, bundles n
(usually 500) documents into a chunk, and queues the chunks. Writer
threads pick up the chunks and bulk index them over HTTP, round-robin
across all my cluster nodes.
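
Since the original Ruby code is not public, here is only a rough Python sketch of the design described above: a reader pushes chunks onto a bounded queue, and writer threads pull them off and send each one to the next node in round-robin order. All names (`run_bulk_writers`, `http_send`, the node URLs), the thread count, and the queue size are assumptions made for illustration.

```python
import itertools
import queue
import threading
import urllib.request
from typing import Callable, Iterable, List, Sequence

_DONE = object()  # sentinel telling a writer thread to stop


def http_send(node: str, chunk: str) -> None:
    """Hypothetical sender: POST one bulk body to the given node's base URL."""
    req = urllib.request.Request(
        node + "/_bulk",
        data=chunk.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()


def run_bulk_writers(
    chunks: Iterable[str],
    nodes: Sequence[str],
    send: Callable[[str, str], None] = http_send,
    num_threads: int = 4,
    max_queued: int = 8,
) -> List[Exception]:
    """Send every chunk via writer threads; return the exceptions raised."""
    q: queue.Queue = queue.Queue(maxsize=max_queued)
    node_cycle = itertools.cycle(nodes)
    cycle_lock = threading.Lock()  # the shared iterator is advanced by all threads
    errors: List[Exception] = []

    def worker() -> None:
        while True:
            chunk = q.get()
            if chunk is _DONE:
                return
            with cycle_lock:
                node = next(node_cycle)  # round-robin over the cluster nodes
            try:
                send(node, chunk)
            except Exception as exc:  # keep the thread alive, report later
                errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for chunk in chunks:
        q.put(chunk)  # blocks while the queue is full, bounding memory use
    for _ in threads:
        q.put(_DONE)
    for t in threads:
        t.join()
    return errors
```

The bounded queue is the point of the design: the reader cannot run far ahead of the writers, so an 11 GB input never piles up in memory. Failed chunks are only collected here; retrying them is left out of the sketch.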

Now, there's a lot of tweaking that can be done to optimize
performance; see this thread for some guidelines:
https://groups.google.com/a/elasticsearch.com/group/users/msg/06d62ea3ceb4db30

Colin



(David Pilato) #4

Is your multithreaded writer open sourced somewhere?

Thanks,
David.

--
David Pilato
http://dev.david.pilato.fr/
Twitter : @dadoonet


(Colin Surprenant) #5

No, it's not, sorry... this code is just part of another project. It
wouldn't be a bad idea to make this piece generic and open source it.
It's in Ruby. If you're still interested, I'll see what I can do.

Colin



(David Pilato) #6

Thanks Colin. I thought it was in Java. I don't know Ruby at the moment :(

I don't need it right now. I was just curious about how you implemented it.

Cheers
David



(Karussell) #7

Bulk indexing from Java is not complicated at all:

http://www.elasticsearch.org/guide/reference/java-api/bulk.html

Regards,
Peter.


(Karussell) #8

Oops, OK. It's the multithreaded reader that is interesting :) ...
sorry.


