What is the best way for huge bulk file indexing?

ko526so · December 6, 2011, 1:12pm

I frequently have to index huge volumes of data for research purposes.
One recent task involved 60,000,000 docs. Fortunately, the docs are very
small, so the bulk index file for all 60 M docs is only 11 GB.

With Solr I used the following command, which avoided memory errors and
gave high performance. It worked well.

curl http://localhost:8080/example/update -F stream.file=/tmp/artists.xml

Is there a similar command for ES?

Thanks always.

kimchy · December 6, 2011, 8:46pm

You need to chunk it yourself into bulk indexing requests.
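
A minimal sketch of what "chunking it yourself" could look like, in Python. It assumes the file is already in bulk format with exactly two lines per document (an action line followed by a source line); the function name `bulk_chunks` and the default of 500 docs per request are illustrative choices, not anything ES prescribes. Each yielded string would then be sent as the body of one bulk request.

```python
from itertools import islice
from typing import Iterable, Iterator


def bulk_chunks(lines: Iterable[str], docs_per_chunk: int = 500) -> Iterator[str]:
    """Yield bulk request bodies holding at most docs_per_chunk documents.

    Assumption: every document occupies exactly two lines (action line,
    then source line), as in an index-only bulk file.
    """
    it = iter(lines)  # works for open files and for plain lists alike
    lines_per_chunk = 2 * docs_per_chunk
    while True:
        chunk = list(islice(it, lines_per_chunk))
        if not chunk:
            return
        body = "".join(chunk)
        # A bulk body has to end with a newline.
        if not body.endswith("\n"):
            body += "\n"
        yield body
```

Because the input is consumed lazily, passing an open file handle keeps only one chunk in memory at a time, which matters for an 11 GB file.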

colinsurprenant · December 6, 2011, 10:01pm

For this I wrote a multithreaded writer that reads a file and bundles n
(usually 500) documents into chunks. The chunks are queued and picked up
by writer threads, which bulk index over HTTP, round-robin across all my
cluster nodes.
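
The original Ruby code was not posted, so the following is only a rough Python sketch of the same idea: a bounded queue, a few writer threads, and round-robin assignment of chunks to nodes. The names `index_chunks` and `post_bulk`, the worker count, and the `application/x-ndjson` content type (what recent ES versions expect) are assumptions made for illustration.

```python
import queue
import threading
import urllib.request
from itertools import cycle
from typing import Callable, Iterable, List, Sequence


def post_bulk(node_url: str, body: str) -> None:
    """POST one bulk body to a node's _bulk endpoint (stdlib only)."""
    req = urllib.request.Request(
        node_url.rstrip("/") + "/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},  # assumed header
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()  # a real loader would inspect this for per-item errors


def index_chunks(
    chunks: Iterable[str],
    nodes: Sequence[str],
    num_workers: int = 4,
    send: Callable[[str, str], None] = post_bulk,
) -> List[Exception]:
    """Send each chunk to a node chosen round-robin; return any errors."""
    work: queue.Queue = queue.Queue(maxsize=2 * num_workers)  # bounded on purpose
    errors: List[Exception] = []

    def worker() -> None:
        while True:
            item = work.get()
            if item is None:  # sentinel: no more work
                return
            node, body = item
            try:
                send(node, body)
            except Exception as exc:  # keep the thread alive, record the failure
                errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    # The producer pairs each chunk with the next node in the cycle.
    for node, body in zip(cycle(nodes), chunks):
        work.put((node, body))
    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
    return errors
```

The bounded queue keeps the file reader from running far ahead of the writers, so memory use stays flat however large the input is. Passing a different `send` function makes it easy to try the plumbing without a cluster.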

Now, there's a lot of tweaking that can be done to optimize
performance; see this thread for some guidelines:
https://groups.google.com/a/elasticsearch.com/group/users/msg/06d62ea3ceb4db30

Colin

dadoonet · December 6, 2011, 10:20pm

Is it open-sourced somewhere?

Thanks,
David.

--
David Pilato
http://dev.david.pilato.fr/
Twitter : @dadoonet

colinsurprenant · December 6, 2011, 10:46pm

No, it's not, sorry... this code is just part of another project. It
wouldn't be a bad idea to make this piece generic and open-source it.
It's in Ruby. If you're still interested, I'll see what I can do.

Colin

dadoonet · December 6, 2011, 10:53pm

Thanks Colin. I thought it was in Java. I don't know Ruby yet.

I don't need it right now. I was just curious about how you implemented it.

Cheers
David

Karussell1 · December 8, 2011, 12:00am

Thanks Colin. I thought it was in Java. I don't know Ruby at this time

Not complicated at all

Regards,
Peter.

Karussell1 · December 8, 2011, 12:01am

Oops, OK. It's the multithreaded reader that is interesting...
Sorry.
