Yet another day playing with this awesome product.
I have a question regarding bulk indexing.
Right now, we have three nodes, running with 22 GB of RAM devoted to ES.
Our docs are big, with between 300 and 500 fields (let's say an average of
400), several nested structures, and many analyzed strings.
We are storing the _source, but are not indexing the _all field.
We have an indexing batch job in Java, in which we use, of course, the bulk
API to improve performance.
Right now, our bulks, due to factors outside of our control, may vary in
size between 2000 and 5000 of these docs per bulk.
Of course, the refresh_interval is disabled (-1).
Our performance lies somewhere between 2 and 4 minutes per bulk.
I have read a lot about indexing speeds of several thousand docs per second,
and we are pretty far from there.
So, is there something we are doing wrong? Are those times due to the
complexity of our docs?
Thanks in advance!
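For reference, translating those bulk times into docs per second (just back-of-envelope arithmetic from the numbers above, with a hypothetical helper):

```java
// Back-of-envelope throughput from the bulk sizes and times quoted above.
public class BulkThroughput {
    // Convert "docs per bulk" and "minutes per bulk" into docs per second.
    public static double docsPerSecond(int docsPerBulk, double minutesPerBulk) {
        return docsPerBulk / (minutesPerBulk * 60.0);
    }

    public static void main(String[] args) {
        // Worst case: smallest bulk (2000 docs) at the slowest time (4 min).
        System.out.printf("worst: %.1f docs/s%n", docsPerSecond(2000, 4.0));
        // Best case: largest bulk (5000 docs) at the fastest time (2 min).
        System.out.printf("best:  %.1f docs/s%n", docsPerSecond(5000, 2.0));
    }
}
```

That works out to roughly 8 to 42 docs/s, which indeed is far from the "thousands per second" figures quoted around the web.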
Well, the code is pretty straightforward.
It's a loop that reads a JSON doc from a NoSQL database, fiddles with it a
bit, and then puts it in a bulk.
After the loop, the bulk is executed - rinse and repeat.
The number of docs in the bulk varies because each doc has to be indexed
2-5 times, each time with altered fields, and into 2 different
indices.
Those 2000-5000 docs always come from 1000 original docs.
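For the record, the loop described above looks roughly like this sketch. The expansion factor and the bulk executor are hypothetical stand-ins (the real code would call the NoSQL reader and the Elasticsearch client), but it shows the shape: expand each original doc into its variants, accumulate them, and flush a bulk whenever a size cap is reached instead of letting the bulk grow unbounded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the indexing loop: each original doc expands into several
// variants, which are accumulated and flushed in capped bulks.
public class BulkLoop {
    // Hypothetical stand-in: each original doc becomes 2 altered copies
    // (the real job produces 2-5 variants across 2 indices).
    static List<String> expand(String doc) {
        List<String> variants = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            variants.add(doc + "#v" + i);
        }
        return variants;
    }

    // Accumulate expanded docs; hand each full bulk to the executor
    // (in the real code, the Elasticsearch bulk API call goes there).
    static int runBatch(List<String> originals, int maxBulkSize,
                        Consumer<List<String>> executeBulk) {
        List<String> bulk = new ArrayList<>();
        int bulksSent = 0;
        for (String doc : originals) {
            bulk.addAll(expand(doc));
            if (bulk.size() >= maxBulkSize) { // cap reached: flush now
                executeBulk.accept(bulk);
                bulk = new ArrayList<>();
                bulksSent++;
            }
        }
        if (!bulk.isEmpty()) {                // flush the remainder
            executeBulk.accept(bulk);
            bulksSent++;
        }
        return bulksSent;
    }
}
```

Splitting the 2000-5000 expanded docs into smaller capped sub-bulks like this also makes it easier to experiment with bulk size, since very large bulks are not necessarily faster.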