Hey everyone,
I've been trying to maximise my indexing rate. I'm indexing around a million documents, using 4 threads. Each thread is sending 2500 documents per bulk request, so that's 10,000 in flight at a time.
I've been playing around with these parameters and found that if I push them higher and higher, I eventually start getting data loss. The index ends up with fewer than 1,000,000 documents every time. There are no errors in the logs, so I'm not sure what's causing this.
Taking these parameters down a notch fixes the problem.
Has anyone seen this issue before? Is there anything that can be done about it?
Turns out it was because the bulk thread pool queue size was too small: any new requests were being rejected.
Is it common to set threadpool.bulk.queue_size to something like 1000?
On Tuesday, 7 April 2015 11:10:33 UTC+1, Jörg Prante wrote:
Do you evaluate the bulk request responses?
Jörg
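Rejections don't show up in the server logs, only in the per-item results of the bulk response, which is why checking the responses matters. A minimal sketch of doing that with the Python client (elasticsearch-py); the host, index name and documents below are placeholders, not from this thread:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])  # placeholder host

    # Build one bulk request of 2500 index actions (action line + source line pairs).
    actions = []
    for i in range(2500):
        actions.append({"index": {"_index": "my_index", "_type": "my_type", "_id": i}})
        actions.append({"field": "value %d" % i})  # placeholder document

    resp = es.bulk(body=actions)

    # "errors" is True if any item failed; inspect the items to see why.
    if resp["errors"]:
        for item in resp["items"]:
            # each item is keyed by its action type, e.g. {"index": {...}}
            for action, result in item.items():
                if result.get("error"):
                    # A full bulk queue typically surfaces here as a rejection
                    # (e.g. an EsRejectedExecutionException) rather than in the logs.
                    print(action, result["status"], result["error"])

Silently dropping documents when "errors" is true would explain an index ending up short of 1,000,000 without anything in the logs.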
It would mean that you are going to accumulate up to 1000 requests of 2500 docs each in memory. That could be a lot, and you need to monitor it: that's a lot of objects waiting to be GCed at some point.
If your bulk requests are being rejected, why not slow down the injection rate instead of filling up memory?
You could also think about setting replicas to 0 before the bulk load and reactivating them (back to 1) after injection.
Having SSD drives can also help, but maybe you already have that?
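The replica trick is just an index settings update before and after the load. A rough sketch with the Python client, assuming an existing index; the index name and the run_bulk_load helper are placeholders:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])  # placeholder host

    # Drop replicas before the bulk load so only primaries are written to.
    es.indices.put_settings(index="my_index",
                            body={"index": {"number_of_replicas": 0}})

    try:
        run_bulk_load(es)  # placeholder for your 4-thread bulk loop
    finally:
        # Re-enable replication once the load is done; the replicas are
        # then populated in the background from the finished primaries.
        es.indices.put_settings(index="my_index",
                                body={"index": {"number_of_replicas": 1}})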
With the JDBC plugin, you should slightly increase the number of actions per bulk request ("maxbulkactions") in order to keep your concurrent bulk requests low enough to get handled by ES.
The ES bulk thread pool default settings are OK. Please avoid changing them.
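The same idea applies outside the JDBC plugin: send fewer, larger bulk requests so there are never more in flight than the bulk pool can absorb. A rough sketch using the elasticsearch-py bulk helper, which chunks a document stream into bulk requests of a given size from a single client thread; all names and values here are illustrative, not from the thread:

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(["localhost:9200"])  # placeholder host

    def generate_actions():
        # Placeholder document stream; yield one action per document.
        for i in range(1000000):
            yield {"_index": "my_index", "_type": "my_type", "_id": i,
                   "_source": {"field": "value %d" % i}}

    # 5000 actions per request from one thread: the same million documents,
    # but far fewer concurrent requests for the bulk thread pool to queue.
    success, failed = helpers.bulk(es, generate_actions(), chunk_size=5000,
                                   raise_on_error=False, stats_only=True)
    print("indexed: %d, failed: %d" % (success, failed))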
@David, that makes sense. We have SSDs, and replicas are already set to 0 while bulk indexing.
@Jörg, we haven't changed "threadpool.bulk.size" because, according to the docs, that's directly tied to the number of processors available. However, "threadpool.bulk.queue_size" has been modified. I'm slowly tuning it down to find a sweet spot, but the default seems a bit too low.
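One way to find that sweet spot is to watch the bulk pool's queue depth and rejected counter during a load, rather than waiting for documents to go missing. A rough sketch via the nodes stats API with the Python client (host is a placeholder):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])  # placeholder host

    # Per-node thread pool stats include queue depth and rejected counts.
    stats = es.nodes.stats(metric="thread_pool")
    for node_id, node in stats["nodes"].items():
        bulk = node["thread_pool"]["bulk"]
        print(node.get("name"), "queue:", bulk["queue"],
              "rejected:", bulk["rejected"])

If "rejected" climbs during the load, either the queue is too small for the injection rate or the injection rate is too high for the queue.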