Bulk index request dataloss


#1

Hey everyone,

I've been trying to maximise my indexing rate. I'm indexing around a
million documents, using 4 threads. Each thread is indexing at 2500
documents per bulk request, so that's 10000 at a time.

I've been playing around with these parameters and found that if I go
higher and higher, I eventually start getting data loss. The index ends up
with fewer than 1,000,000 documents every time. There are no errors in the
logs, so I'm not sure what's causing this.

Taking these parameters down a notch fixes this problem.

Has anyone seen this issue before?
Is there anything that can be done about it?

Thank you

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/181a75a6-7a12-421e-9757-a82876b24a15%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Jörg Prante) #2

Do you evaluate the bulk request responses?

Jörg

On Tue, Apr 7, 2015 at 11:16 AM, mzrth_7810 afrazmamoon@gmail.com wrote:



#3

It turns out the bulk thread pool queue size was too small, so any
new requests were being rejected.

Is it common to set threadpool.bulk.queue_size to something like 1000?
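For anyone hitting the same thing: rejected actions are reported per item in the bulk response rather than in the server logs, which is why checking the response matters. Below is a minimal sketch of such a check, assuming the response has already been parsed into a dict with the usual `errors` flag and `items` list. The helper name `failed_items`, the sample response, and the error text are made up for illustration.

```python
def failed_items(bulk_response):
    """Return (action, doc_id, status, error) for every item that did not succeed."""
    # Fast path: the top-level flag says nothing failed.
    if bulk_response.get("errors") is False:
        return []
    failures = []
    for item in bulk_response.get("items", []):
        # Each item is a single-key dict such as {"index": {...}}.
        for action, result in item.items():
            status = result.get("status", 0)
            if status >= 300 or "error" in result:
                failures.append(
                    (action, result.get("_id"), status, result.get("error"))
                )
    return failures


# Hypothetical response: document 1 was indexed, document 2 was rejected.
sample_response = {
    "took": 30,
    "errors": True,
    "items": [
        {"index": {"_index": "docs", "_id": "1", "status": 201}},
        {"index": {"_index": "docs", "_id": "2", "status": 429,
                   "error": "rejected execution (illustrative text)"}},
    ],
}

for action, doc_id, status, error in failed_items(sample_response):
    print(f"{action} of document {doc_id} failed with status {status}: {error}")
```

Anything returned by the helper was not indexed and has to be sent again, otherwise the final document count comes up short.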

On Tuesday, 7 April 2015 11:10:33 UTC+1, Jörg Prante wrote:

Do you evaluate the bulk request responses?

Jörg



(David Pilato) #4

It would mean that you are going to accumulate up to 1000 requests of 2500 docs at a time in memory.
That could be a lot. You need to monitor that. That’s a lot of objects that might be GCed at some point.
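To put a rough number on that, here is a small back-of-the-envelope calculation. The 1 KiB average document size is purely an assumption for illustration, and the figure covers raw source bytes only, not the JVM object overhead.

```python
def queued_documents(queue_size, docs_per_request):
    """Worst-case number of documents waiting in a full bulk queue."""
    return queue_size * docs_per_request


def queued_bytes(queue_size, docs_per_request, avg_doc_bytes):
    """Worst-case raw payload held in the queue, ignoring object overhead."""
    return queued_documents(queue_size, docs_per_request) * avg_doc_bytes


docs = queued_documents(1000, 2500)
raw = queued_bytes(1000, 2500, avg_doc_bytes=1024)  # assumed 1 KiB per document
print(f"{docs:,} documents, about {raw / 2**30:.1f} GiB of raw source")
```

With a queue of 1000 and 2500 documents per request, that is up to 2,500,000 documents waiting in memory.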

If your bulk request is rejected, why not try slowing down the injection rate instead of filling up the memory?
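One common way to slow down on rejection is to retry only the rejected actions with an exponential backoff. The sketch below assumes a hypothetical `send` callable that submits a list of actions and returns the ones that were rejected; it is not part of any Elasticsearch client, and the delays are arbitrary starting values.

```python
import random
import time


def send_with_backoff(send, actions, max_attempts=5, base_delay=0.5,
                      sleep=time.sleep):
    """Resend rejected actions with exponential backoff.

    `send` is a hypothetical callable: it takes a list of actions and
    returns the sublist that was rejected. Returns whatever is still
    rejected after `max_attempts` tries (an empty list on success).
    """
    pending = actions
    for attempt in range(max_attempts):
        pending = send(pending)
        if not pending:
            return []
        if attempt < max_attempts - 1:
            # Double the wait each time and add jitter so that several
            # indexing threads do not all retry at the same moment.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    return pending
```

The caller should treat a non-empty return value as a failure and log or persist those actions rather than drop them.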

You could also think about setting replicas to 0 before the bulk load and then setting them back to 1 after injection.
Having SSD drives can also help, but maybe you already have them?
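The replica change goes through the update index settings API. Here is a small sketch that only builds the request, assuming a node at `http://localhost:9200` and an index called `myindex` (both placeholders); the helper name `replica_request` is made up.

```python
import json
import urllib.request


def replica_request(base_url, index, replicas):
    """Build (but do not send) a PUT request that sets number_of_replicas."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Placeholders: adjust the host and index name to your cluster.
before_bulk = replica_request("http://localhost:9200", "myindex", 0)
after_bulk = replica_request("http://localhost:9200", "myindex", 1)

# To actually apply a change: urllib.request.urlopen(before_bulk)
print(before_bulk.get_method(), before_bulk.full_url, before_bulk.data.decode())
```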

My 2 cents

--
David Pilato - Developer | Evangelist
elastic.co
@dadoonet https://twitter.com/dadoonet | @elasticsearchfr https://twitter.com/elasticsearchfr | @scrutmydocs https://twitter.com/scrutmydocs

On 23 Apr 2015, at 12:20, mzrth_7810 afrazmamoon@gmail.com wrote:

Turns out it was because the bulk thread pool queue size was too small, any new requests were being rejected.

Is it common to set threadpool.bulk.queue_size to something like 1000 ?



(Jörg Prante) #5

With the JDBC plugin, you should slightly increase the requests per bulk
request ("maxbulkactions") in order to keep your concurrent bulk requests
low enough to get handled by ES.

The ES bulk thread pool default setting is OK. Please avoid changing it.
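The trade-off can be seen with a little arithmetic: for a fixed number of documents, more actions per bulk request means fewer requests competing for the bulk queue. The sketch below is a generic batching helper, not the JDBC plugin's code, and the names `chunked` and `request_count` are made up.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def request_count(total_docs, docs_per_request):
    """Number of bulk requests needed (ceiling division)."""
    return -(-total_docs // docs_per_request)


for size in (2500, 5000, 10000):
    print(f"{size} actions per request -> {request_count(1_000_000, size)} requests")
```

For the one million documents in this thread, 2500 actions per request gives 400 requests, while 5000 gives 200.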

Jörg



#6

@David, that makes sense. We have SSDs and replicas are already set to 0
while bulk indexing.

@Jörg, we haven't changed "threadpool.bulk.size" because, according to
the docs, that's directly related to the number of processors available.
However, "threadpool.bulk.queue_size" has been modified. I'm slowly tuning
it down to find a sweet spot, but the default seems a bit too low.
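While tuning, it helps to watch the queue depth and rejection counters that the node stats API reports for each thread pool. The sketch below summarises an already-parsed node stats response; the sample data is invented, the helper name `bulk_pool_summary` is made up, and the pool name `bulk` matches the settings discussed here (other versions may name it differently).

```python
def bulk_pool_summary(node_stats, pool="bulk"):
    """Map node name -> (queue, rejected) for the given thread pool."""
    summary = {}
    for node_id, node in node_stats.get("nodes", {}).items():
        stats = node.get("thread_pool", {}).get(pool, {})
        name = node.get("name", node_id)
        summary[name] = (stats.get("queue", 0), stats.get("rejected", 0))
    return summary


# Invented sample in the shape of a node stats response.
sample_stats = {
    "nodes": {
        "abc123": {
            "name": "node-1",
            "thread_pool": {"bulk": {"threads": 4, "queue": 37, "active": 4,
                                     "rejected": 112}},
        },
        "def456": {
            "name": "node-2",
            "thread_pool": {"bulk": {"threads": 4, "queue": 0, "active": 1,
                                     "rejected": 0}},
        },
    }
}

for name, (queue, rejected) in sorted(bulk_pool_summary(sample_stats).items()):
    print(f"{name}: queue={queue} rejected={rejected}")
```

A rejected count that keeps climbing during a load means the client is sending faster than the node can index, whatever the queue size is set to.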

On Thursday, 23 April 2015 12:16:03 UTC+1, Jörg Prante wrote:

With the JDBC plugin, you should slightly increase the requests per bulk
request ("maxbulkactions") in order to keep your concurrent bulk requests
low enough to get handled by ES.

The ES bulk thread pool default setting is ok. Please avoid a change.

Jörg



(system) closed #7