Hey everyone,
I've been trying to maximise my indexing rate. I'm indexing around a million documents, using 4 threads. Each thread is sending 2500 documents per bulk request, so that's 10,000 in flight at a time.
I've been playing around with these parameters and found that if I push them higher and higher, I eventually start getting data loss. The index ends up with fewer than 1,000,000 documents every time. There are no errors in the logs, so I'm not sure what's causing this.
Taking these parameters down a notch fixes the problem.
Has anyone seen this issue before? Is there anything that can be done about it?
Turns out it was because the bulk thread pool queue size was too small: any new requests were being rejected.
Is it common to set threadpool.bulk.queue_size to something like 1000?
On Tuesday, 7 April 2015 11:10:33 UTC+1, Jörg Prante wrote:
Do you evaluate the bulk request responses?
Jörg
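Rejections don't show up in the server logs, only in the per-item results of the bulk response, which is why checking the responses matters. A minimal sketch of doing that with the Python client (elasticsearch-py); the host, index name and documents below are placeholders, not from this thread:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])  # placeholder host

    # Build one bulk request of 2500 index actions (action line + source line pairs).
    actions = []
    for i in range(2500):
        actions.append({"index": {"_index": "my_index", "_type": "my_type", "_id": i}})
        actions.append({"field": "value %d" % i})  # placeholder document

    resp = es.bulk(body=actions)

    # "errors" is True if any item failed; inspect the items to see why.
    if resp["errors"]:
        for item in resp["items"]:
            # each item is keyed by its action type, e.g. {"index": {...}}
            for action, result in item.items():
                if result.get("error"):
                    # A full bulk queue typically surfaces here as a rejection
                    # (e.g. an EsRejectedExecutionException) rather than in the logs.
                    print(action, result["status"], result["error"])

Silently dropping documents when "errors" is true would explain an index ending up short of 1,000,000 without anything in the logs.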
It would mean that you are going to accumulate up to 1000 requests of 2500 docs each in memory. That could be a lot, and you need to monitor it: that's a lot of objects waiting to be GCed at some point.
If your bulk requests are being rejected, why not slow down the injection rate instead of filling up memory?
You could also think about setting replicas to 0 before the bulk load and reactivating them (back to 1) after injection.
Having SSD drives can also help, but maybe you already have that?
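The replica trick is just an index settings update before and after the load. A rough sketch with the Python client, assuming an existing index; the index name and the run_bulk_load helper are placeholders:

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])  # placeholder host

    # Drop replicas before the bulk load so only primaries are written to.
    es.indices.put_settings(index="my_index",
                            body={"index": {"number_of_replicas": 0}})

    try:
        run_bulk_load(es)  # placeholder for your 4-thread bulk loop
    finally:
        # Re-enable replication once the load is done; the replicas are
        # then populated in the background from the finished primaries.
        es.indices.put_settings(index="my_index",
                                body={"index": {"number_of_replicas": 1}})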
With the JDBC plugin, you should slightly increase the number of actions per bulk request ("maxbulkactions") in order to keep your concurrent bulk requests low enough to get handled by ES.
The ES bulk thread pool default settings are OK. Please avoid changing them.
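The same idea applies outside the JDBC plugin: send fewer, larger bulk requests so there are never more in flight than the bulk pool can absorb. A rough sketch using the elasticsearch-py bulk helper, which chunks a document stream into bulk requests of a given size from a single client thread; all names and values here are illustrative, not from the thread:

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(["localhost:9200"])  # placeholder host

    def generate_actions():
        # Placeholder document stream; yield one action per document.
        for i in range(1000000):
            yield {"_index": "my_index", "_type": "my_type", "_id": i,
                   "_source": {"field": "value %d" % i}}

    # 5000 actions per request from one thread: the same million documents,
    # but far fewer concurrent requests for the bulk thread pool to queue.
    success, failed = helpers.bulk(es, generate_actions(), chunk_size=5000,
                                   raise_on_error=False, stats_only=True)
    print("indexed: %d, failed: %d" % (success, failed))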
@David, that makes sense. We have SSDs, and replicas are already set to 0 while bulk indexing.
@Jörg, we haven't changed "threadpool.bulk.size" because, according to the docs, that's directly tied to the number of processors available. However, "threadpool.bulk.queue_size" has been modified. I'm slowly tuning it down to find a sweet spot, but the default seems a bit too low.
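One way to find that sweet spot is to watch the bulk pool's queue depth and rejected counter during a load, rather than waiting for documents to go missing. A rough sketch via the nodes stats API with the Python client (host is a placeholder):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])  # placeholder host

    # Per-node thread pool stats include queue depth and rejected counts.
    stats = es.nodes.stats(metric="thread_pool")
    for node_id, node in stats["nodes"].items():
        bulk = node["thread_pool"]["bulk"]
        print(node.get("name"), "queue:", bulk["queue"],
              "rejected:", bulk["rejected"])

If "rejected" climbs during the load, either the queue is too small for the injection rate or the injection rate is too high for the queue.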