Bulk indexing slows down as the amount of data increases


(Eric Lu) #1

Hi, guys
I'm using Elasticsearch to index a large number of documents. Each document
is about 0.5KB.
My Elasticsearch cluster has 5 nodes (all data nodes). Each node runs
Oracle Java 1.7.0_13 and has 16GB RAM, with 8GB allocated to the JVM.
The index has 50 shards and 1 replica.
I set the bulk thread pool to size:30 and queue:1000.
I use one thread to index documents in bulk, with a bulk size of 1000.
In the beginning, the performance is very good: it can index about 10
million documents per hour. But as more documents are indexed, it slows
down. When the cluster had 500 million documents indexed, I noticed that
it took about 12 hours to index 10 million documents.

Is this normal? If not, what bottleneck is throttling it?

Any help?

Regards
Eric
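
To put the figures above on a common scale, here is a small arithmetic sketch in Python. It uses only the numbers from the post (10 million documents, 1 hour versus 12 hours, bulk size 1000); the helper name `docs_per_second` is made up for illustration.

```python
# Figures from the post above; docs_per_second is a hypothetical helper.
DOCS_PER_BATCH = 10_000_000   # documents indexed in each measured period
BULK_SIZE = 1000              # documents per bulk request


def docs_per_second(docs: int, hours: float) -> float:
    """Average indexing rate in documents per second."""
    return docs / (hours * 3600)


early_rate = docs_per_second(DOCS_PER_BATCH, 1)    # at the start of the load
late_rate = docs_per_second(DOCS_PER_BATCH, 12)    # at about 500 million docs
slowdown = early_rate / late_rate

print(f"early: {early_rate:.0f} docs/s ({early_rate / BULK_SIZE:.1f} bulk requests/s)")
print(f"late:  {late_rate:.0f} docs/s ({late_rate / BULK_SIZE:.2f} bulk requests/s)")
print(f"slowdown: {slowdown:.0f}x")
```

That works out to roughly 2,778 docs/s at the start versus about 231 docs/s later, a 12x drop for the same single-threaded client.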

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a381d703-3657-4669-8104-918d82c6c0be%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #2

12 hours is an absurdly long time for indexing 10 million docs.

queue:1000 is much too high for production. For testing it may be OK (it
effectively disables queue rejections), but in production you risk
starving your cluster of resources.

Do you monitor the resource usage of ES, especially the heap? Is GC
starving your cluster? Do you see OOMs?

Do you evaluate the bulk responses for errors? Do you throttle bulk request
concurrency?

Do you set refresh interval to -1?

Hint: if 5 nodes is your maximum, you can also bulk index with 5 shards and
replica level 0; after the bulk load, you can increase the replica level to 1.

Jörg
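
The replica and refresh suggestions above can be sketched as two settings bodies, one applied before the bulk load and one after. This is only an illustration in Python, not something posted in the thread: the function names are hypothetical, and it assumes the bodies are sent to the index settings endpoint (`PUT /<index>/_settings`) using the `index.number_of_replicas` and `index.refresh_interval` settings. The shard count is not included because, as far as I know, it has to be chosen when the index is created.

```python
import json


def bulk_load_settings() -> dict:
    """Settings for the duration of a large bulk load: no replicas, no periodic refresh."""
    return {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}


def post_load_settings(replicas: int = 1, refresh_interval: str = "1s") -> dict:
    """Settings to restore afterwards; the defaults here are assumptions, adjust as needed."""
    return {"index": {"number_of_replicas": replicas, "refresh_interval": refresh_interval}}


# Each body would be sent as the payload of PUT /<index>/_settings.
print(json.dumps(bulk_load_settings()))
print(json.dumps(post_load_settings(refresh_interval="30s")))
```

The second call uses "30s" only because that is the refresh interval mentioned later in this thread.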



(Eric Lu) #3

I observed that GC occurred once every 15 seconds when heap usage was at
75% of the heap size. Is that too frequent? There are no OOMs.

I set the refresh interval to 30s.

I'll try a smaller queue and set replicas to 0.

Thank you.

(In reply to Jörg Prante, Monday, January 13, 2014, 8:42:56 PM UTC+8.)


(Karol Gwaj) #4

Have you tried any of the Elasticsearch health monitoring plugins?
For example, 'ElasticSearch HQ' has a 'Node Diagnostics' option that will
point out the weak points of your cluster and suggest possible solutions
(very useful if you are just starting your adventure with Elasticsearch).
'bigdesk' is also very good for real-time monitoring.

Do you have a parent/child relationship configured on your documents?
It is quite often the cause of high heap usage (and, in consequence, heavy GC'ing).

(In reply to Eric Lu, Monday, January 13, 2014, 12:22:47 PM UTC.)


(Eric Lu) #5

I have set the replica count to 0 and the queue to 50, and it can now index
about 7 to 8 million documents per hour. That's acceptable, though I don't
know which change made the difference.

Thank you all.

(Following up on my message of Monday, January 13, 2014, 9:04:35 PM UTC+8.)


(Jörg Prante) #6

replica = 0 reduces the indexing workload to only the required shards, so no
duplicate indexing occurs. Do not forget to increase the replica level after
the bulk load has completed.

queue = 50 instructs a node to reject bulk requests when more than 50 bulk
requests per node are active. This saves a node from being overloaded.

Jörg
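
With a small queue, the client has to expect rejections and react to them. Below is a rough Python sketch of one way to do that, not something from the thread: it checks each item of a bulk response and resends only the rejected documents after an exponentially growing pause. It assumes the response carries an `items` list in the same order as the request and that a rejected item reports status 429, which is how current Elasticsearch documentation describes it; older releases may signal rejections differently. `index_with_backoff` and the `send` callable are hypothetical names.

```python
import time
from typing import Callable, List


def index_with_backoff(docs: List, send: Callable[[List], dict],
                       max_retries: int = 5, base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> List:
    """Send docs through send(docs) and resend only the rejected ones.

    send is a caller-supplied function (an assumption of this sketch) that
    performs one bulk request and returns the parsed response as a dict.
    Returns the documents that were still rejected after all retries.
    """
    pending = list(docs)
    for attempt in range(max_retries + 1):
        response = send(pending)
        # Each item is keyed by its action type, e.g. {"index": {"status": 201}}.
        statuses = [next(iter(item.values())).get("status")
                    for item in response.get("items", [])]
        # Keep only the documents whose item came back as rejected (429).
        pending = [doc for doc, status in zip(pending, statuses) if status == 429]
        if not pending:
            return []
        if attempt < max_retries:
            # Pause before retrying, doubling the delay each time.
            sleep(base_delay * 2 ** attempt)
    return pending
```

Backing off this way throttles the client when the nodes are saturated, instead of letting a long queue absorb the extra load.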


