Rapidly Degrading Bulk Indexing Performance


#1

Hi All,

We are currently attempting to optimize our configuration for a static
index of roughly 120 million records. In time, this index will probably be
much larger, but for now this is the working set. We've been playing
around with Elasticsearch for several months now, and have made great
progress with performance tuning. However, we still run into issues which
leave us scratching our heads. One such issue is an unexpected indexing
speed drop as the index grows.

We are working on an 11-node cluster. Each node has 8 CPUs and 16G of
memory. Heap size of each JVM is set to min/max of 8G. We have set
vm.swappiness to 0 on all of the systems, as they are being used solely for
Elasticsearch. The Elasticsearch version is 0.90.7. We are focusing on
loading a single index, and it has been initialized with 48 shards, with a
refresh interval of 120 seconds. We're currently using Elasticsearch HQ
for real-time monitoring of the system state, along with Linux utilities
like top, iotop and iftop. Everything appears to be in order.
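
For reference, the index settings described above could be written roughly as follows. This is a minimal sketch: only the shard count and the refresh interval come from the description, and the function name is made up for illustration.

```python
import json

def index_settings(shards=48, refresh_interval="120s"):
    """Build the request body for creating an index with the settings
    described in the post (48 shards, 120 second refresh interval)."""
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "refresh_interval": refresh_interval,
            }
        }
    }

# Print the JSON body that would be sent when creating the index.
print(json.dumps(index_settings(), indent=2))
```

The printed JSON would be sent as the body of the index creation request (a PUT to the index URL); the index name itself is not given in this thread.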

Frequently we have to reindex the entire dataset as we are working in a
development environment and are still determining how best to structure the
dataset. We are indexing via a batch load script that fires off curl
requests of 10,000 records each to the _bulk endpoint. We partition the
entire dataset across three servers and run the batch load script
simultaneously on each one.
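
A minimal sketch of how such 10,000-record batches might be assembled for the _bulk endpoint. The index name `records` and type name `record` are placeholders, not taken from the actual load script, and the script itself is not shown in this thread.

```python
import json

def chunked(records, size=10000):
    """Yield lists of at most `size` records from any iterable."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def bulk_body(batch, index="records", doc_type="record"):
    """Build a newline-delimited _bulk body: one action line followed by
    one source line per document, with a trailing newline.
    `index` and `doc_type` are hypothetical names."""
    lines = []
    for doc in batch:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"
```

Each body can be written to a file and posted with curl using `--data-binary @file`, which keeps the newlines that the bulk format depends on.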

At first, this appears to work great. Initial indexing speeds are roughly
50 million/hour, which would load the entire dataset in a little over 2
hours. However, once the index approaches 20 million records, indexing
performance drops significantly (down to roughly 10 million/hour). As the
index continues to grow, performance continues to degrade, and I have seen
it drop below 1 million records per hour. All in all, it
takes nearly a day to index the entire dataset of 120 million records.
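
The arithmetic behind these figures can be checked with a short calculation. The two-phase model below is a simplification assumed for illustration (full speed for the first 20 million records, then a flat 10 million/hour); the real rate kept falling, so the actual load took longer.

```python
TOTAL_RECORDS = 120_000_000

def flat_hours(total, rate_per_hour):
    """Hours to load `total` records at a constant rate."""
    return total / rate_per_hour

def two_phase_hours(total, fast_until, fast_rate, slow_rate):
    """Hours to load `total` records when the first `fast_until` records
    go at `fast_rate` and the remainder at `slow_rate`."""
    fast_part = min(total, fast_until)
    slow_part = total - fast_part
    return fast_part / fast_rate + slow_part / slow_rate

best_case = flat_hours(TOTAL_RECORDS, 50_000_000)
degraded = two_phase_hours(TOTAL_RECORDS, 20_000_000, 50_000_000, 10_000_000)
print("best case: %.1f h, two-phase estimate: %.1f h" % (best_case, degraded))
```

This gives 2.4 hours at the initial rate and about 10.4 hours under the two-phase assumption, so a further slide toward 1 million/hour is enough to push the total toward a full day.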

I was hoping that the community might be able to offer some advice as to
what we might be doing wrong, or suggest other diagnostic approaches.
We're really trying to ratchet this system up to prepare it for production
mode, and are currently left scratching our heads. Any thoughts, opinions,
or tips would be greatly appreciated.

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/98958587-eaf9-4451-84ee-78c38e7eab42%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Binh Ly-2) #2

Couple of suggestions:

  1. Try to upgrade to 1.0.1 (or whatever is the latest)

  2. I'd probably watch the node stats (jvm memory, gc collection times,
    hot_threads, and merge stats).

This may not be your problem, but on my development machine I saw behavior
similar to yours, and in my case my disk was just too slow to keep up (i.e.
merging and indexing at the same time under heavy load). When I switched to
an SSD and ran the exact same process, all my problems simply went
away. On my spinning HDD, it took 10+ hours to load my
sample of 150M docs, but on an SSD the exact same process takes about 2
hours.
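
A rough sketch of how the stats mentioned above could be pulled from the node stats API and reduced to a few numbers per node. The field names follow the 1.x response layout as far as I know and may differ on 0.90.x, so every lookup is defensive; the host in the usage note is a placeholder.

```python
import json
from urllib.request import urlopen

def fetch_json(url):
    """GET a URL and decode the JSON response."""
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

def summarize(stats):
    """Reduce a node stats response to heap, GC and merge figures per node.
    Missing fields are reported as None or 0 rather than raising."""
    rows = {}
    for node_id, node in stats.get("nodes", {}).items():
        jvm = node.get("jvm", {})
        collectors = jvm.get("gc", {}).get("collectors", {})
        gc_ms = sum(c.get("collection_time_in_millis", 0)
                    for c in collectors.values())
        merges = node.get("indices", {}).get("merges", {})
        rows[node.get("name", node_id)] = {
            "heap_used_percent": jvm.get("mem", {}).get("heap_used_percent"),
            "gc_time_ms": gc_ms,
            "merge_time_ms": merges.get("total_time_in_millis", 0),
            "merges_current": merges.get("current", 0),
        }
    return rows
```

Usage would be along the lines of `summarize(fetch_json("http://localhost:9200/_nodes/stats"))`, sampled every few minutes during the load. Hot threads are served as plain text from `/_nodes/hot_threads` and are easiest to read directly.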



(Binh Ly-2) #3

Couple of suggestions:

  1. Try to upgrade to 1.0.1 (or whatever is the latest)

  2. I'd probably watch the node stats (jvm memory, gc collection times,
    hot_threads, and merge stats).

This may not be your problem, but on my development machine I saw behavior
similar to yours, and in my case my disk was just too slow to keep up (i.e.
merging and indexing at the same time under heavy load). When I switched to
an SSD and ran the exact same process, all my problems simply went
away. On my spinning HDD, it took 10+ hours to load my
sample of 250M docs, but on an SSD the exact same process takes about 2
hours.



(Mark Walkom) #4

What java version are you using?

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com

On 12 March 2014 00:34, Elliott Bradshaw ebradshaw1@gmail.com wrote:



#5

Thanks Binh, Mark.

I'm using Oracle's Java 7 (1.7.0_51). I will try to upgrade to
Elasticsearch 1.0.1 if possible.

It could definitely be a disk speed issue. Unfortunately, we're working in
a virtualized environment and cannot upgrade to SSD storage.

What utility do you use to monitor GC collection times and hot threads?

Thanks!

On Tuesday, March 11, 2014 6:13:51 PM UTC-4, Mark Walkom wrote:



(Mark Walkom) #6

Use a plugin like Marvel or ElasticHQ.

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com

On 12 March 2014 23:29, Elliott Bradshaw ebradshaw1@gmail.com wrote:



#7

Thanks guys.

I've made some changes to my bulk indexing. I'm now kicking off Java bulk
loaders with 8 threads apiece on 3 of our 11 servers. This initially did
not help, so I went in and checked out the hot_threads in ElasticHQ.
Virtually all CPU was being spent building SpatialPrefixTrees! I
changed my geo_shape resolution from 1 km to 10 km on the index and began
reindexing. I'm now hitting 125 million records/hour over the past 20
minutes! What's more, indexing speed has remained relatively constant over
the load!
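
The multi-threaded loader pattern described above can be sketched as follows. The actual loaders were written in Java and are not shown in this thread; this is only an illustration of the same idea, and `send` is a hypothetical callable that posts one batch to the _bulk endpoint and returns the number of documents it sent.

```python
from concurrent.futures import ThreadPoolExecutor

def load_concurrently(batches, send, workers=8):
    """Send every batch through `send` using a fixed pool of worker
    threads and return the total number of documents sent.

    Note: pool.map submits all batches up front, so for a very large
    dataset the batches should come from a bounded source rather than
    one giant in-memory list."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(send, batches))
```

With 8 workers per loader on 3 machines, up to 24 bulk requests are in flight at once, compared with 3 for the single-threaded curl script.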

What doesn't make sense to me is that building SpatialPrefixTrees
should cost roughly constant CPU per document over the course of the bulk
index, yet performance was degrading dramatically as the index got bigger.
Has anyone else experienced this problem? Where before it was taking 10-12
hours to index the data, it will now likely finish indexing within the
hour. That seems like an awfully big difference (though it might also have
to do with the new Java loaders)...
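
A back-of-the-envelope sketch of why the precision change matters so much. It assumes a quadtree prefix tree (the thread does not say which tree type was in use) and a simple model in which each level halves the cell width, starting from the Earth's circumference. Elasticsearch's own level calculation may differ, so treat the numbers as orders of magnitude only.

```python
import math

EARTH_CIRCUMFERENCE_KM = 40075.0  # approximate equatorial circumference

def quadtree_levels(precision_km):
    """Levels needed so that one cell is no wider than `precision_km`
    at the equator, if every level halves the cell width."""
    return math.ceil(math.log2(EARTH_CIRCUMFERENCE_KM / precision_km))

def geo_shape_mapping(precision="10km", tree="quadtree"):
    """Field mapping fragment for a geo_shape field at a given precision.
    The tree type is an assumption, not taken from the thread."""
    return {"type": "geo_shape", "tree": tree, "precision": precision}

fine = quadtree_levels(1)     # levels for 1 km cells
coarse = quadtree_levels(10)  # levels for 10 km cells
# Each extra level can split a covered cell into 4, so this is an
# upper bound on how many more cells a filled shape may need.
max_cell_ratio = 4 ** (fine - coarse)
print(fine, coarse, max_cell_ratio)
```

Under this model 1 km precision needs 16 levels and 10 km needs 12, so a shape can index up to 256 times more cells at the finer setting. More cells per document means more terms per document, which may also make merges heavier as the index grows, though nothing in this thread confirms that as the cause of the slowdown.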

On Wednesday, March 12, 2014 5:35:31 PM UTC-4, Mark Walkom wrote:


