Spikey Index Rate

I've noticed that our index rate is very bursty and as of yet I've not been
able to figure out quite why. In the included picture from paramedic, the
times when our index rate is great are the large chunks of dark green
across each machine. Then, indexing will come to a complete halt for a
while (anything from several minutes to several hours), and it's usually
characterized by one or two machines using one core, and almost no CPU
activity on any of the other machines.

I had a suspicion that it was waiting for periodic shard merges, because
these slow periods were sometimes accompanied with a drop in the number of
deleted documents. However, this is not always the case, and so I'm
skeptical that's the cause.

The slow periods do correspond strongly to indices.indexing.index_current
(provided from the _node/stats API), but I'm a little unclear what exactly
this means. It seems like it's a measure of time the node is spending
indexing content, but what is odd to me is that it would be so spikey. This
time also doesn't seem to correspond to heightened I/O.

Any chance this looks familiar to anyone? (http://imgur.com/W6l9X)

https://lh5.googleusercontent.com/-t9woff0ykYQ/UL5W25pHL_I/AAAAAAAAAAM/B4Sy6ak0lzw/s1600/Paramedic+Indexing+Speed+Issue.jpg

--

Hello Dan,

I would suspect it's the Garbage Cleaner, and I think you can confirm/deny
that by looking at your cluster with other tools, like our SPM:

How much RAM do you have on your nodes and how much of that do you allocate
to ES?

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

On Tue, Dec 4, 2012 at 10:03 PM, Dan Lecocq dan@seomoz.org wrote:

I've noticed that our index rate is very bursty and as of yet I've not
been able to figure out quite why. In the included picture from paramedic,
the times when our index rate is great are the large chunks of dark green
across each machine. Then, indexing will come to a complete halt for a
while (anything from several minutes to several hours), and it's usually
characterized by one or two machines using one core, and almost no CPU
activity on any of the other machines.

I had a suspicion that it was waiting for periodic shard merges, because
these slow periods were sometimes accompanied with a drop in the number of
deleted documents. However, this is not always the case, and so I'm
skeptical that's the cause.

The slow periods do correspond strongly to indices.indexing.index_current
(provided from the _node/stats API), but I'm a little unclear what exactly
this means. It seems like it's a measure of time the node is spending
indexing content, but what is odd to me is that it would be so spikey. This
time also doesn't seem to correspond to heightened I/O.

Any chance this looks familiar to anyone? (http://imgur.com/W6l9X)

--

Hi Radu,

I had suspected garbage collection, but I'm not so sure now. I examined the
'slow' nodes during these periods with bigdesk, and I found that there
seemingly wasn't a lot of garbage collection going on. The used heap size
was staying relatively constant, and there wasn't any discernable
difference between these nodes and the others. Even on the same node
between when they were indexing slowly and when they were indexing quickly.
FWIW, these nodes have 15GB, of which I've allocated 8GB for ES.

I did notice, however, that the hot_threads endpoint indicated that most
of the time was spent in the UAX tokenizer
(https://gist.github.com/4217706). This made me think that perhaps it's
more a function of the documents I'm inserting than configuration. So, I
switched from bulk indexing to single-document indexing, hoping to find a
handful of documents that were slow. Oddly enough, though, this spikey
behavior went away when I stopped using the bulk index. I tried tinkering
the with the bulk size (using the pyes default of 400), and it seems that
no matter what bulk size I choose, this makes the issue emerge.

Of course, the single-doc indexing is significantly slower, so I'd still
like to figure out what the issue is.

Thoughts? It seems odd to me that slow indexing on a single node would
block the rest of the cluster from accepting index requests, or that
switching from bulk / single-document would exacerbate the issue.

On Wednesday, December 5, 2012 3:31:22 AM UTC-8, Radu Gheorghe wrote:

Hello Dan,

I would suspect it's the Garbage Cleaner, and I think you can confirm/deny
that by looking at your cluster with other tools, like our SPM:
http://sematext.com/spm/elasticsearch-performance-monitoring/index.html

How much RAM do you have on your nodes and how much of that do you
allocate to ES?

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

On Tue, Dec 4, 2012 at 10:03 PM, Dan Lecocq <d...@seomoz.org <javascript:>

wrote:

I've noticed that our index rate is very bursty and as of yet I've not
been able to figure out quite why. In the included picture from paramedic,
the times when our index rate is great are the large chunks of dark green
across each machine. Then, indexing will come to a complete halt for a
while (anything from several minutes to several hours), and it's usually
characterized by one or two machines using one core, and almost no CPU
activity on any of the other machines.

I had a suspicion that it was waiting for periodic shard merges, because
these slow periods were sometimes accompanied with a drop in the number of
deleted documents. However, this is not always the case, and so I'm
skeptical that's the cause.

The slow periods do correspond strongly to indices.indexing.index_current
(provided from the _node/stats API), but I'm a little unclear what exactly
this means. It seems like it's a measure of time the node is spending
indexing content, but what is odd to me is that it would be so spikey. This
time also doesn't seem to correspond to heightened I/O.

Any chance this looks familiar to anyone? (http://imgur.com/W6l9X)

--

I finally figured out my issue. Some of the fields we're inserting are
urls, but it turns out that some of the urls we were inserting were data
urls (http://en.wikipedia.org/wiki/Data_URI_scheme), and omitting those has
evened out performance significantly. Now things are running as smoothly as
I could have hoped.

Thanks for the input / suggestions!

On Wednesday, December 5, 2012 9:30:55 AM UTC-8, Dan Lecocq wrote:

Hi Radu,

I had suspected garbage collection, but I'm not so sure now. I examined
the 'slow' nodes during these periods with bigdesk, and I found that there
seemingly wasn't a lot of garbage collection going on. The used heap size
was staying relatively constant, and there wasn't any discernable
difference between these nodes and the others. Even on the same node
between when they were indexing slowly and when they were indexing quickly.
FWIW, these nodes have 15GB, of which I've allocated 8GB for ES.

I did notice, however, that the hot_threads endpoint indicated that most
of the time was spent in the UAX tokenizer (
https://gist.github.com/4217706). This made me think that perhaps it's
more a function of the documents I'm inserting than configuration. So, I
switched from bulk indexing to single-document indexing, hoping to find a
handful of documents that were slow. Oddly enough, though, this spikey
behavior went away when I stopped using the bulk index. I tried tinkering
the with the bulk size (using the pyes default of 400), and it seems that
no matter what bulk size I choose, this makes the issue emerge.

Of course, the single-doc indexing is significantly slower, so I'd still
like to figure out what the issue is.

Thoughts? It seems odd to me that slow indexing on a single node would
block the rest of the cluster from accepting index requests, or that
switching from bulk / single-document would exacerbate the issue.

On Wednesday, December 5, 2012 3:31:22 AM UTC-8, Radu Gheorghe wrote:

Hello Dan,

I would suspect it's the Garbage Cleaner, and I think you can
confirm/deny that by looking at your cluster with other tools, like our SPM:
http://sematext.com/spm/elasticsearch-performance-monitoring/index.html

How much RAM do you have on your nodes and how much of that do you
allocate to ES?

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene

On Tue, Dec 4, 2012 at 10:03 PM, Dan Lecocq d...@seomoz.org wrote:

I've noticed that our index rate is very bursty and as of yet I've not
been able to figure out quite why. In the included picture from paramedic,
the times when our index rate is great are the large chunks of dark green
across each machine. Then, indexing will come to a complete halt for a
while (anything from several minutes to several hours), and it's usually
characterized by one or two machines using one core, and almost no CPU
activity on any of the other machines.

I had a suspicion that it was waiting for periodic shard merges, because
these slow periods were sometimes accompanied with a drop in the number of
deleted documents. However, this is not always the case, and so I'm
skeptical that's the cause.

The slow periods do correspond strongly to
indices.indexing.index_current (provided from the _node/stats API), but I'm
a little unclear what exactly this means. It seems like it's a measure of
time the node is spending indexing content, but what is odd to me is that
it would be so spikey. This time also doesn't seem to correspond to
heightened I/O.

Any chance this looks familiar to anyone? (http://imgur.com/W6l9X)

--