Suspect GC synced between nodes causes simultaneous performance hit

Hi there,

We're noticing occasional slow-downs in New Relic, with transactions slowed
by waiting on Elasticsearch. According to our monitoring, garbage
collection appears to be spiking at the time of the slow-down.

I've attached some graphs showing the JVM GC count on each server and both
at the same time. GC count (and time) gradually increase (as you'd
expect)... but with spikes occurring, typically 4-8 hours apart. These
spikes occur simultaneously on both nodes.

Can someone help me understand what is happening when these GC spikes occur,
why they occur simultaneously, and what we can do to mitigate/eliminate the
impact (eg, do they have to do whatever it is they're doing simultaneously)?

Thanks!

Cheers,
David.

There isn't a way to control when a GC occurs, but we can try and understand why they occur. How do you monitor the GC count? It would be interesting to know which type of GC spiked, the ParNew or the CMS (if you are using the node stats, then you will see it there).

Some basic questions:

  • What operating system are you using?
  • Which JVM version are you using?
  • How much memory is allocated to the JVM?
  • Do you see a log statement in the Elasticsearch log indicating that a GC has run, and the time it took?

Also, can you check the node stats for the field cache and filter cache memory sizes? Do they increase? Does the filter cache occupy a lot of memory?
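
(For reference, a minimal sketch of pulling those numbers programmatically. It assumes the 0.x-era node stats endpoint at /_cluster/nodes/stats and the jvm/gc and indices cache field names quoted later in this thread; the exact URL and keys may differ between versions, so treat it as a starting point rather than a drop-in script.)

import json
import urllib.request

# Hypothetical monitoring helper: print per-node GC counters and cache sizes.
# The endpoint below is the 0.x-era node stats URL; newer releases moved it
# to /_nodes/stats, so adjust as needed.
STATS_URL = "http://localhost:9200/_cluster/nodes/stats"

def print_gc_and_cache_stats():
    stats = json.loads(urllib.request.urlopen(STATS_URL).read())
    for node_id, node in stats.get("nodes", {}).items():
        name = node.get("name", node_id)
        gc = node.get("jvm", {}).get("gc", {})
        indices = node.get("indices", {})
        print("%s: gc collection_count=%s" % (name, gc.get("collection_count")))
        # Per-collector breakdown (e.g. ParNew vs ConcurrentMarkSweep), if exposed.
        for collector, cstats in gc.get("collectors", {}).items():
            print("  %s: count=%s" % (collector, cstats.get("collection_count")))
        print("  field_cache=%s filter_cache=%s" % (
            indices.get("field_cache_size"), indices.get("filter_cache_size")))

if __name__ == "__main__":
    print_gc_and_cache_stats()

(Sampling this every minute or so and graphing the deltas per collector makes it clear whether a spike is young-generation churn or full CMS collections.)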

What might happen every 4-8 hours in terms of the system? Is there something that might create such load? One possible cause: if you have a front-facing web site that allows paginating through all the results (which might be a really big result set), a search crawler might do exactly that.

-shay.banon
On Tuesday, March 15, 2011 at 5:24 AM, David Taylor wrote:
Hi there,

We're noticing occasional slow-downs in New Relic, with transactions slowed by waiting on Elasticsearch. According to our monitoring, garbage collection appears to be spiking at the time of the slow-down.

I've attached some graphs showing the JVM GC count on each server and both at the same time. GC count (and time) gradually increase (as you'd expect)... but with spikes occurring, typically 4-8 hours apart. These spikes occur simultaneously on both nodes.

Can someone help me understand what is happening when these GC spikes occur, why they occur simultaneously, and what we can do to mitigate/eliminate the impact (eg, do they have to do whatever it is they're doing simultaneously)?

Thanks!

Cheers,
David.

Attachments:

  • elastic_search_gc_count_both.png

  • elastic_search_gc_count_es1.png

  • elastic_search_gc_count_es2.png

FYI, our GC performance issues were fixed up by enabling mlockall from
ES.


Shay, is this something that you recommend?

On Tue, Mar 15, 2011 at 1:07 PM, Paul ppearcy@gmail.com wrote:

FYI, our GC performance issues were fixed up by enabling mlockall from
ES.


--
Brad Birnbaum
CTO, Assistly
brad@assistly.com m: 516-308-2286

Yes, using mlockall is recommended, just to be on the safe side of never allowing the OS to swap the ES process to disk. Make sure, though, that ulimits are set properly to allow locking the amount of memory allocated to ES, and that MIN and MAX mem are set to the same value.
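
(For anyone looking for the concrete knobs, a rough sketch follows, assuming the 0.x-era setting name bootstrap.mlockall and the standard service wrapper; verify the exact names and paths for your version.)

# elasticsearch.yml: ask ES to lock its memory (mlockall) at startup
bootstrap.mlockall: true

# shell / /etc/security/limits.conf: allow the ES user to lock enough memory
ulimit -l unlimited

# service wrapper config: keep min and max heap identical so the locked
# region never needs to grow
set.default.ES_MIN_MEM=8192
set.default.ES_MAX_MEM=8192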
On Tuesday, March 15, 2011 at 8:22 PM, Brad Birnbaum wrote:

Shay, is this something that you recommend?


On Tue, Mar 15, 2011 at 6:19 PM, Shay Banon shay.banon@elasticsearch.com wrote:

There isn't a way to control when a GC occurs, but we can try and
understand why they occur. How do you monitor the GC count? It would be
interesting to know which type of GC spiked, the ParNew or the CMS (if you
are using the node stats, then you will see it there).

Hi Shay,

We're using http://localhost:9200/_cluster/health to pull out
jvm->gc->collection_count for each node. The graphs I attached show those
counts for each node and side-by-side for both nodes.

Some basic questions:

  • What operating system are you using?

We're using Ubuntu 10.10 (Maverick) x64, running on EC2 m1.large instances.

  • Which JVM version are you using?

java version "1.6.0_23"
Java(TM) SE Runtime Environment (build 1.6.0_23-b05)
Java HotSpot(TM) 64-Bit Server VM (build 19.0-b09, mixed mode)

  • How much memory is allocated to the JVM?

We're using the service wrapper with:

set.default.ES_MIN_MEM=8192
set.default.ES_MAX_MEM=8192
wrapper.java.initmemory=%ES_MIN_MEM%
wrapper.java.maxmemory=%ES_MAX_MEM%

  • Do you see a log statement in the Elasticsearch log indicating that a GC has
    run, and the time it took?

No, we have GC logging disabled in production, so we're inferring that rapid
increases in GC counts indicate the times that GC has run.
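
(Should it become useful to confirm that directly, standard HotSpot GC-logging flags can be added through the service wrapper's additional-JVM-arguments mechanism. A rough sketch, with the property numbers and log path chosen arbitrarily for illustration:)

# hypothetical wrapper config additions to log each GC with timestamps and details
wrapper.java.additional.10=-verbose:gc
wrapper.java.additional.11=-XX:+PrintGCDetails
wrapper.java.additional.12=-XX:+PrintGCTimeStamps
wrapper.java.additional.13=-Xloggc:/var/log/elasticsearch/gc.log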

Also, can you check the node stats for the field cache and filter cache memory
sizes? Do they increase? Does the filter cache occupy a lot of memory?

I noticed different memory sizes between the nodes, but I assume that can
simply be due to how they're used?

There are a high number of evictions on the master node. However, I checked
these stats yesterday and today, more than 12 hours apart, and the evictions
have not changed (4,605 yesterday and still 4,605 today):

"nodes" : {
"UMi0Q32TTNOt14Fjwd-oUA" : {
"name" : "Black Tom Cassidy",
"indices" : {
"store_size" : "3.4gb",
"store_size_in_bytes" : 3709373643,
"field_cache_evictions" : 4605,
"field_cache_size" : "190.4mb",
"field_cache_size_in_bytes" : 199713560,
"filter_cache_size" : "6.5gb",
"filter_cache_size_in_bytes" : 7019021448
},
. . .
"nb_CvB7wTSqrEt928agjnA" : {
"name" : "X-Man",
"indices" : {
"store_size" : "4.7gb",
"store_size_in_bytes" : 5063162525,
"field_cache_evictions" : 0,
"field_cache_size" : "247.1mb",
"field_cache_size_in_bytes" : 259161452,
"filter_cache_size" : "2.5gb",
"filter_cache_size_in_bytes" : 2788778984
},

What might happen every 4-8 hours in terms of the system? Is there
something that might create such load? One possible cause: if you have a
front-facing web site that allows paginating through all the results (which
might be a really big result set), a search crawler might do exactly that.

That's a very good question. The period between each GC run (or, at least,
what I'm assuming is a GC run) varies. I can't see any cron jobs that would
cause it and I wouldn't expect to with the varying times.

I'll check the logs for search crawlers around those times and I'll ask the
guys if they have any ideas.

Thanks for your help, Shay.

Cheers,
David.


Hi Shay,

I read the documentation on mlockall and can see that it's to prevent the
process being swapped out. However, I'm not clear what effect this has on GC.
Presumably the address space is still the same size and GC still needs to
occur.

Thanks!

ps. We aim not to use swap and make sure we have more memory than needs to
be allocated.

Cheers,
David.

On Wed, Mar 16, 2011 at 6:24 AM, Shay Banon shay.banon@elasticsearch.com wrote:

Yes, using mlockall is recommended, just to be on the safe side of never
allowing the OS to swap the ES process to disk. Make sure, though, that
ulimits are set properly to allow locking the amount of memory allocated to
ES, and that MIN and MAX mem are set to the same value.


I have a sneaking suspicion, based on the node stats, that it might be because of excessive filter caching. Which version are you using? In 0.15.2, I fixed a bug where filters were wrongly double-cached when using filter facets.

Where do you typically use filters? And what type of filters do you use?
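
(To make the question concrete, below is a sketch of the kind of request that exercises the filter cache: a filtered query plus a terms facet with a facet_filter. The field names account_id, status, and assigned_to are invented for the example and are not taken from David's application.)

{
  "query" : {
    "filtered" : {
      "query" : { "match_all" : {} },
      "filter" : { "term" : { "account_id" : 42 } }
    }
  },
  "facets" : {
    "by_status" : {
      "terms" : { "field" : "status" },
      "facet_filter" : { "term" : { "assigned_to" : "david" } }
    }
  }
}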

Yes, the same GC needs to occur even when swap is used, but, because of the nature of the JVM, swapping it out can result in much longer GC runs.
On Wednesday, March 16, 2011 at 1:00 AM, David Taylor wrote:

Hi Shay,

I read the documentation on mlockall and can see that it's to prevent the process being swapped out. However, I'm not clear what effect this has on GC. Presumably the address space is still the same size and GC still needs to occur.

Thanks!

ps. We aim not to use swap and make sure we have more memory than needs to be allocated.

Cheers,
David.


Hi Shay,

We're using 0.14.2.

About the filters, I have no idea; I'll get back to you on that. :)

Thanks.

Cheers,
David.

On Wed, Mar 16, 2011 at 10:01 AM, Shay Banon shay.banon@elasticsearch.com wrote:

I have a sneaking suspicion, based on the node stats, that it might be
because of excessive filter caching. Which version are you using? In 0.15.2,
I fixed a bug where filters were wrongly double-cached when using filter facets.

Where do you typically use filters? And what type of filters do you use?



Some additional info on the stats above: the filter cache size on the master
has increased by a factor of 10 since yesterday (it was approximately 640mb
yesterday; now it's 6.5gb).

Cheers,
David.

Right. Information on where you use filters and what type of filters you use will help. The fact that it's the master node, by the way, has no effect: there are just shards allocated on nodes and searches executed against them, and the master plays no specific role here.

Hi Shay,

Filters are used by our web application. They are only available to
authenticated users, so they can't be triggered by search crawlers.

We're using filtered facets.

Cheers,
David.


On Wednesday, March 16, 2011 at 5:54 AM, David Taylor wrote:
Hi Shay,

Filters are used by our web application. They are only available to authenticated users, so they can't be triggered by search crawlers.
This sounds reasonable.

We're using filtered facets.
Which filters?
