On Tue, Mar 15, 2011 at 6:19 PM, Shay Banon <shay.banon@elasticsearch.com> wrote:
There isn't a way to control when a GC occurs, but we can try and
understand why they occur. How do you monitor the GC count? It would be
interesting to know which type of GC spiked, the ParNew or the CMS (if you
are using the node stats, then you will see it there).
Hi Shay,
We're using the node stats API (http://localhost:9200/_cluster/nodes/stats) to pull out
jvm->gc->collection_count for each node. The graphs I attached show those
counts for each node and side-by-side for both nodes.
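For what it's worth, the polling is roughly the sketch below (simplified; the endpoint and the jvm -> gc -> collection_count field names are what our 0.x node stats return, so they may differ on other versions):

# Rough sketch of our GC polling (assumption: the node stats API exposes
# per-node JVM stats under jvm -> gc -> collection_count).
import json
import time
import urllib.request

STATS_URL = "http://localhost:9200/_cluster/nodes/stats"

def gc_counts():
    with urllib.request.urlopen(STATS_URL) as resp:
        stats = json.load(resp)
    return {node["name"]: node["jvm"]["gc"]["collection_count"]
            for node in stats["nodes"].values()}

while True:
    print(time.strftime("%H:%M:%S"), gc_counts())
    time.sleep(60)  # one sample per minute feeds the graphs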
Some basic questions:
- What operating system are you using?
We're using Ubuntu 10.10 (Maverick) x64, running on EC2 m1.large instances.
- Which JVM version are you using?
java version "1.6.0_23"
Java(TM) SE Runtime Environment (build 1.6.0_23-b05)
Java HotSpot(TM) 64-Bit Server VM (build 19.0-b09, mixed mode)
- How much memory is allocated to the JVM?
We're using the service wrapper, with both values in megabytes (so an 8 GB heap):
set.default.ES_MIN_MEM=8192
set.default.ES_MAX_MEM=8192
wrapper.java.initmemory=%ES_MIN_MEM%
wrapper.java.maxmemory=%ES_MAX_MEM%
- Do you see a log statement in the elasticsearch log indicating a GC has run,
and the time it took?
No, we have GC logging disabled in production, so we're inferring that rapid
increases in GC counts indicate the times that GC has run.
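If it would help, we can turn GC logging on; my understanding is that adding something like the following to the service wrapper config should do it (the property numbers are placeholders and would need to continue on from whatever is already in elasticsearch.conf, and the log path is just an example):
wrapper.java.additional.3=-verbose:gc
wrapper.java.additional.4=-XX:+PrintGCDetails
wrapper.java.additional.5=-XX:+PrintGCDateStamps
wrapper.java.additional.6=-Xloggc:/var/log/elasticsearch/gc.log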
Also, can you check the node stats and look at the field cache and filter
cache memory sizes? Do they increase? Does the filter cache occupy a lot of
memory?
I noticed different memory sizes between the nodes, but I assume that can
simply be due to how they're used?
There are a high number of evictions on the master node. However, I checked
these stats yesterday and today, more than 12 hours apart, and the evictions
have not changed (4,605 yesterday and still 4,605 today):
"nodes" : {
"UMi0Q32TTNOt14Fjwd-oUA" : {
"name" : "Black Tom Cassidy",
"indices" : {
"store_size" : "3.4gb",
"store_size_in_bytes" : 3709373643,
"field_cache_evictions" : 4605,
"field_cache_size" : "190.4mb",
"field_cache_size_in_bytes" : 199713560,
"filter_cache_size" : "6.5gb",
"filter_cache_size_in_bytes" : 7019021448
},
. . .
"nb_CvB7wTSqrEt928agjnA" : {
"name" : "X-Man",
"indices" : {
"store_size" : "4.7gb",
"store_size_in_bytes" : 5063162525,
"field_cache_evictions" : 0,
"field_cache_size" : "247.1mb",
"field_cache_size_in_bytes" : 259161452,
"filter_cache_size" : "2.5gb",
"filter_cache_size_in_bytes" : 2788778984
},
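If it's useful, we can track those cache sizes over time with the same node stats polling; a rough sketch (field names as in the output above, simplified):

# Sketch: sample the field/filter cache sizes so we can see whether they
# grow in step with the GC spikes (field names as in the output above).
import json
import urllib.request

STATS_URL = "http://localhost:9200/_cluster/nodes/stats"

with urllib.request.urlopen(STATS_URL) as resp:
    stats = json.load(resp)

for node in stats["nodes"].values():
    idx = node["indices"]
    print(node["name"],
          "field cache:", idx["field_cache_size"],
          "filter cache:", idx["filter_cache_size"],
          "evictions:", idx["field_cache_evictions"])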
What might happen every 4-8 hours in terms of the system? Is there
something that might create such load? One possible cause is if you run a
front-facing web site that allows paginating through all the results
(which might be a really big result set), and a search crawler does exactly that.
That's a very good question. The period between each GC run (or, at least,
what I'm assuming is a GC run) varies. I can't see any cron jobs that would
cause it, and given the varying intervals I wouldn't expect to find one.
I'll check the logs for search crawlers around those times and I'll ask the
guys if they have any ideas.
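(Most likely I'll just scan the front-end access logs for bot-ish user agents around the spike times; something like the sketch below, where the log path and pattern are specific to our setup.)

# Sketch: pull likely crawler requests out of our access log so we can line
# them up against the GC spike times (log path and pattern are ours; adjust).
import re

BOT = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)

with open("/var/log/nginx/access.log") as log:
    for line in log:
        if BOT.search(line):
            print(line.rstrip())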
Thanks for your help, Shay.
Cheers,
David.
On Tuesday, March 15, 2011 at 5:24 AM, David Taylor wrote:
Hi there,
We're noticing occasional slow-downs in New Relic, with transactions slowed
by waiting on Elasticsearch. According to our monitoring, garbage
collection appears to be spiking at the time of the slow-down.
I've attached some graphs showing the JVM GC count on each server and both
at the same time. GC count (and time) gradually increases, as you'd expect,
but with spikes occurring typically 4-8 hours apart. These spikes occur
simultaneously on both nodes.
Can someone help me understand what is happening when these GC spikes
occur, why they occur simultaneously, and what we can do to
mitigate/eliminate the impact (e.g., do they have to do whatever it is they're
doing simultaneously)?
Thanks!
Cheers,
David.
Attachments:
- elastic_search_gc_count_both.png
- elastic_search_gc_count_es1.png
- elastic_search_gc_count_es2.png