Hi Zach,
I've considered GC before, as our machines normally use almost all of
the 13GB allocated to the heap. However, I would have expected to see
the GC in the jstack dump and in the JVM monitoring, but neither shows
anything. There is also no real change in memory footprint at the time
of the incident.
There are no facet queries or sorts on more than one field, and the
maximum number of docs that can be searched for is 100, so no search
should be overly memory-hungry. We do, however, allow our users to
write their own query strings, so I suppose it's conceivable that a
particular query could cause the issue. Having checked the logs,
though, I can't see any that stand out as potential server killers.
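For what it's worth, this is the sort of guard I've been thinking of
putting in front of the user-supplied query strings. It's purely a
hypothetical sketch (the class name and limits are made up, not
something we run today):

    // Hypothetical sanity check applied before a user query string is sent to ES.
    // The class name and the limits are invented for illustration only.
    public class QueryGuard {
        private static final int MAX_QUERY_LENGTH = 512;
        private static final int MAX_RESULTS = 100;  // we already cap result size at 100

        static void validate(String userQuery, int requestedSize) {
            if (userQuery == null || userQuery.length() > MAX_QUERY_LENGTH) {
                throw new IllegalArgumentException("query string too long");
            }
            if (requestedSize > MAX_RESULTS) {
                throw new IllegalArgumentException("too many docs requested");
            }
            // Leading wildcards force a walk over the whole term dictionary,
            // which is a classic "server killer" query shape.
            if (userQuery.startsWith("*") || userQuery.startsWith("?")) {
                throw new IllegalArgumentException("leading wildcard not allowed");
            }
        }
    }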
There is a lot of indexing taking place on the cluster at times; the
traffic is both read- and write-heavy but runs at a fairly constant
rate. The disks are under a fair bit of load and there is some IO wait,
although it drops off when the cluster becomes unresponsive.
I've been looking at the jstack traces, and in particular at the
IN_JAVA threads. Most of them are in the method
org.apache.lucene.search.ReferenceManager.acquire(), which seems
suspicious to me and suggests some sort of locking issue. I wouldn't
expect that many threads to be in that section of code at the same time
under normal operation, but I'm in no way an expert on Lucene and its
internals.
Can anyone shed any light on this? Does it seem wrong that so many threads
are in org.apache.lucene.search.ReferenceManager.acquire?
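To show what I mean, here is my rough understanding of the
acquire/release pattern that search code goes through via
SearcherManager (a ReferenceManager subclass). This is only a sketch
based on the Lucene 4.x javadocs, assuming the version bundled with ES
0.90.x, and not the actual ES code path:

    // Minimal sketch of the acquire/release pattern a search goes through,
    // assuming Lucene 4.x as bundled with ES 0.90.x. Not the actual ES code path.
    import java.io.IOException;

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;
    import org.apache.lucene.search.SearcherManager;
    import org.apache.lucene.search.TopDocs;

    public class AcquireReleaseSketch {
        // acquire() bumps a reference count on the current IndexSearcher;
        // release() drops it. A searcher that is never released pins old segments.
        static TopDocs search(SearcherManager manager) throws IOException {
            IndexSearcher searcher = manager.acquire();
            try {
                return searcher.search(new MatchAllDocsQuery(), 10);
            } finally {
                manager.release(searcher);
            }
        }
    }

Given that every search passes through acquire(), maybe seeing lots of
threads there isn't automatically wrong, but the sheer number of them
still looks odd to me.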
Thanks,
Jon
On Tuesday, September 10, 2013 3:09:39 AM UTC+1, Zachary Tong wrote:
Hey Jon. My first guess is that you are experiencing Stop-the-world GC
cycles. These occur when the JVM runs out of heap space, or is getting
near enough to running out that it decides to run a GC. Normally a GC is
very fast, but if you have memory pressure (e.g. most of the heap is full
and all the objects are still in use) then the GC can take a very long
time. During these GCs, absolutely nothing happens - the world is stopped
while the GC tries to free memory.
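As a rough illustration of what to capture (a generic JMX sketch, not
the way Elasticsearch itself reports it), the numbers worth watching
are the cumulative GC counts and times per collector; a multi-second
jump in collection time right when the cluster locks up would point at
a stop-the-world pause:

    // Rough sketch: poll GC totals and heap usage via the standard JMX beans.
    // Run inside the ES JVM or read the same beans over remote JMX; the node
    // stats API exposes similar counters in practice.
    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryUsage;

    public class GcPauseCheck {
        public static void main(String[] args) throws InterruptedException {
            while (true) {
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    // getCollectionTime() is the cumulative time (ms) spent in this collector
                    System.out.printf("%s: collections=%d total_time=%dms%n",
                            gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
                }
                MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
                System.out.printf("heap used=%dMB max=%dMB%n",
                        heap.getUsed() / (1024 * 1024), heap.getMax() / (1024 * 1024));
                Thread.sleep(10000);
            }
        }
    }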
When the cluster becomes unresponsive, do you have memory/heap metrics
from that time? How much heap is normally utilized on a day-to-day
basis? Do you run expensive queries like facets or heavy sorts?
Some more potential things to consider:
- Since you mentioned high CPU, you may be hitting some heavy segment
merges on a particular node, and you may need to adjust your merge
throttling (there is a sketch of the throttle settings after this
list). Merges can eat up a lot of disk IO and CPU. Are you performing
heavy indexing?
- You have most of the available memory given to the ES heap, so it is
possible some of the load could be coming from excessive thrashing of
the file system cache (although that would largely manifest as disk
saturation rather than CPU).
- What does your query load look like? Is it possible for an
exceptionally heavy query to arise occasionally that is "abusive" to
the system (e.g. a request for 10k documents, or a very heavy script
sort, etc.)?
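If merges do turn out to be the culprit, something like the following
caps how fast the store will accept merge writes. This is a sketch from
memory of the 0.90-era Java client and setting names, so double-check
both against your version:

    // Sketch only: dynamically throttle store writes for merges to 20mb/sec.
    // Setting names and the client API are recalled from the 0.90 docs; verify them.
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.settings.ImmutableSettings;

    public class MergeThrottleSketch {
        static void throttleMerges(Client client) {
            client.admin().cluster().prepareUpdateSettings()
                    .setTransientSettings(ImmutableSettings.settingsBuilder()
                            .put("indices.store.throttle.type", "merge")
                            .put("indices.store.throttle.max_bytes_per_sec", "20mb")
                            .build())
                    .execute().actionGet();
        }
    }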
-Zach
On Monday, September 9, 2013 8:28:17 AM UTC-4, jon.sp...@skimlinks.com wrote:
Hi,
We've been having a problem with our production Elasticsearch cluster
for a while now; it has caused some downtime, and we're struggling to
get to the bottom of it.
The problem is that the cluster "locks up" and stops responding to any
requests. In our cluster of 3 machines, 2 will be effectively idle and
using very little CPU, whereas the 3rd will be using 100% CPU. This is
the jstack dump from the busy machine:
locked_elasticsearch_jstack.txt (on GitHub).
Restarting ES normally on this box resolves the problem; a 'kill -9'
is not required.
There are no errors in the logs and we can't see any unusual activity
in our apps that would cause this. The problem has occurred 4 or 5
times over the last few months and has happened on various versions
(we are currently on 0.90.2).
Some info on our cluster:
OS: Ubuntu 12.04 LTS
ES version: 0.90.2
JVM: 1.7.0_21
ES_HEAP_SIZE: 13gb
Machines: 3 x XLarge EBS-backed EC2 instances with provisioned IOPS,
15GB RAM
Indexes: 4 (3GB, 8GB, 10GB and 34GB)
Shards: 12 on each index
Replicas: 2 for each index
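If I've added that up correctly, that is 4 indexes x 12 shards x 3
copies (1 primary + 2 replicas) = 144 shards spread across our 3 nodes,
i.e. roughly 48 shards per node.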
We've run out of ideas so any help would be appreciated.
Thanks,
Jon