Indexing performance degrading over time

I don't have nearly the same info as you do, Arun, but we have been seeing something similar, see here: CPU usage slowly climbs until ES needs a restart

We're going to upgrade to ES 2.x and see if this continues; we're currently on 1.7.4. I've been following this thread, and if there's something I can help provide, I'm happy to do so.

We changed the GC from G1 to the default (not CMS), and the problem seems to have disappeared... Of course, the log now has GC entries about long pauses (5-10 s), but the cluster works normally: indexing speed is 100M/5min with a low load average... Sounds strange.
We decided to switch the GC to CMS, and later back to G1, to see if the problem returns.

Thanks for the update ... with CMS, even though you have long collections, do you still see hot threads stuck reclaiming ThreadLocal?

Can anyone else confirm switching away from G1 GC "fixes" these bad hot threads?

I never had ThreadLocal issues with G1 GC. For G1 GC, I removed -XX:+DisableExplicitGC from bin/elasticsearch.in.sh because it's unpredictable and unreliable: it does not prevent GC entirely; instead, I assume it prevents some ordinary full GC runs, and G1 GC requires exactly those to function properly.

Maybe the option -XX:-UseBiasedLocking should also be set, but that may depend on the CPU (!), i.e. whether it is Sandy Bridge or later, or not: http://mechanical-sympathy.blogspot.de/2011/11/java-lock-implementations.html
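
For anyone who wants to try that combination, here is a minimal sketch of what the G1-related lines in bin/elasticsearch.in.sh could look like; it assumes the stock CMS flags and the -XX:+DisableExplicitGC line have been removed, and is only an illustration based on this discussion, not an Elastic recommendation:

    # Switch the collector to G1 (experimental; the stock CMS flags and
    # -XX:+DisableExplicitGC are assumed to have been removed above)
    JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC"
    # Optionally disable biased locking, depending on the CPU generation
    JAVA_OPTS="$JAVA_OPTS -XX:-UseBiasedLocking"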

I'll see if we can test this. When I grabbed the hot threads during the latest occurrence earlier this week, I saw java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry(ThreadLocal.java:617).
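
For reference, these hot threads can be captured with the standard nodes hot threads API; host and port below are placeholders:

    # Capture hot threads from all nodes (adjust host/port for your cluster)
    curl -s "http://localhost:9200/_nodes/hot_threads?threads=5" > hot_threads.txt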

@mikemccand No, I don't see any mention of ThreadLocal at all. What's more, I noticed that Lucene merges no longer put as much pressure on the disks as they used to.

@jprante We have -XX:+DisableExplicitGC set by default with ES; could it be the cause? We will also try turning this option off. As for biased locks, I don't see any real difference between biased and non-biased locks on the graphs... moreover, our hosts run on AMD CPUs.

Well, we have tried changing the GC; here are the results (a quick way to verify which collector each node is running is shown after the list):

  • Default GC: worked fine until it was changed.
  • G1 GC without -XX:+DisableExplicitGC: no, it didn't help. One of the nodes still becomes overloaded with the same symptoms.
  • CMS GC (original elasticsearch.in.sh without changes): works perfectly, and throughput is even higher, without long GC pauses.
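
As mentioned, the node info API reports which collectors each JVM is actually running with, which is handy when only part of the cluster has been switched; a minimal check, assuming the default port:

    # The "jvm" section of the node info lists the active gc_collectors per node
    curl -s "http://localhost:9200/_nodes/jvm?pretty" | grep -A2 gc_collectors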

Yeah, you are right with your default configuration. Well done, guys, keep up the good work!

As for G1, maybe we should submit a bug for it? But I don't know how it can be reproduced in code.

Hey Ivan,

We made the change from G1 to CMS on only one node in our cluster so far. In the 4 days since, the CMS node is looking good, while one of the G1 nodes is already acting up as usual. Our problem manifests itself over a ~7-10 day period, so I'm not ready to say anything more than that just yet.

I started seeing this issue again. I have followed the same steps for one of the hosts and changed the GC from G1 to CMS. I'll update the ticket in a day or so, as our host goes into this state within a day.

I'm pretty close to calling our issue solved by moving to CMS. The G1 nodes continue to need restarting every few days, while the CMS node has now been up for almost 12 days without any issues. This is a significant record for us :smiley:

We'll likely be swapping the GC on the whole cluster here shortly.

We made the switch from G1 to CMS on the entire stack, and not only has it indeed solved the runaway CPU problem, look at how much more stable the response times have become:

Crazy to see such a difference in the higher percentiles.

Thanks to everyone here who helped figure this out, I can stop complaining about ES now :smiley:

Can you specify your machine (CPU cores, RAM, disks), OS, and JVM, including JVM parameters?

It would be interesting to reproduce. From my numbers, I see it just the other way round :slight_smile:

For us, these nodes are c3.2xlarge instances on AWS with a 7.5 GB heap. We are still on ES 1.7.4 though, running Java 1.8.0_66. We'll be making the update to ES 2.x in a few months and will no doubt retry this whole experiment, since it was so easy to reproduce.

AWS uses Intel E5-2680 v2 CPUs (10 cores, 20 threads/vCPUs) for c3.2xlarge, which is advertised with only 8 threads/vCPUs for the product. IMHO, that means only 4 real cores are used, which is certainly a bit low for G1, because G1 produces a lot of CPU overhead; it scales much better the higher the core count is.

Also, a 7.5 GB heap is only just above the 6 GB lower bound that G1 GC is targeted at:

https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/g1_gc.html

Oracle only sees advantages of G1 when

  • More than 50% of the Java heap is occupied with live data.
  • The rate of object allocation or promotion varies significantly.
  • The application is experiencing undesired long garbage collection or compaction pauses (longer than 0.5 to 1 second).

All of this has been improved over the years by ES: by tuning the JVM heap parameters to eliminate full CMS GC runs, and by reducing the live object count on the heap, especially in ES 2.x.
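
For context, the CMS tuning referred to here is roughly what the stock ES 1.x bin/elasticsearch.in.sh already ships; the values below are quoted from memory and should be checked against your own copy:

    # Default collector setup in ES 1.x (approximate; verify in your elasticsearch.in.sh)
    JAVA_OPTS="$JAVA_OPTS -XX:+UseParNewGC"
    JAVA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC"
    # Start concurrent collections early enough to avoid full stop-the-world CMS runs
    JAVA_OPTS="$JAVA_OPTS -XX:CMSInitiatingOccupancyFraction=75"
    JAVA_OPTS="$JAVA_OPTS -XX:+UseCMSInitiatingOccupancyOnly"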

So using G1 will lead to lower GC latency, but the price is lower throughput and lower performance.

But G1 is not responsible for any indexing performance degradation over time, as this GCViewer image shows:

CPU: 2x AMD Opteron 6274 (32 cores)
RAM: 64 GB
Disk: 8x450GB SAS2 6Gb/s (RAID1+0)
OS: RHEL 7.2
JDK: Oracle Java 8u77
Command line flags: -XX:GCLogFileSize=10485760 -XX:InitialHeapSize=17179869184 -XX:MaxGCPauseMillis=1000 -XX:MaxHeapSize=17179869184 -XX:NumberOfGCLogFiles=32 -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseG1GC -XX:+UseGCLogFileRotation

My throughput numbers for a mixed workload that indexes ~98 million documents while collecting docs from other indices by executing ~200-300 million term queries:

10 minutes: ingest: 1936194 docs, 4.85 MB/s
20 minutes: ingest: 4423150 docs, 5.35 MB/s
30 minutes: ingest: 6708534 docs, 5.45 MB/s
1 hour: ingest: 13858395 docs, 5.63 MB/s
2 hours: ingest: 27939888 docs, 5.66 MB/s
3 hours: ingest: 41825762 docs, 5.68 MB/s
4 hours: ingest: 55708007 docs, 5.71 MB/s
5 hours: ingest: 69378607 docs, 5.69 MB/s
6 hours: ingest: 82587068 docs, 5.69 MB/s
7 hours: ingest: 96188962 docs, 5.66 MB/s
8 hours: ingest: 110111536 docs, 5.67 MB/s
9 hours: ingest: 124565124 docs, 5.69 MB/s
9 hours 30 minutes: ingest: 131873275 docs, 5.72 MB/s
(total ingest after 9h42m: 192.7 GB)

Index performance degradation strongly depends on the resources available to ES, mostly on whether the heap size is configured too low for the assigned workload. In that case, GC will always freak out to avoid OOM at all costs, no matter whether it's CMS, G1, or any other GC algorithm.
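
A quick way to see whether a node is in that state is to watch heap usage and GC counters over time; the nodes stats API exposes both (host/port are placeholders):

    # heap_used_percent close to 100 plus rapidly growing old-generation collection
    # counts/times usually means the heap is too small for the workload
    curl -s "http://localhost:9200/_nodes/stats/jvm?pretty"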

Hi Jörg,

Indeed, there is a lot of testing and tweaking one should do for their particular usage and cluster size. All of the above information is useful for anyone else coming across this thread.

Just making sure it was clear: we never saw the indexing slowdown issue. We were seeing the other issue described in this thread, which was runaway CPU problems that could only be solved by restarting the troubled node. I would hazard to say that, since we do not see this issue with CMS, there are still some bugs in G1GC. I know Elastic has been cautious about recommending G1 for reasons exactly like this; clearly something in our index design and/or queries was tripping on a bug that is very rare. The problem is completely gone with CMS and yet 100% reproducible with G1.

For now all that matters is we have a stable ES cluster. We migrated from Solr, and those instances at the time had been up for > 2 years without any issues whatsoever. We migrated to ES and had nothing but problems. I guess we should have never enabled G1 in the first place. :confused:

Hi Jörg,
our setup for a data node: 2x AMD Opteron(tm) Processor 6376, 128 GB RAM, HP MSA 2040 (3 mount points, EXT4 on RAID-6). On this server there are two ES instances (data nodes, 30.5 GB JVM heap each) and one custom data loader (client node, 4 GB RAM).

I have been running G1 GC tests on my machines for weeks at a time since the end of 2011, on early JDK 8 builds, and never saw a single crash with data loss caused by G1 GC.

Using G1 GC is not for beginners, and there are some rules to follow from my own experience (see the sketch after this list):

  • prefer physical machines (there can be subtle memory management issues in VM guest/host implementations)
  • prefer the latest kernel/OS (to include all available fixes for memory management issues outside the JVM)
  • always use Java 8 (G1 GC on JDK 7 is a nightmare, especially with the permgen space issues)
  • prefer one 64-bit JVM per machine for data nodes (otherwise many G1 GC instances compete against each other and steal too many CPU cycles)
  • reserve at least 6 GB of heap with -XX:+UseG1GC (otherwise the regionized heap does not make much sense and G1 is under stress with small heaps)
  • do not use -Xmn or any other -XX heap parameter "optimization" with -XX:+UseG1GC (or G1 GC will go crazy)
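
Putting those rules together, a minimal G1 setup for an ES data node could look like the sketch below; the 8g heap is only an example that satisfies the 6 GB lower bound, and everything else is deliberately left at the G1 defaults:

    # environment / elasticsearch.in.sh, following the rules above (illustrative only)
    ES_HEAP_SIZE=8g                        # >= 6 GB so the regionized heap makes sense
    JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC"    # no -Xmn, no extra -XX heap "optimizations"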

It's true that CMS GC is a very old and stable part of the JVM, where JVM engineers take care not to change code in order to preserve stability, while G1 is relatively new and sees more code changes.

There are also other severe issues in JVM GC that may cause crashes (which can lead to data loss), so I think the statement "there are bugs in G1 GC so I can't use it" should not be overrated. Any JVM crash is serious, especially when triggered by Lucene/Elasticsearch heavy-duty testing, but there are other failing components in the JVM besides G1 GC that need to be fixed.

AFAIK ES detected G1 GC crashes against early releases of G1 GC in JDK 7 and on 32-bit JVMs on virtual machines, which are obviously prone to crashes because a 32-bit JVM has reduced address space management capabilities. These test configurations are not part of my setup, hence I could not reproduce the crashes.

Example search for open JDK issues:

https://bugs.openjdk.java.net/browse/JDK-8152740?jql=project%20%3D%20JDK%20AND%20issuetype%20%3D%20Bug%20AND%20status%20%3D%20Open%20AND%20text%20~%20"crash%20gc"

We've seen crashes on JDK 8 builds too, and there is the FP register corruption bug that impacts at least 8u60 through the latest 8u92. For further details on our current position on G1 GC, please see A Heap of Trouble.

Though this solved the issue of one host going hot, we had a stop-the-world phase for our cluster: it went red for a while and then started reallocation across the entire fleet.
We have a heap of 30 GB; our next step is to reduce the heap size and find out what impact that has.
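
If it helps anyone following along, shrinking the heap on an ES 1.x/2.x node is just a matter of setting ES_HEAP_SIZE before start-up; the 16g value below is purely an example, not a recommendation for this workload:

    # e.g. in /etc/default/elasticsearch or the service environment (example value)
    export ES_HEAP_SIZE=16g
    # restart the node for the new -Xms/-Xmx to take effect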