Indexing performance degrading over time

Yes, I have seen the same thread stuck in this state.
The only way to clear this up is to send a QUIT signal to the process.

But this does not explain why the issue was resolved after adding clientNodes?

What I meant was: once a thread is stuck here, is it forever here? I'm trying to differentiate between "threads are forever stuck in this code" vs "they are spending a lot of time in this code" (which may be because ES is creating too many ThreadLocals or something).

But this does not explain why the issue was resolved after adding clientNodes?

Is this a 100% reliable workaround?

Maybe this changed a bit how the ES nodes are using ThreadLocals and sidestepped the (possible) JVM bug? Or this is not a JVM bug at all and something else is at work (too many ThreadLocals).
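
If anyone wants to poke at the "too many ThreadLocals" theory outside of ES, here is a minimal sketch (my own toy, nothing from the ES code base): it churns through a lot of ThreadLocals on a single thread, keeps a sparse set of them alive, drops the rest so their entries go stale after a GC, and then times get() on the survivors. If the theory holds for a given JVM/GC combination, the lookups should slow down and the ThreadLocalMap methods should show up in a profiler, much like in the hot threads here:

import java.util.ArrayList;
import java.util.List;

public class ThreadLocalChurn {

    public static void main(String[] args) {
        List<ThreadLocal<byte[]>> survivors = new ArrayList<>();
        List<ThreadLocal<byte[]>> churn = new ArrayList<>();

        for (int i = 0; i < 500_000; i++) {
            ThreadLocal<byte[]> tl = ThreadLocal.withInitial(() -> new byte[64]);
            tl.get();                  // force an entry into this thread's ThreadLocalMap
            if (i % 1_000 == 0) {
                survivors.add(tl);     // keep a sparse set of live keys
            } else {
                churn.add(tl);         // these will be dropped
            }
        }

        churn.clear();                 // churned keys become weakly reachable
        System.gc();                   // let their weak keys be cleared -> stale entries

        long start = System.nanoTime();
        long sink = 0;
        for (int round = 0; round < 1_000; round++) {
            for (ThreadLocal<byte[]> tl : survivors) {
                sink += tl.get().length;   // each get() probes the polluted table
            }
        }
        long elapsedNs = System.nanoTime() - start;
        System.out.println("sink=" + sink + ", avg get() ns="
                + (double) elapsedNs / (1_000L * survivors.size()));
    }
}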

Our current workaround is to stop uploading data for 20-30 minutes; after we resume uploading, the bulk queue on the affected node disappears and the indexing speed increases greatly. At the moment we do this every late evening, apparently because the indices become very large by the evening.

I ran this OQL script in jhat to collect some statistics:

select (function(table) {
    // Walk this thread's ThreadLocalMap backing table and summarize it:
    // total slots, the longest contiguous run of occupied slots, and how
    // many entries still have a live (non-null) weak key and/or value.
    var max_nonnull_range = 0;
    var current_range_value = 0;
    var nonnull_entries = 0;
    var nonnull_key_count = 0;
    var nonnull_value_count = 0;
    var nonnull_value_value_count = 0;
    var nonnull_value_value_classname = {};
    for (var i = 0; i < table.length; i++) {
        if (table[i]) {
            nonnull_entries++;
            current_range_value++;
            if (max_nonnull_range < current_range_value) {
                max_nonnull_range = current_range_value;
            }
            var entry = table[i];
            // The entry's weak referent is the ThreadLocal key; null means the entry is stale.
            if (entry.referent) {
                nonnull_key_count++;
            }
            if (entry.value) {
                nonnull_value_count++;
                // Values that are themselves references (e.g. WeakReference wrappers),
                // counted by the class of the object they point to.
                if (entry.value.referent) {
                    nonnull_value_value_count++;
                    var key = classof(entry.value.referent).name;
                    if (typeof nonnull_value_value_classname[key] != "undefined") {
                        nonnull_value_value_classname[key]++;
                    } else {
                        nonnull_value_value_classname[key] = 1;
                    }
                }
            }
        } else {
            current_range_value = 0;
        }
    }
    return {
        "table_length": table.length,
        "max_nonnull_range": max_nonnull_range,
        "nonnull_entries": nonnull_entries,
        "nonnull_key_count": nonnull_key_count,
        "nonnull_value_count": nonnull_value_count,
        "nonnull_value_value_count": nonnull_value_value_count,
        "nonnull_value_value_classname": nonnull_value_value_classname
    };
})(s.threadLocals.table)
from java.lang.Thread s
where /bulk/.test(s.name)
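
For anyone who wants to repeat this: assuming the heap dump was taken with jmap, the script can be pasted into jhat's OQL console, along these lines (path and PID are placeholders):

jmap -dump:live,format=b,file=/tmp/es-heap.hprof <es-pid>
jhat -J-Xmx8g /tmp/es-heap.hprof
# then open http://localhost:7000/oql/ and run the query above

jhat itself typically needs a heap comparable to the dump size, hence the -J-Xmx option.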

Here is the output for dumps of Elasticsearch in different states: http://pastebin.com/Wzi0FGVE
I suspect the cause of the high CPU usage, and of the ThreadLocal methods showing up in hot threads, is the long runs of entries in that table. But the only big difference between the dumps of an overloaded node and a resting node is in the count of non-null keys in the entries.

By the way, we don't see expungeStaleEntry in the hot threads, but rather getEntryAfterMiss.
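
That actually fits the table statistics above: getEntryAfterMiss is the linear-probe fallback after a hash miss, so its cost is bounded by the length of the contiguous run of occupied slots the lookup lands in, which is exactly the max_nonnull_range the script reports. As a tiny stand-alone model of that probing cost (an assumption about the mechanism, not the JDK source):

public class ProbeRunCost {

    // Build a table with one contiguous run of occupied slots at the front,
    // standing in for ThreadLocalMap.table.
    static Object[] buildTable(int size, int runLength) {
        Object[] table = new Object[size];
        for (int i = 0; i < runLength && i < size; i++) {
            table[i] = new Object();
        }
        return table;
    }

    // Probe linearly from 'start' until a free (null) slot, counting the
    // slots touched, which is what a miss has to do when the key is not found.
    static int probeMiss(Object[] table, int start) {
        int i = start, touched = 0;
        while (table[i] != null) {
            touched++;
            i = (i + 1) % table.length;
        }
        return touched;
    }

    public static void main(String[] args) {
        for (int run : new int[] {16, 1_000, 100_000}) {
            Object[] table = buildTable(1 << 20, run);
            System.out.println("run of " + run + " occupied slots -> "
                    + probeMiss(table, 0) + " slots scanned per miss");
        }
    }
}

Which of expungeStaleEntry or getEntryAfterMiss shows up as the hot frame may then just depend on whether those long runs are mostly stale or mostly live entries.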

So we had this issue again, which shows that this is not a reliable workaround. The issue happens whenever there is a spike in index and search call volume.

From our logs of the recent event, we see that, depending on the load on the system, too many threads end up in this state using ThreadLocals; after the load dies down, the CPU usage goes down but does not recover completely.

I don't have nearly the same info as you do, Arun, but we have been seeing something similar; see here: CPU usage slowly climbs until ES needs a restart

We're going to upgrade to ES 2.x and see if this continues; we're currently on 1.7.4. I've been following this post, and if there's something I can help provide, I'm happy to do so.

We have changed the GC from G1 to the default (not CMS), and the problem seems to have disappeared... Of course, the log now has GC entries about long pauses (5-10 s), but the cluster works normally and indexing speed is 100M/5min with a low load average... Sounds strange.
We decided to switch the GC to CMS, and later back to G1 again, to see if the problem returns.

Thanks for the update ... with CMS, even though you have long collections, do you still see hot threads stuck reclaiming ThreadLocal?

Can anyone else confirm switching away from G1 GC "fixes" these bad hot threads?

I never had ThreadLocal issues with G1 GC. For G1 GC, I removed -XX:+DisableExplicitGC from bin/elasticsearch.in.sh because it's unpredictable and unreliable: it does not prevent GC as such; instead, I assume, it prevents some ordinary full GC runs, and G1 GC requires exactly those to function properly.

Maybe the option -XX:-UseBiasedLocking should also be set, but that may depend on the CPU (!), i.e. whether it is Sandy Bridge or later, or not: http://mechanical-sympathy.blogspot.de/2011/11/java-lock-implementations.html
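
For reference, and assuming the stock 1.x bin/elasticsearch.in.sh (check your own copy), the flags under discussion sit next to the default CMS options, roughly:

-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+DisableExplicitGC    (the default I removed when running G1)
-XX:-UseBiasedLocking     (the optional addition suggested above, not a default)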

I'll see if we can test this. When I grabbed the hot threads during the event earlier this week, I saw java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry(ThreadLocal.java:617).

@mikemccand no, I don't see any mention of ThreadLocal at all. What's more, I noticed that Lucene merges don't put as much pressure on the disks as they used to.

@jprante we have -XX:+DisableExplicitGC by default with ES; could that be the cause? We will also try turning off this option. As for biased locking, I don't see any real difference between biased and non-biased locks on the graphs... moreover, our hosts run on AMD CPUs.

Well, we have tried changing the GC; here are the results:

  • Default GC: worked fine until it was changed;
  • G1 GC without -XX:+DisableExplicitGC: didn't help. One of the nodes still becomes overloaded, with the same symptoms;
  • CMS GC (original elasticsearch.in.sh without changes): works perfectly, and throughput is even greater, without long GC pauses.

Yeah, you are right with your default configuration. Well done, guys, keep up the good work!

As for G1, maybe we should submit a bug for it? But I don't know how it could be reproduced in code.

Hey Ivan,

We made the change from G1 to CMS on only one node in our cluster so far. In the 4 days since, the CMS node is looking good while one of the G1 nodes is already acting up as usual. Our problem manifests itself over a ~7-10 day period, so I'm not ready to say anything more than that just yet.

I started seeing this issue again. I have followed the same steps for one of the hosts and changed the GC from G1 to CMS; I'll update the ticket in a day or so, as our host goes into this state within a day.

I'm pretty close to calling our issue solved by moving to CMS. The G1 nodes continue to need restarting every few days, while the CMS node has now been up for almost 12 days without any issues. This is a significant record for us :smiley:

We'll likely be swapping the GC on the whole cluster here shortly.

We made the switch from G1 to CMS on the entire stack, and not only has it indeed solved the runaway CPU problem, look at how much more stable the response times have become:

Crazy to see such a difference in the higher percentiles.

Thanks to everyone here who helped figure this out, I can stop complaining about ES now :smiley:

Can you specify your machine (CPU cores, RAM, disks), OS, and JVM, including JVM parameters?

It would be interesting to reproduce. From my numbers, I see it just the other way round :slight_smile:

For us, these nodes are c3.2xlarge instances on AWS with a 7.5 GB heap. We are still on ES 1.7.4, though, and running Java 1.8.0_66. We'll be making the update to ES 2.x in a few months and will no doubt retry this whole experiment, since it was so easy to reproduce.

AWS uses Intel E5-2680 v2 CPUs (10 cores, 20 threads/vCPUs) for c3.2xlarge, which is declared with only 8 threads/vCPUs for the product. That means, IMHO, only 4 real cores are used, which is certainly a bit low for G1, because G1 produces a lot of CPU overhead; it scales much better the higher the core count is.

Also, a 7.5 GB heap is only just above the 6 GB lower bound that G1 GC is targeted at:

https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/g1_gc.html

Oracle only sees advantages in G1 when:

  • More than 50% of the Java heap is occupied with live data.
  • The rate of object allocation or promotion varies significantly.
  • The application is experiencing undesired long garbage collection or compaction pauses (longer than 0.5 to 1 second).

All of this has been improved over the years by ES: by tuning the JVM heap parameters to eliminate full CMS GC runs, and by reducing the live object count on the heap, especially in ES 2.x.

So, using G1 will lead to lower GC latency, but the price is lower throughput and lower performance.

But G1 is not responsible for any indexing performance degradation over time, as this GCViewer image shows:

CPU: 2x AMD Opteron 6274 (32 cores)
RAM: 64 GB
Disk: 8x450GB SAS2 6Gb/s (RAID1+0)
OS: RHEL 7.2
JDK: Oracle Java 8u77
Command line flags: -XX:GCLogFileSize=10485760 -XX:InitialHeapSize=17179869184 -XX:MaxGCPauseMillis=1000 -XX:MaxHeapSize=17179869184 -XX:NumberOfGCLogFiles=32 -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseG1GC -XX:+UseGCLogFileRotation

My throughput numbers for a mixed workload that indexes ~98 million documents while collecting docs from other indices by executing ~200-300 million term queries:

10 minutes: ingest: 1936194 docs, 4.85 MB/s
20 minutes: ingest: 4423150 docs, 5.35 MB/s
30 minutes: ingest: 6708534 docs, 5.45 MB/s
1 hour: ingest: 13858395 docs, 5.63 MB/s
2 hours: ingest: 27939888 docs, 5.66 MB/s
3 hours: ingest: 41825762 docs, 5.68 MB/s
4 hours: ingest: 55708007 docs, 5.71 MB/s
5 hours: ingest: 69378607 docs, 5.69 MB/s
6 hours: ingest: 82587068 docs, 5.69 MB/s
7 hours: ingest: 96188962 docs, 5.66 MB/s
8 hours: ingest: 110111536 docs, 5.67 MB/s
9 hours: ingest: 124565124 docs, 5.69 MB/s
9 hours 30 minutes: ingest: 131873275 docs, 5.72 MB/s
(total ingest after 9h42m: 192.7 GB)

Indexing performance degradation strongly depends on the resources available to ES, and mostly on whether the heap size is configured too low for the assigned workload. In that case, GC will always freak out trying to avoid OOM at all costs, no matter whether it is CMS, G1, or any other GC algorithm.

Hi Jörg,

Indeed, there is a lot of testing and tweaking one should do for their particular usage and cluster size. All of the above information is useful for anyone else coming across this thread.

Just to make sure it's clear: we never saw the indexing slowdown issue. We were seeing the other issue described in this thread, the runaway CPU problem that could only be solved by restarting the troubled node. I would hazard to say that since we do not see this issue with CMS, there are still some bugs in G1 GC. I know Elastic has been cautious about recommending G1 for exactly this kind of reason; clearly something in our index design and/or queries was tripping a very rare bug. The problem is completely gone with CMS and yet 100% reproducible with G1.

For now, all that matters is that we have a stable ES cluster. We migrated from Solr, and those instances had at the time been up for > 2 years without any issues whatsoever. We migrated to ES and had nothing but problems. I guess we should never have enabled G1 in the first place. :confused: