Hi, we are doing performance testing of various types of environments to decide on a properly sized set of nodes for a number of Elasticsearch Cluster installations. When doing long-term indexing testing, we see a sharp drop of indexing performance in the beginning, stabilizing after some time. I.…

I don't have nearly as much info as you do, Arun, but we have been seeing something similar, see here: "CPU usage slowly climbs until ES needs a restart". We're going to upgrade to ES 2.x and see if this continues; we're currently on 1.7.4. I've been following this post and if there's something I can he…

We changed the GC from G1 to the default (not CMS), and the problem seems to have disappeared. Of course, the log now has GC entries reporting long pauses (5-10s), but the cluster works normally and indexing speed is 100M/5min with a low load average. Sounds strange. We decided to switch GC to CMS and later to G1 again to s…

Thanks for the update ... with CMS, even though you have long collections, do you still see hot threads stuck reclaiming ThreadLocal? Can anyone else confirm switching away from G1 GC "fixes" these bad hot threads?

I never had ThreadLocal issues with G1 GC. For G1 GC, I removed -XX:+DisableExplicitGC from bin/elasticsearch.in.sh because it's unpredictable and unreliable. It does not prevent GC; instead, I assume it prevents some ordinary full GC runs, and G1 GC requires exactly those to function properly. May…

I'll see if we can test this. When I grabbed the hot threads when this happened earlier this week, I saw java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry(ThreadLocal.java:617).
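For anyone who wants to check their own nodes for the same frame, here is a rough sketch in Python. It assumes the cluster is reachable at http://localhost:9200 and that the plain-text output of the hot threads API starts each node's section with a line beginning with ":::". The function names are made up for illustration and are not part of any Elasticsearch client.

```python
import urllib.request

# Frame reported in this thread; adjust if your stack trace differs.
NEEDLE = "ThreadLocal$ThreadLocalMap.expungeStaleEntry"


def nodes_with_frame(hot_threads_text, needle=NEEDLE):
    """Map each node header to the number of lines mentioning `needle`.

    Assumes every node section starts with a line beginning with ':::'.
    Nodes without any matching line are left out of the result.
    """
    counts = {}
    current = None
    for line in hot_threads_text.splitlines():
        if line.startswith(":::"):
            current = line[3:].strip()
            counts[current] = 0
        elif current is not None and needle in line:
            counts[current] += 1
    return {node: n for node, n in counts.items() if n > 0}


def fetch_hot_threads(base_url="http://localhost:9200"):
    """Fetch the hot threads report (base_url is an assumption)."""
    with urllib.request.urlopen(base_url + "/_nodes/hot_threads") as resp:
        return resp.read().decode("utf-8")
```

Calling nodes_with_frame(fetch_hot_threads()) every few minutes and logging the result should show whether the frame only appears once a node starts to degrade.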

@mikemccand no, I don't see any mention of ThreadLocal at all. More than that, I noticed that Lucene merges don't put as much pressure on the disks as before. @jprante we have -XX:+DisableExplicitGC by default with ES, could it be the cause? We will also try turning this option off. As for biased lo…

Well, we have tried changing the GC, and here are the results: Default GC - worked fine until it was changed; G1 GC without -XX:+DisableExplicitGC - no, it didn't help. One of the nodes still becomes overloaded with the same symptoms. CMS GC (original elasticsearch.in.sh without changes) - works perfectly, …

Hey Ivan, We made the change from G1 to CMS on only one node in our cluster so far. In the 4 days since, so far the CMS node is looking good while one of the G1 nodes is already acting up like usual. Our problem manifests itself over a ~7-10 day period so I'm not ready to say anything more than tha…
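When running a mixed cluster like this, it helps to confirm which collector each node actually started with. A minimal sketch, assuming the nodes info API at /_nodes/jvm reports a gc_collectors list per node and that G1 collectors are named with a "G1" prefix (as the JVM reports them, e.g. "G1 Young Generation"). The base URL and function names are assumptions for illustration.

```python
import json
import urllib.request


def collectors_by_node(nodes_info):
    """Map node name to its list of GC collector names.

    `nodes_info` is the parsed JSON body of GET /_nodes/jvm.
    Falls back to the node id when a node has no name field.
    """
    result = {}
    for node_id, node in nodes_info.get("nodes", {}).items():
        name = node.get("name", node_id)
        result[name] = node.get("jvm", {}).get("gc_collectors", [])
    return result


def uses_g1(collectors):
    """True if any collector name starts with 'G1' (naming is an assumption)."""
    return any(c.startswith("G1") for c in collectors)


def fetch_nodes_jvm(base_url="http://localhost:9200"):
    """Fetch JVM info for all nodes (base_url is an assumption)."""
    with urllib.request.urlopen(base_url + "/_nodes/jvm") as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Something like {name: uses_g1(c) for name, c in collectors_by_node(fetch_nodes_jvm()).items()} then gives a quick G1 versus non-G1 overview to compare against the nodes that need restarts.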

I started seeing this issue again. I have followed the same steps on one of the hosts and changed the GC from G1 to CMS. I'll update the ticket in a day or so, as our host goes into this state within a day.

I'm pretty close to calling our issue solved by moving to CMS. The G1 nodes continue to need restarting every few days, while the CMS node has now been up for almost 12 days without any issues. This is a significant record for us :smiley: We'll likely be swapping the GC on the whole cluster here sh…

Indexing performance degrading over time


travisbell (Travis Bell) April 14, 2016, 1:48pm 59

We made the switch from G1 to CMS on the entire stack. Not only has it indeed solved the runaway CPU problem, the response times have also become much more stable.

Crazy to see such a difference in the higher percentiles.

Thanks to everyone here who helped figure this out. I can stop complaining about ES now.
