We're seeing an annoying issue with our ES cluster: after 7-10 days (not an exact science, it's just whenever the problem becomes bad enough to notice) ES is using 2-3x more CPU than it does right after a service restart.
Let me put this a slightly different way for clarity... ES runs fine for a while, then all of a sudden we notice CPU usage start climbing even though the load (requests we're sending to ES) stays the same. It climbs and climbs until search request latency gets high enough that it starts to affect availability. So we restart ES. Boom, just like that, with the same load as pre-restart we're using 2-3x less CPU, and things stay smooth for, again, 7-10 days until it all happens over. We've been in this loop for 6 weeks now.
Has anyone seen anything like this? Here are some graphs from before/after a restart (the restart is at ~4:00 on the graph; you can see the network traffic doesn't change, but look at that CPU difference!).
This is ES 1.7.3 (Java 1.8.0_66) running on AWS c3.2xlarge instances (15GB RAM) with a 7.5GB heap. I don't see Marvel reporting anything strange during these events. Aside from latency climbing along with CPU, there is nothing out of the ordinary that I can discern.
Thanks in advance to anyone who can help shed some light, cheers!
What does the hot threads API show when the CPU utilisation is high? Is there anything in the logs that differs from when the cluster has been restarted?
I'll be sure to grab some snapshots when this happens next (in a week or so). We restarted the service last night, so right now everything is running fine. Here are a few grabs from right now, although ES is operating within normal expectations at the moment.
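For anyone wanting to pull the same info, something like the following should do it (assuming the default HTTP port 9200; adjust the host/port for your setup):

    curl -s 'localhost:9200/_nodes/hot_threads?threads=10'   # top CPU-consuming threads per node
    curl -s 'localhost:9200/_cat/thread_pool?v'               # thread pool queues/rejections for comparison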
For the sake of any other user searching around and finding this thread: over on this other post here it was mentioned that switching from G1 GC to CMS solved a similar issue for another user. We made this change on our stack and, lo and behold, the problem went away.
We even ran both GCs simultaneously on different nodes, and the G1 nodes continued to act up while the CMS node was fine. It's been 15 days now and the CMS node hasn't needed a restart. Hope this helps anyone who comes across this bizarre problem.
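For reference, the change just amounts to swapping the GC flags the JVM is started with. On ES 1.x that's JAVA_OPTS (set via bin/elasticsearch.in.sh in our case; your setup may differ). A minimal sketch, assuming you had previously enabled G1 explicitly:

    # Drop the G1 flag if you had added it:
    #   -XX:+UseG1GC
    # and run with the stock CMS settings instead:
    JAVA_OPTS="$JAVA_OPTS -XX:+UseParNewGC"
    JAVA_OPTS="$JAVA_OPTS -XX:+UseConcMarkSweepGC"
    JAVA_OPTS="$JAVA_OPTS -XX:CMSInitiatingOccupancyFraction=75"
    JAVA_OPTS="$JAVA_OPTS -XX:+UseCMSInitiatingOccupancyOnly"

These CMS flags are the out-of-the-box defaults that ship with ES 1.x, so if you never touched the GC settings you shouldn't need to change anything.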