High CPU usage / load average while no running queries

Hello everyone!

I am a beginner with ES, and I am encountering my first performance issue.

Setup:

  • Hardware: 2 vCore, 8 GB RAM, 80 GB SSD
  • Ubuntu 18.04
  • Java 1.8.0_191
  • Elasticsearch 6.5.1
  • Default ES configuration
  • 1 cluster, 1 node, 2 indices
  • first index: 150 000 documents (texts + metadata), 3.7 GB store
  • second index: 1 500 000 documents (tags sets), 0.8 GB store

Usage:

  • Rare writes / deletions (~once a month, +25 000 docs each time)
  • 3 queries / second at most

Problem:

Everything was working fine until the last bulk write, during which CPU usage hit 100%. The write itself completed successfully, but the load average stayed at 100%. When I shut down everything on the server, the load average returns to normal, until I restart ES; then it climbs back to 100% (with no queries running).

I have been digging through the documentation to solve this, but I feel confused and do not know where to start or what to try.

Cluster health, nodes stats and stats: https://gist.github.com/Bahanix/a946db33bdff706acb561ad57ee2ab90

Preview:

{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 10,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 10,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 50.0
}
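As a side note, the yellow status and the 10 unassigned shards are expected on a single-node cluster: with the default of one replica per shard, the replicas can never be allocated, because a replica may not live on the same node as its primary. This does not explain the CPU usage, but if you want a green status you can drop the replicas; a sketch, assuming the default localhost:9200 endpoint:

```shell
# Set the replica count to 0 on all indices; harmless on a one-node
# cluster, where the replicas could never be assigned anyway.
curl -X PUT 'localhost:9200/_all/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 0}}'
```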

Downtime is possible, and rebuilding the indices is OK too if needed.

Do you have an idea of things I could read / try?

Thank you for reading

What does the hot threads API show when CPU usage is unexpectedly high?

1 Like

Hello, /_nodes/hot_threads returns:

::: {sHsHRsf}{sHsHRsflTtigKOko_FvnVg}{2IQyThMHQF23eoouaLsa7w}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=8165040128, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
   Hot threads at 2019-01-01T14:26:03.432, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
    0.0% (145.1micros out of 500ms) cpu usage by thread 'elasticsearch[sHsHRsf][[timer]]'
     10/10 snapshots sharing following 2 elements
       java.lang.Thread.sleep(Native Method)
       org.elasticsearch.threadpool.ThreadPool$CachedTimeThread.run(ThreadPool.java:547)

Do you see anything in the logs about long or frequent GC?

1 Like

Each request/query ends in these GC logs:

2019-01-02T14:33:52.738+0100: 93297.860: Total time for which application threads were stopped: 0.1159665 seconds, Stopping threads took: 0.0469178 seconds
2019-01-02T14:33:54.286+0100: 93299.408: Total time for which application threads were stopped: 0.0003418 seconds, Stopping threads took: 0.0000811 seconds
2019-01-02T14:34:14.733+0100: 93319.855: Total time for which application threads were stopped: 0.0016874 seconds, Stopping threads took: 0.0013020 seconds
2019-01-02T14:34:14.734+0100: 93319.856: Total time for which application threads were stopped: 0.0001006 seconds, Stopping threads took: 0.0000226 seconds
2019-01-02T14:34:14.945+0100: 93320.068: Total time for which application threads were stopped: 0.0004190 seconds, Stopping threads took: 0.0002370 seconds
2019-01-02T14:34:14.953+0100: 93320.075: Total time for which application threads were stopped: 0.0051206 seconds, Stopping threads took: 0.0014569 seconds
2019-01-02T14:34:15.987+0100: 93321.110: Total time for which application threads were stopped: 0.0137072 seconds, Stopping threads took: 0.0134333 seconds
2019-01-02T14:34:16.999+0100: 93322.121: Total time for which application threads were stopped: 0.0105055 seconds, Stopping threads took: 0.0089762 seconds
2019-01-02T14:34:19.053+0100: 93324.175: Total time for which application threads were stopped: 0.0539037 seconds, Stopping threads took: 0.0526524 seconds
2019-01-02T14:34:20.036+0100: 93325.158: Total time for which application threads were stopped: 0.0178477 seconds, Stopping threads took: 0.0167766 seconds
2019-01-02T14:34:20.363+0100: 93325.485: Total time for which application threads were stopped: 0.0002115 seconds, Stopping threads took: 0.0000587 seconds
2019-01-02T14:34:45.088+0100: 93350.210: Total time for which application threads were stopped: 0.0004575 seconds, Stopping threads took: 0.0001255 seconds
2019-01-02T14:34:45.089+0100: 93350.211: Total time for which application threads were stopped: 0.0001138 seconds, Stopping threads took: 0.0000229 seconds
2019-01-02T14:34:46.115+0100: 93351.237: Total time for which application threads were stopped: 0.0254978 seconds, Stopping threads took: 0.0252771 seconds
2019-01-02T14:34:46.201+0100: 93351.323: Total time for which application threads were stopped: 0.0109330 seconds, Stopping threads took: 0.0107341 seconds
2019-01-02T14:34:46.246+0100: 93351.368: Total time for which application threads were stopped: 0.0396406 seconds, Stopping threads took: 0.0394395 seconds
2019-01-02T14:34:46.717+0100: 93351.839: Total time for which application threads were stopped: 0.0076333 seconds, Stopping threads took: 0.0073539 seconds
2019-01-02T14:34:49.480+0100: 93354.602: Total time for which application threads were stopped: 0.0098339 seconds, Stopping threads took: 0.0040309 seconds
2019-01-02T14:35:49.870+0100: 93414.997: Total time for which application threads were stopped: 0.6677784 seconds, Stopping threads took: 0.1241825 seconds
2019-01-02T14:35:50.076+0100: 93415.198: Total time for which application threads were stopped: 0.0014716 seconds, Stopping threads took: 0.0000543 seconds
2019-01-02T14:35:54.090+0100: 93419.212: Total time for which application threads were stopped: 0.0125148 seconds, Stopping threads took: 0.0092182 seconds
2019-01-02T14:35:55.091+0100: 93420.213: Total time for which application threads were stopped: 0.0007178 seconds, Stopping threads took: 0.0005540 seconds
2019-01-02T14:35:56.093+0100: 93421.216: Total time for which application threads were stopped: 0.0011982 seconds, Stopping threads took: 0.0000786 seconds
2019-01-02T14:36:01.104+0100: 93426.226: Total time for which application threads were stopped: 0.0094146 seconds, Stopping threads took: 0.0005936 seconds
2019-01-02T14:36:01.920+0100: 93427.042: Total time for which application threads were stopped: 0.0003037 seconds, Stopping threads took: 0.0000605 seconds
2019-01-02T14:36:50.955+0100: 93476.078: Total time for which application threads were stopped: 0.0259146 seconds, Stopping threads took: 0.0156997 seconds
2019-01-02T14:36:56.042+0100: 93481.165: Total time for which application threads were stopped: 0.0287458 seconds, Stopping threads took: 0.0210005 seconds
2019-01-02T14:36:57.803+0100: 93482.925: Total time for which application threads were stopped: 0.0209145 seconds, Stopping threads took: 0.0040108 seconds
2019-01-02T14:36:57.819+0100: 93482.942: Total time for which application threads were stopped: 0.0043863 seconds, Stopping threads took: 0.0042021 seconds

And sometimes I get this one in the middle of the others:

2019-01-02T14:37:01.132+0100: 93486.254: [GC (Allocation Failure) 2019-01-02T14:37:01.139+0100: 93486.261: [ParNew
Desired survivor size 8716288 bytes, new threshold 6 (max 6)
- age   1:     542824 bytes,     542824 total
- age   2:       3744 bytes,     546568 total
- age   3:       2848 bytes,     549416 total
- age   4:        624 bytes,     550040 total
- age   5:        440 bytes,     550480 total
- age   6:       4576 bytes,     555056 total
: 136793K->772K(153344K), 7.9592690 secs] 287927K->152081K(1031552K), 7.9669336 secs] [Times: user=4.82 sys=2.76, real=7.96 secs]

Queries now take up to 10 seconds, where 30-500 ms was typical before this problem.
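The worst safepoint pauses can be picked out of this log format with a quick awk filter; a sketch, where the log path assumes the default Debian/Ubuntu package layout:

```shell
# Print the timestamp and duration of every safepoint pause longer
# than 100 ms. In these "Total time" lines the pause duration in
# seconds is whitespace-separated field 11.
awk '/Total time for which application threads were stopped/ {
  if ($11 + 0 > 0.1) print $1, $11 "s"
}' /var/log/elasticsearch/gc.log.0.current
```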

Did you run GET /_nodes/hot_threads while the CPU was at 100%? The output above indicates that the node is completely idle, and the GC log also doesn't look like it came from a busy node.

Are you sure it's the Elasticsearch process consuming CPU? In your original post you said you saw the "load average" at 100%. Load average is a system-wide measure. If you run top do you see Elasticsearch itself at 100% CPU?
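To attribute the CPU to one process rather than the whole system, you can pin top (or ps) to the Elasticsearch JVM; a sketch, where the pgrep pattern is an assumption that matches the default launcher class:

```shell
# Find the Elasticsearch JVM and show only that process in one
# batch-mode top sample.
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
top -b -n 1 -p "$ES_PID"

# Or read the current %CPU and cumulative CPU time from ps.
ps -o pid,pcpu,time,comm -p "$ES_PID"
```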

1 Like

I think I did not wait long enough between stopping all server services except ES and checking the load average. When all my services are up, I see:

hot_threads:

::: {sHsHRsf}{sHsHRsflTtigKOko_FvnVg}{2IQyThMHQF23eoouaLsa7w}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=8165040128, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
   Hot threads at 2019-01-03T16:35:06.487, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   31.4% (156.8ms out of 500ms) cpu usage by thread 'elasticsearch[sHsHRsf][search][T#4]'
     3/10 snapshots sharing following 32 elements
       org.apache.lucene.index.TermContext.build(TermContext.java:99)
       org.apache.lucene.search.TermQuery.createWeight(TermQuery.java:211)
       org.apache.lucene.search.IndexSearcher.createWeight(IndexSearcher.java:751)
       org.elasticsearch.search.internal.ContextIndexSearcher.createWeight(ContextIndexSearcher.java:148)
       org.apache.lucene.search.BooleanWeight.<init>(BooleanWeight.java:54)
       org.apache.lucene.search.BooleanQuery.createWeight(BooleanQuery.java:207)
       org.apache.lucene.search.IndexSearcher.createWeight(IndexSearcher.java:751)
...
top - 17:38:22 up 41 days,  7:14,  1 user,  load average: 7.38, 7.13, 6.68
Tasks: 117 total,   3 running,  73 sleeping,   0 stopped,   0 zombie
%Cpu(s): 43.0 us, 53.8 sy,  0.0 ni,  0.3 id,  0.0 wa,  0.0 hi,  0.7 si,  2.2 st
KiB Mem :  7973672 total,   132904 free,  5420600 used,  2420168 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  2142024 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                     
27523 elastic+  20   0 8124680 2.521g 1.205g S  81.4 33.2 730:12.18 java                                                                        
12240 postgres  20   0  328300 154316 149232 R  20.8  1.9  28:08.07 postgres                                                                    
10225 postgres  20   0  328796 155664 149860 R  19.5  2.0  50:05.82 postgres                                                                    
 9563 postgres  20   0  327540 154192 149624 S  18.2  1.9  56:18.46 postgres                                                                    
31575 me        20   0 1330936 558840   8400 S  11.6  7.0 378:39.91 ruby                                                                        
 9988 postgres  20   0  318268  24964  23024 S   6.9  0.3  12:02.92 postgres                                                                    
16391 me        20   0   44560   4024   3388 R   1.9  0.1   0:04.22 top                                                                         
 9951 www-data  20   0  172812  11504   6036 S   0.9  0.1  11:38.94 nginx

Postgres is probably the culprit. I will dig in that direction. Thank you very much everyone for your time. I will update this thread with my solution once I have one; perhaps it will be PG optimization or a simple server upgrade.

I tried a reboot, and ES still consumes 100% CPU even though no other process is active:

Tasks: 107 total,   1 running,  62 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.2 us, 53.6 sy,  0.0 ni, 42.8 id,  2.9 wa,  0.0 hi,  0.3 si,  0.2 st
KiB Mem :  7973672 total,  6701384 free,   921216 used,   351072 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  6803540 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  801 elastic+  20   0 1549928 790636   9368 S  99.3  9.9   2:01.39 java
  971 www-data  20   0  172764  11640   7028 S   3.7  0.1   0:00.75 nginx
  340 root      20   0       0      0      0 S   1.7  0.0   0:00.31 jbd2/sda1-8
  882 root      20   0  443560  21348   9712 S   0.7  0.3   0:03.02 fail2ban-server

Could you grab the hot threads output and the GC logs and then confirm again that it's still in this state?

1 Like
top - 11:20:52 up 13:00,  1 user,  load average: 0.45, 0.56, 0.54
Tasks:  99 total,   1 running,  57 sleeping,   0 stopped,   0 zombie
%Cpu(s): 19.1 us,  9.8 sy,  0.0 ni, 69.7 id,  0.0 wa,  0.0 hi,  0.4 si,  1.0 st
KiB Mem :  7973672 total,  1004596 free,  3067884 used,  3901192 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  4599532 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                     
 2718 elastic+  20   0 9184920 2.402g  85876 S  75.1 31.6  58:41.00 java                                                                        
10961 root      20   0       0      0      0 I   5.0  0.0   0:01.62 kworker/u4:2                                                                
 3611 me        20   0 1315028 205972   9068 S   1.7  2.6   5:24.45 ruby                                                                        
 4050 www-data  20   0  172344  10600   5576 S   1.3  0.1   5:36.76 nginx                                                                       
   16 root      20   0       0      0      0 S   0.3  0.0   0:12.99 ksoftirqd/1                                                                 
11221 me        20   0   44536   4044   3424 R   0.3  0.1   0:01.27 top

The hot threads output seems empty:

curl localhost:9200/_nodes/hot_threads?pretty
::: {sHsHRsf}{sHsHRsflTtigKOko_FvnVg}{gb41s4T6Tc-vQb1808EXZA}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=8165040128, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
   Hot threads at 2019-01-04T10:22:38.277, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
sudo tail -f /var/log/elasticsearch/gc.log.0.current
2019-01-04T11:22:37.924+0100: 44205.612: Total time for which application threads were stopped: 0.0040596 seconds, Stopping threads took: 0.0038789 seconds
2019-01-04T11:22:37.947+0100: 44205.636: Total time for which application threads were stopped: 0.0068627 seconds, Stopping threads took: 0.0066099 seconds
2019-01-04T11:22:38.911+0100: 44206.599: Total time for which application threads were stopped: 0.0022615 seconds, Stopping threads took: 0.0020801 seconds
2019-01-04T11:22:38.946+0100: 44206.634: Total time for which application threads were stopped: 0.0018438 seconds, Stopping threads took: 0.0015544 seconds
2019-01-04T11:22:38.959+0100: 44206.647: Total time for which application threads were stopped: 0.0019210 seconds, Stopping threads took: 0.0016778 seconds
2019-01-04T11:22:38.972+0100: 44206.660: Total time for which application threads were stopped: 0.0003531 seconds, Stopping threads took: 0.0001227 seconds
2019-01-04T11:22:38.983+0100: 44206.671: Total time for which application threads were stopped: 0.0005231 seconds, Stopping threads took: 0.0003038 seconds
2019-01-04T11:22:38.999+0100: 44206.687: Total time for which application threads were stopped: 0.0014472 seconds, Stopping threads took: 0.0012186 seconds
2019-01-04T11:22:39.011+0100: 44206.699: Total time for which application threads were stopped: 0.0002256 seconds, Stopping threads took: 0.0000594 seconds

(and nothing new is written during the tail -f)

top again after this:

top - 11:25:59 up 13:05,  1 user,  load average: 0.31, 0.46, 0.50
Tasks:  98 total,   2 running,  57 sleeping,   0 stopped,   0 zombie
%Cpu(s): 13.1 us, 20.1 sy,  0.0 ni, 66.0 id,  0.0 wa,  0.0 hi,  0.6 si,  0.2 st
KiB Mem :  7973672 total,  1001344 free,  3068556 used,  3903772 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  4598860 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                     
 2718 elastic+  20   0 9184920 2.402g  85876 S  90.7 31.6  59:22.02 java                                                                        
    7 root      20   0       0      0      0 S   0.3  0.0   0:08.74 ksoftirqd/0                                                                 
   16 root      20   0       0      0      0 S   0.3  0.0   0:13.29 ksoftirqd/1                                                                 
  882 root      20   0  447792  20860   6804 S   0.3  0.3  13:14.11 fail2ban-server                                                             
11260 root      20   0       0      0      0 I   0.3  0.0   0:00.64 kworker/u4:1                                                                
    1 root      20   0  225216   8840   6568 S   0.0  0.1   4:08.53 systemd                                                                     
    2 root      20   0       0      0      0 S   0.0  0.0   0:04.01 kthreadd

I'm struggling to think what the JVM could be doing if it's not GC and it's not reporting any activity on actual threads. Could you confirm the process is still in this state and try a jstack dump?

1 Like
top - 13:56:52 up 1 day, 15:36,  1 user,  load average: 0.50, 0.43, 0.44
Tasks:  97 total,   1 running,  57 sleeping,   0 stopped,   0 zombie
%Cpu(s): 13.3 us,  6.7 sy,  0.0 ni, 79.4 id,  0.0 wa,  0.0 hi,  0.2 si,  0.3 st
KiB Mem :  7973672 total,   754036 free,  3134616 used,  4085020 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  4532692 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2718 elastic+  20   0 9191088 2.403g  86000 S  45.2 31.6 167:56.23 java
14729 me        20   0  107988   3408   2396 S   0.7  0.0   0:00.20 sshd
14801 me        20   0   42784   3948   3352 R   0.3  0.0   0:02.18 top
    1 root      20   0  225216   8840   6568 S   0.0  0.1  12:10.63 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:34.40 kthreadd

The server is still essentially idle apart from ES (nothing has changed since my previous message).

jstack -F <pid>:

Hi,

There is a common pattern in all the top outputs you've pasted: %sy in the %Cpu(s) line is high relative to %us, e.g.:

As per top docs:

sy, system : time running kernel processes

Also, in the GC log posted earlier in this thread there's one entry showing a very long time with application threads stopped (0.66 s):

2019-01-02T14:35:49.870+0100: 93414.997: Total time for which application threads were stopped: 0.6677784 seconds, Stopping threads took: 0.1241825 seconds

Given that your machine is a VPS (albeit using some local SSD) it makes sense to check the performance of the IO subsystem, ideally at the same time that you are checking CPU utilization. A couple of handy commands you can use to collect 1s samples for 5 mins for IO/CPU are:

TZ=UTC iostat -t -m -d -y -x -g ALL 1 300 >> iostat_$(hostname).txt
TZ=UTC mpstat -P ALL 1 300 >> mpstat_$(hostname).txt

More generally, it's highly recommended to use a structured methodology to identify and analyze potential system bottlenecks, such as the USE Method; the CPU and Storage Device Utilization and Saturation checks are a good place to begin.

3 Likes

Thanks everyone for your help

Top of the day:

top - 16:02:12 up 3 days, 17:41,  3 users,  load average: 0.31, 0.39, 0.42
Tasks: 111 total,   1 running,  71 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.0 us,  0.7 sy,  0.0 ni, 97.8 id,  0.5 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  7973672 total,   650964 free,  3299880 used,  4022828 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  4263848 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                    
25273 elastic+  20   0 4750452 2.307g  27604 S   2.0 30.3 148:42.89 java
  340 root      20   0       0      0      0 S   0.7  0.0  29:58.99 jbd2/sda1-8
    8 root      20   0       0      0      0 I   0.3  0.0  12:22.90 rcu_sched
13953 me        20   0  107988   3488   2476 S   0.3  0.0   0:00.69 sshd
14460 root      20   0       0      0      0 I   0.3  0.0   0:06.70 kworker/u4:1
14716 me        20   0   42768   4008   3408 R   0.3  0.1   0:02.49 top

Despite the lower load average, my SSH session still lags and I am still unable to run anything properly.

I did not know about the USE Method. I will give it a look and contact OVH customer support; perhaps the problem is broader than my VM.

I finally found it: my server's default configuration had vm.overcommit_ratio = 50 and vm.overcommit_memory = 2. I do not understand exactly how this slowed everything down (perhaps a memory threshold crossed since my last index update?), but changing it to vm.overcommit_memory = 0 solved everything.
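For anyone hitting the same thing: with vm.overcommit_memory = 2 and no swap, the kernel caps total committable memory at roughly overcommit_ratio% of RAM (about 4 GB out of 8 GB here), so allocations and mmap calls can start failing or stalling well before RAM is exhausted. A sketch of checking and reverting the setting (the sysctl.d filename is just an example):

```shell
# Inspect the current policy (2 = strict accounting, 0 = heuristic)
# and the ratio used by strict accounting.
cat /proc/sys/vm/overcommit_memory /proc/sys/vm/overcommit_ratio

# Revert to the default heuristic policy at runtime...
sudo sysctl -w vm.overcommit_memory=0

# ...and persist it across reboots.
echo 'vm.overcommit_memory = 0' | sudo tee /etc/sysctl.d/99-overcommit.conf
```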

I am sorry, everyone; Elasticsearch was perfectly fine. I really appreciate all the meaningful help you provided during my despair. I learned a couple of things about sysadmin work, and also how to inspect Elasticsearch, which will surely help me in the future.

Thank you all!

1 Like

You're welcome @ES-beginner, I learned something too :slight_smile:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.