Management threads consuming high CPU on an idle cluster


(Itiyama Sadana) #1

Cluster details
100 data nodes, 3 dedicated master nodes
Data node JVM heap: 30 GB on each node
Data node cores: 64
Dedicated master cores: 64

Cluster data details
Total data: 500 TB
Shards have a uniform size of 20 GB
Total number of indices: 3000, each with 5 primary shards and 1 replica
Total number of segments: 900,000
Mappings are close to 115 MB; the cluster state is close to 158 MB.
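
For scale, the figures above imply the following per-node load (simple arithmetic, using only the numbers quoted in this post):

```python
# Back-of-envelope derived purely from the quoted cluster figures.
indices, primaries, replicas = 3000, 5, 1
data_nodes = 100
segments = 900_000

total_shards = indices * primaries * (1 + replicas)  # primary + replica copies
print(total_shards)                   # 30000 shard copies cluster-wide
print(total_shards // data_nodes)     # 300 shard copies per data node
print(segments // total_shards)       # 30 segments per shard on average
```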

Requests on a data node (no indexing, no search, just management requests):

[2017-10-06T04:36:53,017][INFO ][c.a.c.e.logger           ] [_qYdend] HEAD / - 200 OK 345 0
[2017-10-06T04:36:53,053][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_nodes filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip 200 OK 11999 36
[2017-10-06T04:36:53,056][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_nodes/_local filter_path=nodes.*.settings.tribe 200 OK 2 0
[2017-10-06T04:36:53,066][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_cluster/health/.kibana timeout=5s 200 OK 411 9
[2017-10-06T04:36:53,066][INFO ][c.a.c.e.logger           ] [_qYdend] POST /.kibana/config/_search - 403 FORBIDDEN 253 0
[2017-10-06T04:36:55,568][INFO ][c.a.c.e.logger           ] [_qYdend] HEAD / - 200 OK 345 0
[2017-10-06T04:36:55,607][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_nodes filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip 200 OK 11999 38
[2017-10-06T04:36:55,610][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_nodes/_local filter_path=nodes.*.settings.tribe 200 OK 2 1
[2017-10-06T04:36:55,624][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_cluster/health/.kibana timeout=5s 200 OK 411 14
[2017-10-06T04:36:55,625][INFO ][c.a.c.e.logger           ] [_qYdend] POST /.kibana/config/_search - 403 FORBIDDEN 253 0
[2017-10-06T04:36:58,127][INFO ][c.a.c.e.logger           ] [_qYdend] HEAD / - 200 OK 345 0
[2017-10-06T04:36:58,165][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_nodes filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip 200 OK 11999 37
[2017-10-06T04:36:58,168][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_nodes/_local filter_path=nodes.*.settings.tribe 200 OK 2 1
[2017-10-06T04:36:58,183][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_cluster/health/.kibana timeout=5s 200 OK 411 15
[2017-10-06T04:36:58,184][INFO ][c.a.c.e.logger           ] [_qYdend] POST /.kibana/config/_search - 403 FORBIDDEN 253 0
[2017-10-06T04:37:00,686][INFO ][c.a.c.e.logger           ] [_qYdend] HEAD / - 200 OK 345 0
[2017-10-06T04:37:00,724][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_nodes filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip 200 OK 11999 37
[2017-10-06T04:37:00,727][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_nodes/_local filter_path=nodes.*.settings.tribe 200 OK 2 1
[2017-10-06T04:37:00,742][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_cluster/health/.kibana timeout=5s 200 OK 411 15
[2017-10-06T04:37:00,743][INFO ][c.a.c.e.logger           ] [_qYdend] POST /.kibana/config/_search - 403 FORBIDDEN 253 0
[2017-10-06T04:37:03,247][INFO ][c.a.c.e.logger           ] [_qYdend] HEAD / - 200 OK 345 0
[2017-10-06T04:37:03,286][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_nodes filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip 200 OK 11999 38
[2017-10-06T04:37:03,288][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_nodes/_local filter_path=nodes.*.settings.tribe 200 OK 2 1
[2017-10-06T04:37:03,302][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_cluster/health/.kibana timeout=5s 200 OK 411 13
[2017-10-06T04:37:03,303][INFO ][c.a.c.e.logger           ] [_qYdend] POST /.kibana/config/_search - 403 FORBIDDEN 253 0
[2017-10-06T04:37:05,806][INFO ][c.a.c.e.logger           ] [_qYdend] HEAD / - 200 OK 345 1
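
The request pattern above (HEAD /, GET /_nodes with that filter_path, GET /_cluster/health/.kibana) looks like a client health-check loop rather than user traffic; parsing the HEAD / timestamps from the log confirms a fixed cadence of roughly 2.5 seconds, which would be consistent with Kibana's default Elasticsearch health-check delay of 2500 ms (an inference, not something stated in the thread):

```python
from datetime import datetime

# HEAD / timestamps copied from the access log above
stamps = ["04:36:53,017", "04:36:55,568", "04:36:58,127",
          "04:37:00,686", "04:37:03,247", "04:37:05,806"]
times = [datetime.strptime(s, "%H:%M:%S,%f") for s in stamps]
deltas = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
print(deltas)  # every gap is ~2.55 s
```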

There are no GC warn/info/debug logs.


(Itiyama Sadana) #2

Output of hot threads on multiple nodes (truncated due to limits):

100.8% (504ms out of 500ms) cpu usage by thread 'elasticsearch[rpg1dEX][management][T#2]'
     9/10 snapshots sharing following 38 elements                                   
       java.io.UnixFileSystem.canonicalize0(Native Method)                          
       java.io.UnixFileSystem.canonicalize(UnixFileSystem.java:172)                 
       java.io.File.getCanonicalPath(File.java:618)                                 
       java.io.FilePermission$1.run(FilePermission.java:215)                        
       java.io.FilePermission$1.run(FilePermission.java:203)                        
       java.security.AccessController.doPrivileged(Native Method)                   
       java.io.FilePermission.init(FilePermission.java:203)                         
       java.io.FilePermission.<init>(FilePermission.java:277)                       
       java.lang.SecurityManager.checkRead(SecurityManager.java:888)                
       sun.nio.fs.UnixPath.checkRead(UnixPath.java:795)                             
       sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:49)
       sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
       sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
       java.nio.file.Files.readAttributes(Files.java:1737)                          
       java.nio.file.Files.size(Files.java:2332)                                    
       org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)         
       org.apache.lucene.store.FilterDirectory.fileLength(FilterDirectory.java:67)
       org.apache.lucene.store.FilterDirectory.fileLength(FilterDirectory.java:67)
       org.elasticsearch.index.store.Store$StoreStatsCache.estimateSize(Store.java:1370)
       org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1359)
       org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1346)
       org.elasticsearch.common.util.SingleObjectCache.getOrRefresh(SingleObjectCache.java:54)
       org.elasticsearch.index.store.Store.stats(Store.java:300)                
       org.elasticsearch.index.shard.IndexShard.storeStats(IndexShard.java:702) 
       org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:177)
org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:163)......
                                 
     unique snapshot                                                            
       sun.nio.fs.UnixNativeDispatcher.stat0(Native Method)                     
       sun.nio.fs.UnixNativeDispatcher.stat(UnixNativeDispatcher.java:286)      
       sun.nio.fs.UnixFileAttributes.get(UnixFileAttributes.java:70)            
       sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:52)
       sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
       sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
       java.nio.file.Files.readAttributes(Files.java:1737)                      
       java.nio.file.Files.size(Files.java:2332)                                
       org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)
       org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)     
       org.apache.lucene.store.FilterDirectory.fileLength(FilterDirectory.java:67)
       org.apache.lucene.store.FilterDirectory.fileLength(FilterDirectory.java:67)
       org.elasticsearch.index.store.Store$StoreStatsCache.......
100.8% (503.8ms out of 500ms) cpu usage by thread 'elasticsearch[y_Ge1tC][management][T#4]'
     2/10 snapshots sharing following 14 elements                                   
       org.elasticsearch.index.shard.IndexShard.segmentStats(IndexShard.java:719)
       org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:210)
 org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:104)
org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:53)......                                   
     2/10 snapshots sharing following 37 elements                                   
       java.io.UnixFileSystem.canonicalize0(Native Method)                          
       java.io.UnixFileSystem.canonicalize(UnixFileSystem.java:172)                 
       java.io.File.getCanonicalPath(File.java:618)                                 
       java.io.FilePermission$1.run(FilePermission.java:215)                        
       java.io.FilePermission$1.run(FilePermission.java:203)                        
       java.security.AccessController.doPrivileged(Native Method)                   
       java.io.FilePermission.init(FilePermission.java:203)                         
       java.io.FilePermission.<init>(FilePermission.java:277)                       
       java.lang.SecurityManager.checkRead(SecurityManager.java:888)                
       sun.nio.fs.UnixPath.checkRead(UnixPath.java:795)                             
       sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:49)
       sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
       sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
       java.nio.file.Files.readAttributes(Files.java:1737)                          
       java.nio.file.Files.size(Files.java:2332)                                    
       org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)         
       org.apache.lucene.store.FilterDirectory.fileLength(FilterDirectory.java:67)
       org.apache.lucene.store.FilterDirectory.fileLength(FilterDirectory.java:67)
       org.elasticsearch.index.store.Store$StoreStatsCache.estimateSiz .....
                                     
     unique snapshot                                                            
       sun.misc.Unsafe.park(Native Method)                                      
       java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)   
       java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:734)
       java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:647)
       java.util.concurrent.LinkedTransferQueue.poll(LinkedTransferQueue.java:1277)
       java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
       java.lang.Thread.run(Thread.java:745).....
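
The stacks above show where the time goes: each indices-stats or cluster-stats call refreshes Store$StoreStatsCache, which calls Files.size() on every file in every shard directory, and with the Java SecurityManager enabled each size lookup also pays a FilePermission check that canonicalizes the path natively. A rough estimate of the per-node cost per refresh (the ~10 files per segment figure is an assumed typical Lucene file count, not a number from this thread):

```python
# Estimated file-size lookups per data node for one full stats refresh.
# All inputs except FILES_PER_SEGMENT come from the thread;
# FILES_PER_SEGMENT is an assumption about typical Lucene segments.
TOTAL_SHARDS = 3000 * 5 * 2        # indices x primaries x (primary + replica)
DATA_NODES = 100
SEGMENTS = 900_000
FILES_PER_SEGMENT = 10             # assumed: .si, .cfs or per-codec files, etc.

shards_per_node = TOTAL_SHARDS // DATA_NODES            # 300
segments_per_node = SEGMENTS // DATA_NODES              # 9000
stat_calls = segments_per_node * FILES_PER_SEGMENT      # 90000
print(shards_per_node, segments_per_node, stat_calls)
```

Under these assumptions, each stats refresh triggers on the order of 90,000 stat() plus path-canonicalization pairs per node; with several clients polling every few seconds, that alone can keep the management thread pool busy on an otherwise idle cluster.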

(Christian Dahlqvist) #3

Which version of Elasticsearch are you using?


(Itiyama Sadana) #4

ES version is 5.3


(Itiyama Sadana) #5

@Igor_Motov Can you please take a look at this?

Thanks in advance!


(Christian Dahlqvist) #6

This forum is manned by volunteers, so please do not ping people not already involved in the thread.


(Christian Dahlqvist) #7

What type of hardware are you running on? What type of storage?


(Itiyama Sadana) #8

Is this related to the hardware configuration? Why would I see this behavior?
I am using SSD drives for storage.


(Christian Dahlqvist) #9

My guess was that it was related to a slow file system, e.g. a slow or congested SAN, so I am a bit surprised if you are seeing this with locally attached SSDs.


(Itiyama Sadana) #10

Do you have any idea why this would happen?


(Christian Dahlqvist) #11

Can you confirm that you are using locally attached SSDs? What does iostat -x report?


(Itiyama Sadana) #12

iostat -x
Linux 4.9.27-14.33.amzn1.x86_64 (ip-10-212-32-78)   10/08/2017   x86_64   (64 CPU)

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           6.74   0.00     0.28     0.01    0.00  92.96

Device:   rrqm/s  wrqm/s    r/s    w/s   rsec/s    wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
xvda        0.00    1.64   0.09   1.07     3.03     35.01     33.03      0.02  16.16   4.04   0.47
nvme0n1     0.22    7.80  17.02  25.71  4245.52  11408.52    366.34      0.10   2.25   0.19   0.79
nvme1n1     0.22    7.22  16.92  24.38  4243.19  11384.88    378.41      0.10   2.48   0.19   0.79
nvme2n1     0.22    7.15  16.89  24.20  4238.47  11377.55    380.04      0.10   2.37   0.19   0.79
nvme3n1     0.22    7.14  16.88  24.17  4233.38  11371.72    380.17      0.10   2.42   0.19   0.78
nvme4n1     0.22    7.14  16.86  24.32  4228.92  11367.81    378.68      0.10   2.33   0.19   0.78
nvme5n1     0.23    7.13  16.83  24.06  4220.99  11358.09    381.06      0.10   2.50   0.19   0.78
nvme6n1     0.23    7.15  16.80  24.03  4214.84  11352.03    381.20      0.10   2.34   0.19   0.78
nvme7n1     0.23    7.11  16.77  23.97  4207.35  11343.69    381.76      0.10   2.40   0.19   0.78


(Itiyama Sadana) #13

Do you think I should create an issue with Elastic?


(Christian Dahlqvist) #14

It might be worth waiting a few days to see if anyone else chimes in.


(system) #15

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.