Management threads consuming high CPU on an idle cluster

Cluster details
100 data nodes, 3 dedicated master nodes
Data node JVM heap: 30 GB on each node
Data node cores: 64
Dedicated master cores: 64

Cluster data details
Total data: 500 TB
Shards have a uniform size of about 20 GB
Total number of indices: 3000, each with 5 primaries and 1 replica
Total number of segments: 900,000
Mappings are close to 115 MB; the cluster state is close to 158 MB.
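
That works out to 3,000 indices × 5 primaries × 2 copies = 30,000 shards, i.e. roughly 300 shards and 9,000 segments per data node (900,000 / 100), and 500 TB / 30,000 shards ≈ 17 GB per shard, consistent with the ~20 GB figure above.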

Requests hitting a data node (no indexing, no queries, just management requests):

[2017-10-06T04:36:53,017][INFO ][c.a.c.e.logger           ] [_qYdend] HEAD / - 200 OK 345 0
[2017-10-06T04:36:53,053][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_nodes filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip 200 OK 11999 36
[2017-10-06T04:36:53,056][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_nodes/_local filter_path=nodes.*.settings.tribe 200 OK 2 0
[2017-10-06T04:36:53,066][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_cluster/health/.kibana timeout=5s 200 OK 411 9
[2017-10-06T04:36:53,066][INFO ][c.a.c.e.logger           ] [_qYdend] POST /.kibana/config/_search - 403 FORBIDDEN 253 0
[2017-10-06T04:36:55,568][INFO ][c.a.c.e.logger           ] [_qYdend] HEAD / - 200 OK 345 0
[2017-10-06T04:36:55,607][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_nodes filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip 200 OK 11999 38
[2017-10-06T04:36:55,610][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_nodes/_local filter_path=nodes.*.settings.tribe 200 OK 2 1
[2017-10-06T04:36:55,624][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_cluster/health/.kibana timeout=5s 200 OK 411 14
[2017-10-06T04:36:55,625][INFO ][c.a.c.e.logger           ] [_qYdend] POST /.kibana/config/_search - 403 FORBIDDEN 253 0
[2017-10-06T04:36:58,127][INFO ][c.a.c.e.logger           ] [_qYdend] HEAD / - 200 OK 345 0
[2017-10-06T04:36:58,165][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_nodes filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip 200 OK 11999 37
[2017-10-06T04:36:58,168][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_nodes/_local filter_path=nodes.*.settings.tribe 200 OK 2 1
[2017-10-06T04:36:58,183][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_cluster/health/.kibana timeout=5s 200 OK 411 15
[2017-10-06T04:36:58,184][INFO ][c.a.c.e.logger           ] [_qYdend] POST /.kibana/config/_search - 403 FORBIDDEN 253 0
[2017-10-06T04:37:00,686][INFO ][c.a.c.e.logger           ] [_qYdend] HEAD / - 200 OK 345 0
[2017-10-06T04:37:00,724][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_nodes filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip 200 OK 11999 37
[2017-10-06T04:37:00,727][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_nodes/_local filter_path=nodes.*.settings.tribe 200 OK 2 1
[2017-10-06T04:37:00,742][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_cluster/health/.kibana timeout=5s 200 OK 411 15
[2017-10-06T04:37:00,743][INFO ][c.a.c.e.logger           ] [_qYdend] POST /.kibana/config/_search - 403 FORBIDDEN 253 0
[2017-10-06T04:37:03,247][INFO ][c.a.c.e.logger           ] [_qYdend] HEAD / - 200 OK 345 0
[2017-10-06T04:37:03,286][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_nodes filter_path=nodes.*.version%2Cnodes.*.http.publish_address%2Cnodes.*.ip 200 OK 11999 38
[2017-10-06T04:37:03,288][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_nodes/_local filter_path=nodes.*.settings.tribe 200 OK 2 1
[2017-10-06T04:37:03,302][INFO ][c.a.c.e.logger           ] [_qYdend] GET /_cluster/health/.kibana timeout=5s 200 OK 411 13
[2017-10-06T04:37:03,303][INFO ][c.a.c.e.logger           ] [_qYdend] POST /.kibana/config/_search - 403 FORBIDDEN 253 0
[2017-10-06T04:37:05,806][INFO ][c.a.c.e.logger           ] [_qYdend] HEAD / - 200 OK 345 1
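
The same five-request cycle repeats every ~2.5 seconds; it looks like Kibana's periodic health check (HEAD /, the two _nodes calls, the .kibana health check, and the config lookup). For reference, the decoded form of the _nodes call is something like the following (host is a placeholder):

curl -s 'http://localhost:9200/_nodes?filter_path=nodes.*.version,nodes.*.http.publish_address,nodes.*.ip'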

There are no GC warn/info/debug logs.

Output of hot threads on multiple nodes (truncated due to length limits):

100.8% (504ms out of 500ms) cpu usage by thread 'elasticsearch[rpg1dEX][management][T#2]'
     9/10 snapshots sharing following 38 elements                                   
       java.io.UnixFileSystem.canonicalize0(Native Method)                          
       java.io.UnixFileSystem.canonicalize(UnixFileSystem.java:172)                 
       java.io.File.getCanonicalPath(File.java:618)                                 
       java.io.FilePermission$1.run(FilePermission.java:215)                        
       java.io.FilePermission$1.run(FilePermission.java:203)                        
       java.security.AccessController.doPrivileged(Native Method)                   
       java.io.FilePermission.init(FilePermission.java:203)                         
       java.io.FilePermission.<init>(FilePermission.java:277)                       
       java.lang.SecurityManager.checkRead(SecurityManager.java:888)                
       sun.nio.fs.UnixPath.checkRead(UnixPath.java:795)                             
       sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:49)
       sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
       sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
       java.nio.file.Files.readAttributes(Files.java:1737)                          
       java.nio.file.Files.size(Files.java:2332)                                    
       org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)         
       org.apache.lucene.store.FilterDirectory.fileLength(FilterDirectory.java:67)
       org.apache.lucene.store.FilterDirectory.fileLength(FilterDirectory.java:67)
       org.elasticsearch.index.store.Store$StoreStatsCache.estimateSize(Store.java:1370)
       org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1359)
       org.elasticsearch.index.store.Store$StoreStatsCache.refresh(Store.java:1346)
       org.elasticsearch.common.util.SingleObjectCache.getOrRefresh(SingleObjectCache.java:54)
       org.elasticsearch.index.store.Store.stats(Store.java:300)                
       org.elasticsearch.index.shard.IndexShard.storeStats(IndexShard.java:702) 
       org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:177)
org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:163)......
                                 
     unique snapshot                                                            
       sun.nio.fs.UnixNativeDispatcher.stat0(Native Method)                     
       sun.nio.fs.UnixNativeDispatcher.stat(UnixNativeDispatcher.java:286)      
       sun.nio.fs.UnixFileAttributes.get(UnixFileAttributes.java:70)            
       sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:52)
       sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
       sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
       java.nio.file.Files.readAttributes(Files.java:1737)                      
       java.nio.file.Files.size(Files.java:2332)                                
       org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)
       org.apache.lucene.store.FilterDirectory.fileLength(FilterDirectory.java:67)
       org.apache.lucene.store.FilterDirectory.fileLength(FilterDirectory.java:67)
       org.elasticsearch.index.store.Store$StoreStatsCache.......
100.8% (503.8ms out of 500ms) cpu usage by thread 'elasticsearch[y_Ge1tC][management][T#4]'
     2/10 snapshots sharing following 14 elements                                   
       org.elasticsearch.index.shard.IndexShard.segmentStats(IndexShard.java:719)
       org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:210)
 org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:104)
org.elasticsearch.action.admin.cluster.stats.TransportClusterStatsAction.nodeOperation(TransportClusterStatsAction.java:53)......                                   
     2/10 snapshots sharing following 37 elements                                   
       java.io.UnixFileSystem.canonicalize0(Native Method)                          
       java.io.UnixFileSystem.canonicalize(UnixFileSystem.java:172)                 
       java.io.File.getCanonicalPath(File.java:618)                                 
       java.io.FilePermission$1.run(FilePermission.java:215)                        
       java.io.FilePermission$1.run(FilePermission.java:203)                        
       java.security.AccessController.doPrivileged(Native Method)                   
       java.io.FilePermission.init(FilePermission.java:203)                         
       java.io.FilePermission.<init>(FilePermission.java:277)                       
       java.lang.SecurityManager.checkRead(SecurityManager.java:888)                
       sun.nio.fs.UnixPath.checkRead(UnixPath.java:795)                             
       sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:49)
       sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
       sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
       java.nio.file.Files.readAttributes(Files.java:1737)                          
       java.nio.file.Files.size(Files.java:2332)                                    
       org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)         
       org.apache.lucene.store.FilterDirectory.fileLength(FilterDirectory.java:67)
       org.apache.lucene.store.FilterDirectory.fileLength(FilterDirectory.java:67)
       org.elasticsearch.index.store.Store$StoreStatsCache.estimateSize .....
                                     
     unique snapshot                                                            
       sun.misc.Unsafe.park(Native Method)                                      
       java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)   
       java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:734)
       java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:647)
       java.util.concurrent.LinkedTransferQueue.poll(LinkedTransferQueue.java:1277)
       java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
       java.lang.Thread.run(Thread.java:745).....
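
If I read these traces right, the management threads are on CPU for the entire 500 ms sampling interval, almost entirely inside stats collection: the first via TransportIndicesStatsAction (an _stats call), the second via TransportClusterStatsAction (a _cluster/stats call), so something is polling both. The hot path is IndexShard.storeStats → Store$StoreStatsCache.refresh → FSDirectory.fileLength → Files.size, and under the Java SecurityManager every file-size lookup also pays a FilePermission check that canonicalizes the path in native code (UnixFileSystem.canonicalize0). With ~9,000 segments per node and several Lucene files per segment, a single refresh turns into a very large number of stat() calls. If I remember correctly, the result is cached per shard and refreshed at most every index.store.stats_refresh_interval (default 10s), so frequent stats polling keeps re-triggering refreshes across the ~300 shards on each node. For reference, the output comes from the standard hot threads API, e.g.:

curl -s 'http://localhost:9200/_nodes/hot_threads?threads=3&interval=500ms'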

Which version of Elasticsearch are you using?

ES version is 5.3

@Igor_Motov Can you please take a look at it?

Thanks in advance!

This forum is manned by volunteers, so please do not ping people not already involved in the thread.


What type of hardware are you running on? What type of storage?

Is this behavior related to the hardware configuration? Why would I see it?
I am using SSD drives for storage.

My guess was that it was related to a slow file system, e.g. a slow or congested SAN, so I am a bit surprised if you are seeing this with locally attached SSDs.

Do you have any idea why this would happen?

Can you confirm that you are using locally attached SSDs? What does iostat -x report?

iostat -x
Linux 4.9.27-14.33.amzn1.x86_64 (ip-10-212-32-78) 10/08/2017 x86_64 (64 CPU)

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           6.74   0.00     0.28     0.01    0.00  92.96

Device:   rrqm/s  wrqm/s    r/s    w/s   rsec/s    wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
xvda        0.00    1.64   0.09   1.07     3.03     35.01     33.03      0.02  16.16   4.04   0.47
nvme0n1     0.22    7.80  17.02  25.71  4245.52  11408.52    366.34      0.10   2.25   0.19   0.79
nvme1n1     0.22    7.22  16.92  24.38  4243.19  11384.88    378.41      0.10   2.48   0.19   0.79
nvme2n1     0.22    7.15  16.89  24.20  4238.47  11377.55    380.04      0.10   2.37   0.19   0.79
nvme3n1     0.22    7.14  16.88  24.17  4233.38  11371.72    380.17      0.10   2.42   0.19   0.78
nvme4n1     0.22    7.14  16.86  24.32  4228.92  11367.81    378.68      0.10   2.33   0.19   0.78
nvme5n1     0.23    7.13  16.83  24.06  4220.99  11358.09    381.06      0.10   2.50   0.19   0.78
nvme6n1     0.23    7.15  16.80  24.03  4214.84  11352.03    381.20      0.10   2.34   0.19   0.78
nvme7n1     0.23    7.11  16.77  23.97  4207.35  11343.69    381.76      0.10   2.40   0.19   0.78
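
The disks look essentially idle to me: every nvme device is under 1% utilization at roughly 40 combined r/s + w/s with ~2.5 ms await, and %iowait is 0.01. Meanwhile 6.74% user CPU on a 64-core box is about 4 cores busy, which matches a handful of spinning management threads rather than anything I/O-bound.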

Do you think I should create an issue with Elastic?

It might be worth waiting a few days to see if anyone else chimes in.
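
In the meantime it may help to identify what is issuing the stats calls. Something along these lines should list in-flight stats tasks and their parents (host is a placeholder):

curl -s 'http://localhost:9200/_tasks?actions=*stats*&detailed=true'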
