Elasticsearch Master CPU being used 100% at times

The CPU usage on my Elasticsearch active master node rises to 100% at times, stays there for a few minutes to a few hours, and then comes back down to a normal 4-5%.

I ran the hot_threads API against the node and found the threads below consuming all of the CPU time.

  Hot threads at 2023-03-16T04:45:50.211Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

   100.1% (500.3ms out of 500ms) cpu usage by thread 'elasticsearch[node-name-xxx][management][T#1]'
     10/10 snapshots sharing following 23 elements
       java.base@17/com.sun.crypto.provider.PBKDF2KeyImpl.<init>(PBKDF2KeyImpl.java:119)
       java.base@17/com.sun.crypto.provider.PBKDF2Core.engineGenerateSecret(PBKDF2Core.java:70)
       java.base@17/javax.crypto.SecretKeyFactory.generateSecret(SecretKeyFactory.java:340)
       org.elasticsearch.license.CryptUtils.deriveSecretKey(CryptUtils.java:188)
       org.elasticsearch.license.CryptUtils.decrypt(CryptUtils.java:136)
       org.elasticsearch.license.CryptUtils.decrypt(CryptUtils.java:116)
       org.elasticsearch.license.SelfGeneratedLicense.verify(SelfGeneratedLicense.java:73)
       org.elasticsearch.license.LicenseService.getLicense(LicenseService.java:558)
       org.elasticsearch.license.LicenseService.getLicense(LicenseService.java:548)
       org.elasticsearch.license.LicenseService.getLicense(LicenseService.java:347)
       org.elasticsearch.license.TransportGetLicenseAction.masterOperation(TransportGetLicenseAction.java:44)
       org.elasticsearch.license.TransportGetLicenseAction.masterOperation(TransportGetLicenseAction.java:23)
       app//org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:90)
       app//org.elasticsearch.action.support.master.TransportMasterNodeAction.executeMasterOperation(TransportMasterNodeAction.java:99)
       app//org.elasticsearch.action.support.master.TransportMasterNodeAction.access$400(TransportMasterNodeAction.java:48)
       app//org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.lambda$doStart$3(TransportMasterNodeAction.java:170)
       app//org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$$Lambda$5947/0x00000008019586c8.accept(Unknown Source)
       app//org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)
       app//org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:737)
       app//org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
       java.base@17/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
       java.base@17/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
       java.base@17/java.lang.Thread.run(Thread.java:833)
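
For reference, a minimal sketch of the call that produced the trace above, assuming plain HTTP on localhost:9200 with no authentication (which matches the elasticsearch.yml shared further down); adjust the host, port, and node filter as needed:

# Fetch hot threads for all nodes (or a single node via
# /_nodes/<node-id>/hot_threads). The parameters mirror the header of the
# output above: 3 busiest threads, 500ms sampling interval, idle threads ignored.
import requests

resp = requests.get(
    "http://localhost:9200/_nodes/hot_threads",
    params={"threads": 3, "interval": "500ms", "ignore_idle_threads": "true"},
    timeout=30,
)
resp.raise_for_status()
print(resp.text)  # plain-text stack traces, like the ones pasted above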

Version details:

"version" : {
    "number" : "7.15.1",
    "build_flavor" : "default",
    "build_type" : "rpm",
    "build_hash" : "83c34f456ae29d60e94d886e455e6a3409bba9ed",
    "build_date" : "2021-10-07T21:56:19.031608185Z",
    "build_snapshot" : false,
    "lucene_version" : "8.9.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"

Please suggest what can be done to fix this and/or find the underlying cause.


What is the full output of the cluster stats API?

What is the hardware specification of the cluster? What type of hardware is it deployed on?

Version 7.15 is also a bit old, so I would recommend upgrading at least to the latest 7.17 release.

Output of the cluster stats API:

{"_nodes":{"total":33,"successful":33,"failed":0},"cluster_name":"cluster-prod-name-elk","cluster_uuid":"udjj8hC8Qr-QhArZbwFUeg","timestamp":1678946525600,"status":"green","indices":{"count":78,"shards":{"total":3090,"primaries":1545,"replication":1.0,"index":{"shards":{"min":2,"max":240,"avg":39.61538461538461},"primaries":{"min":1,"max":120,"avg":19.807692307692307},"replication":{"min":1.0,"max":1.0,"avg":1.0}}},"docs":{"count":64452063811,"deleted":1097062},"store":{"size_in_bytes":76169848727964,"total_data_set_size_in_bytes":76169848727964,"reserved_in_bytes":0},"fielddata":{"memory_size_in_bytes":24200,"evictions":0},"query_cache":{"memory_size_in_bytes":13102846776,"total_count":808307387,"hit_count":34352556,"miss_count":773954831,"cache_size":2099350,"cache_count":2790770,"evictions":691420},"completion":{"size_in_bytes":0},"segments":{"count":83697,"memory_in_bytes":625824948,"terms_memory_in_bytes":478960368,"stored_fields_memory_in_bytes":64919304,"term_vectors_memory_in_bytes":0,"norms_memory_in_bytes":62374784,"points_memory_in_bytes":0,"doc_values_memory_in_bytes":19570492,"index_writer_memory_in_bytes":9815839852,"version_map_memory_in_bytes":2389,"fixed_bit_set_memory_in_bytes":1621936,"max_unsafe_auto_id_timestamp":1678924807011,"file_sizes":{}},"mappings":{"field_types":[{"name":"boolean","count":39,"index_count":19,"script_count":0},{"name":"date","count":117,"index_count":51,"script_count":0},{"name":"float","count":58,"index_count":9,"script_count":0},{"name":"half_float","count":40,"index_count":10,"script_count":0},{"name":"integer","count":110,"index_count":5,"script_count":0},{"name":"keyword","count":890,"index_count":51,"script_count":0},{"name":"long","count":938,"index_count":51,"script_count":0},{"name":"nested","count":19,"index_count":9,"script_count":0},{"name":"object","count":803,"index_count":51,"script_count":0},{"name":"text","count":425,"index_count":46,"script_count":0},{"name":"version","count":4,"index_count":4,"script_count":0}],"runtime_field_types":[]},"analysis":{"char_filter_types":[],"tokenizer_types":[],"filter_types":[],"analyzer_types":[],"built_in_char_filters":[],"built_in_tokenizers":[],"built_in_filters":[],"built_in_analyzers":[]},"versions":[{"version":"7.15.1","index_count":78,"primary_shard_count":1545,"total_primary_bytes":38058168977373}]},"nodes":{"count":{"total":33,"coordinating_only":0,"data":30,"data_cold":0,"data_content":0,"data_frozen":0,"data_hot":0,"data_warm":0,"ingest":3,"master":3,"ml":0,"remote_cluster_client":0,"transform":0,"voting_only":0},"versions":["7.15.1"],"os":{"available_processors":486,"allocated_processors":486,"names":[{"name":"Linux","count":33}],"pretty_names":[{"pretty_name":"Amazon Linux 2","count":33}],"architectures":[{"arch":"aarch64","count":33}],"mem":{"total_in_bytes":1015636779008,"free_in_bytes":14066044928,"used_in_bytes":1001570734080,"free_percent":1,"used_percent":99}},"process":{"cpu":{"percent":1206},"open_file_descriptors":{"min":1053,"max":3670,"avg":3362}},"jvm":{"max_uptime_in_millis":27949967423,"versions":[{"version":"17","vm_name":"OpenJDK 64-Bit Server VM","vm_version":"17+35","vm_vendor":"Eclipse 
Adoptium","bundled_jdk":true,"using_bundled_jdk":true,"count":33}],"mem":{"heap_used_in_bytes":254698972600,"heap_max_in_bytes":507963768832},"threads":5187},"fs":{"total_in_bytes":103239549038592,"free_in_bytes":26135047434240,"available_in_bytes":26135047434240},"plugins":[],"network_types":{"transport_types":{"netty4":33},"http_types":{"netty4":33}},"discovery_types":{"zen":33},"packaging_types":[{"flavor":"default","type":"rpm","count":33}],"ingest":{"number_of_pipelines":2,"processor_stats":{"gsub":{"count":0,"failed":0,"current":0,"time_in_millis":0},"script":{"count":0,"failed":0,"current":0,"time_in_millis":0}}}}}

The cluster is running on AWS VMs.
3 master nodes: 2 CPU, 4 GB RAM
30 data nodes: 16 CPU, 32 GB RAM

We can look at upgrading, but some concrete evidence of what is actually causing this would help.

What instance type are you using for the master nodes? Is it by any chance an instance type that relies on CPU credits, e.g. t2 or t3 instances?

No, Christian, these are c6g series instances. There is no CPU steal time or anything of that sort.

OK, good to rule that out. I do not think I have ever seen licensing processing in hot threads, so will need to leave this for someone else who has more familiarity with that area. I have never used that instance type, so do not know if there are any peculiarities related to it.

Hi @aayush_kumar Welcome to the community!

Interesting....

c6gs are awesome machine types. That said, I have never seen licensing in hot threads either.

I assume you are running c6gd?

Looks like it's spending a lot of time decrypting the license which I have not seen before.

Did you recently do anything with the license?

What type of license do you run?


For master nodes: c6g.large
For data nodes: c6g.4xlarge

This is open-source Elasticsearch and I am not using a license anywhere.

elasticsearch.yml

cluster.name: xxxxx
node.name: xxxx
path.data: /xx/elasticsearch
path.logs: /var/logs/elasticsearch
network.host: 0.0.0.0
http.port: 9200
discovery.zen.ping.unicast.hosts: ["xx","xx","xx"]
cluster.initial_master_nodes: ["xx","xx","xx"]
transport.host: xx
transport.tcp.port: 9300
thread_pool.search.queue_size: 10000

node.roles: [ master]

xpack.security.enabled: false

The last version of Elasticsearch under the Apache open source license was 7.10, so as you are running version 7.15 you are using the default distribution with the Basic license.
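
If you want to confirm what is active, the license API can be queried directly; a minimal sketch under the same unauthenticated-endpoint assumption (on the default distribution without a paid license it should report type "basic"):

# Check the active license type and status.
import requests

lic = requests.get("http://localhost:9200/_license", timeout=30).json()["license"]
print(lic["type"], lic["status"], lic.get("expiry_date", "no expiry"))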
