Very high CPU usage for search

I have a cluster (ES version 7.0.3) with 3 nodes on Ubuntu servers. I have not declared any node as a dedicated master or data node; by default one node is elected master and the others act as data nodes. I have only one index on this cluster, with 1 primary shard and 1 replica. The primary shard holds 480 GB of data and the replica shard holds another 480 GB.

This application has been running for the last 3 years, during which the data has grown from 50 GB to 480 GB.
Each server has 8 CPU cores and 16 GB RAM, with the settings refresh_interval = 2s, number_of_shards = 1 and number_of_replicas = 1.

When my application went down due to traffic load, I upgraded all three servers from 8 cores / 16 GB RAM to 16 cores / 32 GB RAM, and afterwards added even more traffic. The application is now working fine on this cluster.

First Cluster Config Details

Node 1 config
	cluster.name: es-cluster-mi
	node.name: master-01-mi
	path.data: /var/lib/elasticsearch
	path.logs: /var/log/elasticsearch
	network.host: [_local_,_site_]
	http.port: 9200
	discovery.seed_hosts: ["17.17.17.1", "17.17.17.2", "17.17.17.3"]
	cluster.initial_master_nodes: ["17.17.17.1"]

Node 2 Config

	cluster.name: es-cluster-mi
	node.name: data-01-mi
	path.data: /var/lib/elasticsearch
	path.logs: /var/log/elasticsearch
	network.host: [_local_,_site_]
	http.port: 9200
	discovery.seed_hosts: ["17.17.17.1", "17.17.17.2", "17.17.17.3"]
	cluster.initial_master_nodes: ["17.17.17.1"]

Node 3 Config

	cluster.name: es-cluster-mi
	node.name: data-02-mi
	path.data: /var/lib/elasticsearch
	path.logs: /var/log/elasticsearch
	network.host: [_local_,_site_]
	http.port: 9200
	discovery.seed_hosts: ["17.17.17.1", "17.17.17.2", "17.17.17.3"]
	cluster.initial_master_nodes: ["17.17.17.1"]

On the other side, I decided to create a proper Elasticsearch cluster (version 7.0.3) with 3 dedicated master nodes and 3 data nodes.
Each master node has 4 cores and 8 GB RAM, and each data node has 8 cores and 16 GB RAM.

But now, when I move the production application traffic to the new cluster, response times increase and CPU usage hits 100% on all data nodes.

I am routing all requests to the masters through a load balancer, with nginx on each master server because the ALB does not support username/password authentication. Below is the config of the new cluster.
On the new cluster I have modified the settings to refresh_interval = 40s, number_of_shards = 10 and number_of_replicas = 1.

New Cluster Config

Master 1
	cluster.name: new-es-cluster-mi-01
	node.name: new-es-cluster-master-01

	node.master: true
	node.data: false

	path.data: /var/lib/elasticsearch
	path.logs: /var/log/elasticsearch

	network.host: 0.0.0.0
	http.port: 9200

	thread_pool:
	    search:
	        size: 15
	        queue_size: 1500
	        min_queue_size: 1000
	        max_queue_size: 2000

	discovery.seed_hosts: ["17.18.18.1", "17.18.18.2", "17.18.18.3", "17.18.18.4", "17.18.18.5", "17.18.18.6"]
	cluster.initial_master_nodes: ["17.18.18.1", "17.18.18.2", "17.18.18.3"]
Master 2 
	cluster.name: new-es-cluster-mi-01
	node.name: new-es-cluster-master-02

	node.master: true
	node.data: false

	path.data: /var/lib/elasticsearch
	path.logs: /var/log/elasticsearch

	network.host: 0.0.0.0
	http.port: 9200

	thread_pool:
	    search:
	        size: 15
	        queue_size: 1500
	        min_queue_size: 1000
	        max_queue_size: 2000

	discovery.seed_hosts: ["17.18.18.1", "17.18.18.2", "17.18.18.3", "17.18.18.4", "17.18.18.5", "17.18.18.6"]
	cluster.initial_master_nodes: ["17.18.18.1", "17.18.18.2", "17.18.18.3"]
Master 3
	cluster.name: new-es-cluster-mi-01
	node.name: new-es-cluster-master-03

	node.master: true
	node.data: false

	path.data: /var/lib/elasticsearch
	path.logs: /var/log/elasticsearch

	network.host: 0.0.0.0
	http.port: 9200

	thread_pool:
	    search:
	        size: 15
	        queue_size: 1500
	        min_queue_size: 1000
	        max_queue_size: 2000

	discovery.seed_hosts: ["17.18.18.1", "17.18.18.2", "17.18.18.3", "17.18.18.4", "17.18.18.5", "17.18.18.6"]
	cluster.initial_master_nodes: ["17.18.18.1", "17.18.18.2", "17.18.18.3"]
Data Node 1
	cluster.name: new-es-cluster-mi-01
	node.name: new-es-cluster-data-01

	node.master: false
	node.data: true

	path.data: /var/lib/elasticsearch
	path.logs: /var/log/elasticsearch

	network.host: 0.0.0.0
	http.port: 9200

	discovery.seed_hosts: ["17.18.18.1", "17.18.18.2", "17.18.18.3", "17.18.18.4", "17.18.18.5", "17.18.18.6"]
	cluster.initial_master_nodes: ["17.18.18.1", "17.18.18.2", "17.18.18.3"]

	thread_pool:
	    search:
	        size: 30
	        queue_size: 3000
	        min_queue_size: 3000
	        max_queue_size: 5000
Data Node 2

	cluster.name: new-es-cluster-mi-01
	node.name: new-es-cluster-data-02

	node.master: false
	node.data: true

	path.data: /var/lib/elasticsearch
	path.logs: /var/log/elasticsearch

	network.host: 0.0.0.0
	http.port: 9200

	discovery.seed_hosts: ["17.18.18.1", "17.18.18.2", "17.18.18.3", "17.18.18.4", "17.18.18.5", "17.18.18.6"]
	cluster.initial_master_nodes: ["17.18.18.1", "17.18.18.2", "17.18.18.3"]

	thread_pool:
	    search:
	        size: 30
	        queue_size: 3000
	        min_queue_size: 3000
	        max_queue_size: 5000
Data Node 3
	cluster.name: new-es-cluster-mi-01
	node.name: new-es-cluster-data-03

	node.master: false
	node.data: true

	path.data: /var/lib/elasticsearch
	path.logs: /var/log/elasticsearch

	network.host: 0.0.0.0
	http.port: 9200
	thread_pool:
	    search:
	        size: 30
	        queue_size: 3000
	        min_queue_size: 3000
	        max_queue_size: 5000
	 
	discovery.seed_hosts: ["17.18.18.1", "17.18.18.2", "17.18.18.3", "17.18.18.4", "17.18.18.5", "17.18.18.6"]
	cluster.initial_master_nodes: ["17.18.18.1", "17.18.18.2", "17.18.18.3"]

Please provide me a solution for the high CPU usage.

If you create a new cluster you should do this using at least the latest 7.17 release. The version you are running is extremely old and has been EOL for a long time.

In version 7.17 you can use the security features provided with the free Basic license to secure the cluster and do not need to use nginx for security anymore.
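As a rough sketch of what that looks like on 7.17 (the certificate path here is a placeholder; transport TLS certificates must be generated first, e.g. with elasticsearch-certutil, and this is not taken from your cluster):

```yaml
# elasticsearch.yml - Basic-license security sketch for 7.17.
# Enables authentication and the TLS required between nodes.
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12
```

With this in place, built-in user passwords can be set with the elasticsearch-setup-passwords tool and clients authenticate directly against Elasticsearch, removing the need for nginx as an authentication layer.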

Also note that all requests should be sent to the data nodes and not any of the dedicated master nodes.

How have you arrived at these values? The default values in Elasticsearch are quite good, so I would recommend you remove these settings. Increasing the threadpool size can cause a lot of problems if you do not have enough CPU cores to support it and could very well be a factor in your high CPU usage.
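For reference, Elasticsearch 7.x derives the default search thread pool size from the CPU count, so the defaults already scale with your hardware. A small sketch of the documented formula (the core counts are the ones mentioned in this thread):

```python
# Default "search" thread pool size in Elasticsearch 7.x:
# int((allocated_processors * 3) / 2) + 1, with a default queue_size of 1000.
def default_search_pool_size(cores: int) -> int:
    return (cores * 3) // 2 + 1

# The 8-core data nodes already get 13 search threads by default,
# and the upgraded 16-core nodes get 25 - without any explicit setting.
print(default_search_pool_size(8))   # 13
print(default_search_pool_size(16))  # 25
```

Forcing `size: 30` on an 8-core node more than doubles the default, which leads to heavy context switching rather than more throughput.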

We moved to version 7.0.3 because that is the version currently running in production.

To verify master health, we have implemented nginx on the cluster with the API endpoint "/elasticsearch_health" on port 80; all other requests come in normally on "/".

server {
    listen 80;

    location /elasticsearch_health {
        proxy_pass https://es-master-1.com:9200/_cat/health;

        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        proxy_intercept_errors on;
        error_page 404 =500 @handle;
        proxy_set_header Authorization "Basic ZWxhc3RpkU0M1cWpTMnliT1I1eA==";
    }


    location / {
        proxy_pass https://es-master-1.com:9200/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        error_page 404 =500 @handle;

        proxy_set_header Authorization $http_authorization;
    }

    location @handle {
        if ($upstream_http_some_header = "success") {
            return 200;
        }
        return 500;
    }
}

We implemented the ALB to avoid a single point of failure across the multiple masters: if one master goes down, the ALB removes that instance from the target group and requests are redirected to another master.

If instead I sent requests to a single data node and that data node went down, all incoming requests would be rejected.

And I changed the default thread_pool settings because lots of search requests were being rejected.

For security, we are using X-Pack.

Please help me to solve the problem.

I recommend the following:

  • Upgrade to Elasticsearch 7.17 and enable the security features available as part of the free Basic license. A lot of performance and stability improvements have been made since the very old version you are currently using.
  • Make sure all search and index requests are routed directly to one of the data nodes and not the dedicated master nodes.
  • Remove the setting for thread pool size as this is likely to reduce performance rather than increase it. Note that this is one of the changes that differ between the environment with issues and the one without. You may keep the increased queue sizes, but be aware that may lead to increased heap pressure and does not really make queries any faster.
  • If queries are slow and the queues keep filling up, try to identify what is causing this. Is it expensive queries? Are you using slow storage (a very common bottleneck) that has become the limiting factor? Are your shards too large or too small?
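One way to identify the expensive queries is the search slow log. A sketch of enabling it on the index via the settings API (the index name matches the one in this thread; the thresholds are arbitrary examples you should tune to your latency targets):

```
PUT /orders/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}
```

Queries exceeding these thresholds are then written to the node's slow log file, including the query body, which makes the worst offenders easy to spot.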

I will remove the thread_pool settings, but is there any other tuning setting for high CPU usage?

The previous cluster had 480 GB of data on 1 primary shard and 480 GB on 1 replica. Is it good to have that much data on a single shard?

Index Setting

{
  "order" : {
    "settings" : {
      "index" : {
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "refresh_interval" : "2s",
        "number_of_shards" : "1",
        "max_result_window" : "50000",
        "number_of_replicas" : "1",
        "uuid" : "NMrZq-SoTJ2m7-jiSmN3Jg",
        "version" : {
          "created" : "7130199"
        }
      }
    }
  }
}

In the new cluster I have 10 primary shards and 1 replica of each, meaning 20 shards in total, and each shard holds about 48 GB of data. Is that a good size for the shards?

New Cluster Index Setting

{
  "orders": {
    "settings": {
      "index": {
        "routing": {
          "allocation": {
            "include": {
              "_tier_preference": "data_content"
            }
          }
        },
        "refresh_interval": "40s",
        "number_of_shards": "10",
        "max_result_window": "40000",
        "creation_date": "1703067739814",
        "number_of_replicas": "1",
        "uuid": "V_zUIhRmQ-W6x-E2NFOdHg",
        "version": {
          "created": "7130199"
        }
      }
    }
  }
}
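As a quick arithmetic check on the shard sizing above (a sketch using the figures from this thread: 480 GB of primary data across 10 primary shards):

```python
# Per-shard size = total primary data / number of primary shards.
# The commonly cited guideline for 7.x is to keep shards between
# roughly 10 GB and 50 GB.
def shard_size_gb(total_primary_gb: float, primary_shards: int) -> float:
    return total_primary_gb / primary_shards

size = shard_size_gb(480, 10)
print(size)              # 48.0 GB per primary shard
print(10 <= size <= 50)  # True - within the usual guideline
```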

That sounds a lot better than the 480 GB shard in the first cluster. If this gives acceptable query performance it sounds like a good shard size.

It seems like you have increased this setting as well. Be aware that this also leads to increased heap usage. Given all these changes affecting heap usage I would recommend you monitor heap usage and check the Elasticsearch logs for any evidence of long or frequent GC. If you see this you may need to reduce the settings or increase the heap size.

That depends on what is causing the high CPU usage.

What type of storage are you using? What do the I/O stats look like, e.g. await and disk utilisation?

What type of queries are you running? Are you running anything very expensive like e.g. wildcard queries? What does your data and mappings look like?

Are your queries returning a large result set, which results in lots of small disk I/O to gather the results?

Attaching some load stats.


We are using SSD - GP3

Please have a look and suggest what to do. Yes, we are using wildcard queries.

Looks like you are having disk I/O issues so you may want to try to improve disk performance, e.g. by adding provisioned IOPS.

Wildcard queries are the most inefficient queries you can run in Elasticsearch, especially leading wildcard queries are very, very expensive. If you are using this a lot I am not surprised you are seeing high CPU usage.

You can try to capture the output of the hot threads API while CPU usage is high and this should give you an idea what the cause is.
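For example (the host, port and credentials below are placeholders for your own cluster):

```shell
# Capture the busiest threads per node while CPU usage is high;
# repeat a few times to see which code paths keep showing up.
curl -s -u elastic:password \
  "http://localhost:9200/_nodes/hot_threads?threads=5" > hot_threads.txt
```

If wildcard queries are the culprit, the captured stack traces will typically be dominated by Lucene term-enumeration work during query rewriting.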

I would recommend trying to improve the efficiency of your queries. What type of fields are you running wildcard queries on? If it is large keyword fields you may want to consider the wildcard field type.
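A sketch of what that mapping change could look like (the index and field names here are placeholders, not taken from your mappings; the wildcard field type is available from 7.9 onward):

```
PUT /orders-wildcard-test
{
  "mappings": {
    "properties": {
      "description": { "type": "wildcard" }
    }
  }
}
```

The wildcard field type indexes n-grams of the value, which makes wildcard and leading-wildcard queries far cheaper than running them against a plain keyword field, at the cost of some extra index size.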

I have increased the RAM from 16 GB to 32 GB but have not upgraded the CPU (still 8 cores).
And I have not changed the value in jvm.options, meaning that after increasing the machine's RAM the heap size is still 8 GB.

All problems are sorted out. Max CPU is 35%, max disk IOPS is 1400 and max I/O utilization is 35%.

The graphs above cover the last 24 hours.

