High CPU utilization on two nodes due to DSL Queries and slower query response

Hi,

I am using Elasticsearch version 7.11.2. On the two nodes that hold a 102 GB index (51 GB primary + 51 GB replica), multiple DSL queries running in the background are causing high CPU utilization.

Autoscaling is enabled on this cluster, but it does not seem to scale up.

Any thoughts on how to mitigate the slow DSL query responses and reduce the CPU load would be highly appreciated.

Adding the nodes, os, process, gc, and hot_threads output below for reference.

GET /_cat/nodes?v=true
ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
IP1           45         100 100    4.64    4.31     4.22 himrst    -      instance-0
IP2           75         100 100    2.98    2.96     2.81 himrst    *      instance-1


GET _cat/shards
index_name                       0 r STARTED 153619336 51.1gb IP1  instance-0
index_name                       0 p STARTED 153619336   51gb IP2   instance-1

GET _nodes/stats
 "os" : {
        "timestamp" : 1640075412812,
        "cpu" : {
          "percent" : 100,
          "load_average" : {
            "1m" : 5.09,
            "5m" : 4.82,
            "15m" : 4.66
          }
        },
        "mem" : {
          "total_in_bytes" : 8589934592,
          "free_in_bytes" : 53248,
          "used_in_bytes" : 8589881344,
          "free_percent" : 0,
          "used_percent" : 100
        },
"process" : {
        "timestamp" : 1640075412894,
        "open_file_descriptors" : 1519,
        "max_file_descriptors" : 1048576,
        "cpu" : {
          "percent" : 3,
          "total_in_millis" : 1682949290
        },
        "mem" : {
          "total_virtual_in_bytes" : 102227439616
        }
      },
"jvm" : {
        "timestamp" : 1640075412902,
        "uptime_in_millis" : 1519168014,
        "mem" : {
          "heap_used_in_bytes" : 2019352576,
          "heap_used_percent" : 47,
          "heap_committed_in_bytes" : 4294967296,
          "heap_max_in_bytes" : 4294967296,
          "non_heap_used_in_bytes" : 307331792,
          "non_heap_committed_in_bytes" : 316825600,
          "pools" : {
            "young" : {
              "used_in_bytes" : 406847488,
              "max_in_bytes" : 0,
              "peak_used_in_bytes" : 2573207552,
              "peak_max_in_bytes" : 0
            },
            "old" : {
              "used_in_bytes" : 1606213632,
              "max_in_bytes" : 4294967296,
              "peak_used_in_bytes" : 3262963712,
              "peak_max_in_bytes" : 4294967296
            },
            "survivor" : {
              "used_in_bytes" : 6291456,
              "max_in_bytes" : 0,
              "peak_used_in_bytes" : 322961408,
              "peak_max_in_bytes" : 0
            }
          }
        },
"gc" : {
          "collectors" : {
            "young" : {
              "collection_count" : 265734,
              "collection_time_in_millis" : 41312494
            },
            "old" : {
              "collection_count" : 0,
              "collection_time_in_millis" : 0
            }
          }
        },


::: {instance-0}{<..>}{<..>}{<..>}{<..>}{himrst}{logical_availability_zone=<>, server_name=instance-0.<>, availability_zone=<>, xpack.installed=true, data=hot, instance_configuration=<>, transform.node=true, region=<>}
   Hot threads at 2021-12-21T08:20:30.432Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
   
   27.9% (139.5ms out of 500ms) cpu usage by thread 'elasticsearch[instance-0][search][T#3]'
     5/10 snapshots sharing following 33 elements
       app//org.elasticsearch.common.util.LongLongHash.add(LongLongHash.java:129)
       app//org.elasticsearch.search.aggregations.bucket.terms.LongKeyedBucketOrds$FromMany.add(LongKeyedBucketOrds.java:199)

 27.1% (135.7ms out of 500ms) cpu usage by thread 'elasticsearch[instance-0][search][T#2]'
     3/10 snapshots sharing following 29 elements
       app//org.apache.lucene.search.ConjunctionDISI.nextDoc(ConjunctionDISI.java:253)
       app//org.apache.lucene.search.DisjunctionDISIApproximation.nextDoc(DisjunctionDISIApproximation.java:55)
       app//org.apache.lucene.search.ConjunctionDISI.nextDoc(ConjunctionDISI.java:253)
       app//org.apache.lucene.search.Weight$DefaultBulkScorer.scoreRange(Weight.java:254)

24.4% (121.9ms out of 500ms) cpu usage by thread 'elasticsearch[instance-0][search][T#1]'
     2/10 snapshots sharing following 60 elements
       java.base@15.0.1/sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
       java.base@15.0.1/sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:54)
       java.base@15.0.1/sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:274)
       java.base@15.0.1/sun.nio.ch.IOUtil.read(IOUtil.java:245)
       java.base@15.0.1/sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:815)
       java.base@15.0.1/sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:800)
       app//org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:170)
       app//org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:152)
       app//org.apache.lucene.store.DataInput.skipBytes(DataInput.java:338)

 3.5% (17.6ms out of 500ms) cpu usage by thread 'elasticsearch[instance-1][snapshot][T#1]'
     10/10 snapshots sharing following 59 elements
       java.base@15.0.1/sun.nio.ch.Net.poll(Native Method)

Even simple search queries run from Dev Tools are taking a long time.

GET index_name/_search
{
  "took" : 32677,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  }
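
A minimal sketch of how a slow query could be run with profiling enabled, to see which parts of the query and collection the time is going to (index_name as above; the match_all query is just a placeholder for the real query):

GET /index_name/_search
{
  "profile": true,
  "query": {
    "match_all": {}
  }
}

The response then includes a per-shard breakdown of query and aggregation timings alongside the usual hits.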

What is the output from the _cluster/stats?pretty&human API?

What is the query?

Hi @warkolm
Please find the _cluster/stats?pretty&human response below -

 "cluster_name" : "<>",
  "cluster_uuid" : "<>",
  "timestamp" : 1640150654505,
  "status" : "green",
  "indices" : {
    "count" : 128,
    "shards" : {
      "total" : 256,
      "primaries" : 128,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 2,
          "avg" : 2.0
        },
        "primaries" : {
          "min" : 1,
          "max" : 1,
          "avg" : 1.0
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : 434460296,
      "deleted" : 38522054
    },
    "store" : {
      "size" : "294.2gb",
      "size_in_bytes" : 315969678734,
      "reserved" : "0b",
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size" : "1.6mb",
      "memory_size_in_bytes" : 1730224,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "280.6mb",
      "memory_size_in_bytes" : 294322457,
      "total_count" : 376826998,
      "hit_count" : 40385594,
      "miss_count" : 336441404,
      "cache_size" : 3008,
      "cache_count" : 11621,
      "evictions" : 8613
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 709,
      "memory" : "10.1mb",
      "memory_in_bytes" : 10608184,
      "terms_memory" : "6.8mb",
      "terms_memory_in_bytes" : 7147312,
      "stored_fields_memory" : "459.2kb",
      "stored_fields_memory_in_bytes" : 470264,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "853.5kb",
      "norms_memory_in_bytes" : 873984,
      "points_memory" : "0b",
      "points_memory_in_bytes" : 0,
      "doc_values_memory" : "2mb",
      "doc_values_memory_in_bytes" : 2116624,
      "index_writer_memory" : "19.7mb",
      "index_writer_memory_in_bytes" : 20715608,
      "version_map_memory" : "0b",
      "version_map_memory_in_bytes" : 0,
      "fixed_bit_set" : "18kb",
      "fixed_bit_set_memory_in_bytes" : 18504,
      "max_unsafe_auto_id_timestamp" : 1639624332491,
      "file_sizes" : { }
    },
      "mem" : {
        "total" : "26.9gb",
        "total_in_bytes" : 28989980672,
        "free" : "4.8gb",
        "free_in_bytes" : 5194235904,
        "used" : "22.1gb",
        "used_in_bytes" : 23795744768,
        "free_percent" : 18,
        "used_percent" : 82
      }
    },
 "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [ ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [ ],
      "built_in_filters" : [ ],
      "built_in_analyzers" : [
        {
          "name" : "whitespace",
          "count" : 1,
          "index_count" : 1
        }
      ]
    },
    "versions" : [
      {
        "version" : "7.11.2",
        "index_count" : 128,
        "primary_shard_count" : 128,
        "total_primary_size" : "147.1gb",
        "total_primary_bytes" : 157999173510
      }
    ]
  },
  "nodes" : {
    "count" : {
      "total" : 6,
      "coordinating_only" : 0,
      "data" : 0,
      "data_cold" : 1,
      "data_content" : 2,
      "data_hot" : 2,
      "data_warm" : 2,
      "ingest" : 2,
      "master" : 3,
      "ml" : 0,
      "remote_cluster_client" : 5,
      "transform" : 2,
      "voting_only" : 1
    },
    "versions" : [
      "7.11.2"
    ],
    "os" : {
      "available_processors" : 12,
      "allocated_processors" : 12,
      "names" : [
        {
          "name" : "Linux",
          "count" : 6
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "CentOS Linux 8",
          "count" : 6
        }
      ],
      "mem" : {
        "total" : "26.9gb",
        "total_in_bytes" : 28989980672,
        "free" : "4.8gb",
        "free_in_bytes" : 5194235904,
        "used" : "22.1gb",
        "used_in_bytes" : 23795744768,
        "free_percent" : 18,
        "used_percent" : 82
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 4
      },
      "open_file_descriptors" : {
        "min" : 979,
        "max" : 5446,
        "avg" : 1952
      }
    },
    "jvm" : {
      "max_uptime" : "18.4d",
      "max_uptime_in_millis" : 1594710489,
      "versions" : [
        {
          "version" : "15.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "15.0.1+9",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 6
        }
      ],
      "mem" : {
        "heap_used" : "6.4gb",
        "heap_used_in_bytes" : 6912917528,
        "heap_max" : "12.6gb",
        "heap_max_in_bytes" : 13530824704
      },
      "threads" : 375
    },
    "fs" : {
      "total" : "2tb",
      "total_in_bytes" : 2293342666752,
      "free" : "1.7tb",
      "free_in_bytes" : 1976492556288,
      "available" : "1.7tb",
      "available_in_bytes" : 1976492556288
    },
    "plugins" : [
      {
        "name" : "repository-s3",
        "version" : "7.11.2",
        "elasticsearch_version" : "7.11.2",
        "java_version" : "1.8",
        "description" : "The S3 repository plugin adds S3 repositories",
        "classname" : "org.elasticsearch.repositories.s3.S3RepositoryPlugin",
        "extended_plugins" : [ ],
        "has_native_controller" : false,
        "licensed" : false,
        "type" : "isolated"
      }
    ],

Mostly I am running complex aggregation queries in the background, fetching data from the index over ranges that vary from 1 day to 90 days.
I suspect an ILM policy needs to be configured on the index for faster execution of the DSL queries; currently none is configured.
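
For reference, a minimal sketch of what such a policy might look like (the policy name, rollover thresholds, and retention age below are hypothetical placeholders, not values from this cluster):

PUT _ilm/policy/sample_rollover_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "30d"
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

For rollover to take effect, the policy would also need to be attached to the index via an index template and a write alias (or a data stream).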

Sample query-

GET /index_name/_count
{
  "query": {
    "range": {
        "@timestamp": {
            "format": "strict_date_optional_time",
             "lt": "2021-10-30"
        }
    }
  }
}

response-

{
  "count" : 34299790,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0,
    "took": 14309 
  }
}

It looks like you are using very small heaps. Is there anything in the logs on these nodes that indicate slow or frequent garbage collection? What does iowait look like on these nodes? What type of storage are you using? Local SSDs?

Hi @Christian_Dahlqvist

Nope. But multiple complex DSL queries hitting the index simultaneously (in the background) increase the CPU load and result in slower processing.
By the way, all 100+ GB of index data is in the hot tier; no ILM policy is assigned.

I attached the /_nodes/stats output in my first message; please let me know if you need stats for the individual nodes. Both are at 100% CPU load.

It's hosted on elastic.co.
Hot data and content tier:
Size per zone: 240 GB storage and 8 GB RAM
Availability zones: 2

Autoscaling is enabled, but since I am not hitting the maximum storage threshold it never triggers, and the cluster is hitting 100% CPU load with only ~100 GB of storage used.

Maybe I need to add more nodes, or downsize the existing nodes from 240 GB to 60 GB or 30 GB, since it is not autoscaling.

@stephenb, maybe you can shed some light here on the first two queries.
(Sorry for pinging you in this post, but I think it's better to get these queries resolved in a single thread rather than creating a new one.)
Queries I have-

  • Can I manually add an additional node of smaller storage size, 60 GB (so that it can autoscale), with 2.2 CPU per node, to the existing cluster for performance optimization?

  • Can I downscale my current cluster, since it is not autoscaling, so that once storage hits its threshold it will start autoscaling?

  • If I use a smaller VM (30 GB storage, 2.2 CPU, 1 GB RAM), new nodes will get created during autoscaling. Will that increase the number of shards for an index so that they can be distributed to the new nodes?
    ans - primary shards will not increase; we can add more replica shards (see the sketch below).
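
A minimal sketch of what bumping the replica count would look like, assuming extra nodes are available to hold the additional copies (index_name and the replica count are placeholders):

PUT /index_name/_settings
{
  "index": {
    "number_of_replicas": 2
  }
}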

Currently we have 2 nodes, with 1 primary and 1 replica shard (one on each node). Both nodes are running at 100% CPU and RAM, and the cluster is unable to autoscale because storage space is not hitting the threshold.

As per my understanding, autoscaling is based on the storage size of the cluster.

Link I referred to: Autoscaling example

Any suggestions for the ideal heap size for this cluster?

Thanks!!!

I did not get a chance to read this in great detail.

But here are 2 quick thoughts.

  1. Autoscaling today is only storage based... trying to solve a CPU utilization issue with autoscaling is probably a poor approach.

  2. Since this seems to be a fairly low storage volume but high CPU utilization, perhaps you should consider a different node profile such as CPU Optimized, where you would get a higher ratio of CPU to storage.

