High CPU usage (100%) on data nodes

Hello,

I am using Elasticsearch and Kibana version 7.10.1.

For the past 3 days, I have noticed that my data nodes are running at 100% CPU.

In my cluster I have:
3 master nodes (2 CPU, 4 GB RAM, 100 GB HDD) node.roles: [ master ]
2 hot data nodes (6 CPU, 8 GB RAM, 400 GB SSD) node.roles: [ data_hot, data_content, ingest, transform ]
1 warm data node (6 CPU, 6 GB RAM, 3 TB HDD) node.roles: [ data_warm, data_content, ingest, transform ]
1 ML node (6 CPU, 8 GB RAM, 200 GB HDD) node.roles: [ ml ]

On my data nodes I am seeing this output in the logs:

[2021-01-11T17:40:33,320][INFO ][o.e.m.j.JvmGcMonitorService] [VSELK-DATA-01] [gc][343145] overhead, spent [349ms] collecting in the last [1s]
[2021-01-11T17:40:36,323][INFO ][o.e.m.j.JvmGcMonitorService] [VSELK-DATA-01] [gc][343148] overhead, spent [353ms] collecting in the last [1s]
[2021-01-11T17:40:37,323][INFO ][o.e.m.j.JvmGcMonitorService] [VSELK-DATA-01] [gc][343149] overhead, spent [463ms] collecting in the last [1s]
[2021-01-11T17:40:38,324][INFO ][o.e.m.j.JvmGcMonitorService] [VSELK-DATA-01] [gc][343150] overhead, spent [315ms] collecting in the last [1s]
[2021-01-11T17:40:56,412][INFO ][o.e.m.j.JvmGcMonitorService] [VSELK-DATA-01] [gc][343168] overhead, spent [272ms] collecting in the last [1s]
[2021-01-11T17:40:59,491][INFO ][o.e.m.j.JvmGcMonitorService] [VSELK-DATA-01] [gc][343171] overhead, spent [370ms] collecting in the last [1s]
[2021-01-11T17:41:00,643][INFO ][o.e.m.j.JvmGcMonitorService] [VSELK-DATA-01] [gc][343172] overhead, spent [453ms] collecting in the last [1.1s]
[2021-01-11T17:41:01,649][INFO ][o.e.m.j.JvmGcMonitorService] [VSELK-DATA-01] [gc][343173] overhead, spent [347ms] collecting in the last [1s]
[2021-01-11T17:41:09,738][INFO ][o.e.m.j.JvmGcMonitorService] [VSELK-DATA-01] [gc][343181] overhead, spent [281ms] collecting in the last [1s]
[2021-01-11T17:41:14,840][INFO ][o.e.m.j.JvmGcMonitorService] [VSELK-DATA-01] [gc][343186] overhead, spent [333ms] collecting in the last [1s]
[2021-01-11T17:41:23,039][INFO ][o.e.m.j.JvmGcMonitorService] [VSELK-DATA-01] [gc][343194] overhead, spent [289ms] collecting in the last [1s]
[2021-01-11T17:41:30,163][INFO ][o.e.m.j.JvmGcMonitorService] [VSELK-DATA-01] [gc][343201] overhead, spent [349ms] collecting in the last [1.1s]
[2021-01-11T17:41:31,163][INFO ][o.e.m.j.JvmGcMonitorService] [VSELK-DATA-01] [gc][343202] overhead, spent [327ms] collecting in the last [1s]
[2021-01-11T17:41:32,182][INFO ][o.e.m.j.JvmGcMonitorService] [VSELK-DATA-01] [gc][343203] overhead, spent [324ms] collecting in the last [1s]
[2021-01-11T17:41:34,196][INFO ][o.e.m.j.JvmGcMonitorService] [VSELK-DATA-01] [gc][343205] overhead, spent [430ms] collecting in the last [1s]
[2021-01-11T17:41:35,236][INFO ][o.e.m.j.JvmGcMonitorService] [VSELK-DATA-01] [gc][343206] overhead, spent [279ms] collecting in the last [1s]
[2021-01-11T17:41:36,236][INFO ][o.e.m.j.JvmGcMonitorService] [VSELK-DATA-01] [gc][343207] overhead, spent [285ms] collecting in the last [1s]
[2021-01-11T17:41:41,346][INFO ][o.e.m.j.JvmGcMonitorService] [VSELK-DATA-01] [gc][343212] overhead, spent [254ms] collecting in the last [1s]
[2021-01-11T17:41:42,347][INFO ][o.e.m.j.JvmGcMonitorService] [VSELK-DATA-01] [gc][343213] overhead, spent [301ms] collecting in the last [1s]
[2021-01-11T17:41:57,367][INFO ][o.e.m.j.JvmGcMonitorService] [VSELK-DATA-01] [gc][343228] overhead, spent [311ms] collecting in the last [1s]
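
To put the log lines above in perspective: each one reports how much of the preceding window the JVM spent in garbage collection, and in this excerpt that works out to roughly 25% to 46% of each window. The following is only a minimal sketch for turning such lines into percentages. The function names are made up for illustration, and the regular expression assumes the exact message format shown above.

```python
import re

# Matches the "overhead" messages from JvmGcMonitorService, for example:
# [gc][343145] overhead, spent [349ms] collecting in the last [1s]
GC_LINE = re.compile(
    r"\[gc\]\[\d+\] overhead, spent \[(?P<spent>[\d.]+)(?P<spent_unit>ms|s)\] "
    r"collecting in the last \[(?P<window>[\d.]+)(?P<window_unit>ms|s)\]"
)


def to_millis(value, unit):
    """Convert a value logged in 's' or 'ms' to milliseconds."""
    return float(value) * (1000.0 if unit == "s" else 1.0)


def gc_overhead_percentages(lines):
    """Return the share of each window spent in GC, as a percentage."""
    percentages = []
    for line in lines:
        match = GC_LINE.search(line)
        if match is None:
            continue  # not a GC overhead line
        spent = to_millis(match["spent"], match["spent_unit"])
        window = to_millis(match["window"], match["window_unit"])
        percentages.append(100.0 * spent / window)
    return percentages


# Two of the lines posted above, used as sample input.
sample = [
    "[2021-01-11T17:40:33,320][INFO ][o.e.m.j.JvmGcMonitorService] [VSELK-DATA-01] "
    "[gc][343145] overhead, spent [349ms] collecting in the last [1s]",
    "[2021-01-11T17:41:00,643][INFO ][o.e.m.j.JvmGcMonitorService] [VSELK-DATA-01] "
    "[gc][343172] overhead, spent [453ms] collecting in the last [1.1s]",
]

for pct in gc_overhead_percentages(sample):
    print(f"{pct:.1f}% of the window spent in GC")
```

Feeding the whole node log through `gc_overhead_percentages` gives a quick view of whether the GC share is stable or climbing over time.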

Could you please help me solve this issue?

Thanks

What is the output from the _cluster/stats?pretty&human API?
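
For anyone following along, this endpoint can be queried from Kibana Dev Tools or with any HTTP client. Below is a rough sketch using only the Python standard library. The base URL and credentials are placeholders, not values from this cluster, and a cluster using TLS with a private CA would need extra SSL configuration that is not shown here.

```python
import base64
import json
import urllib.request


def cluster_stats_url(base_url):
    """Build the cluster stats URL with the 'pretty' and 'human' flags."""
    return base_url.rstrip("/") + "/_cluster/stats?pretty&human"


def fetch_cluster_stats(base_url, username=None, password=None, timeout=10):
    """Fetch and parse the cluster stats; uses HTTP basic auth if a user is given."""
    request = urllib.request.Request(cluster_stats_url(base_url))
    if username is not None:
        raw = f"{username}:{password}".encode("utf-8")
        token = base64.b64encode(raw).decode("ascii")
        request.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example call (placeholder host and credentials, adjust to your environment):
# stats = fetch_cluster_stats("http://localhost:9200", "elastic", "changeme")
# print(stats["status"])
```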

Thanks for your answer @warkolm,

Here is what I get when I run the command you gave me:

{
  "_nodes" : {
    "total" : 7,
    "successful" : 7,
    "failed" : 0
  },
  "cluster_name" : "SIEM-ELK",
  "cluster_uuid" : "zzKhCl6SSaaOItOGl_OPwA",
  "timestamp" : 1610438441796,
  "status" : "green",
  "indices" : {
    "count" : 65,
    "shards" : {
      "total" : 130,
      "primaries" : 65,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 2,
          "avg" : 2.0
        },
        "primaries" : {
          "min" : 1,
          "max" : 1,
          "avg" : 1.0
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : 227228392,
      "deleted" : 593879
    },
    "store" : {
      "size" : "107.2gb",
      "size_in_bytes" : 115126496193,
      "reserved" : "0b",
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size" : "104.7kb",
      "memory_size_in_bytes" : 107288,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "38.4mb",
      "memory_size_in_bytes" : 40313107,
      "total_count" : 7479341610,
      "hit_count" : 23410800,
      "miss_count" : 7455930810,
      "cache_size" : 3434,
      "cache_count" : 1295714,
      "evictions" : 1292280
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 1144,
      "memory" : "20.7mb",
      "memory_in_bytes" : 21794202,
      "terms_memory" : "10.6mb",
      "terms_memory_in_bytes" : 11153016,
      "stored_fields_memory" : "674.7kb",
      "stored_fields_memory_in_bytes" : 690944,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "351.8kb",
      "norms_memory_in_bytes" : 360320,
      "points_memory" : "0b",
      "points_memory_in_bytes" : 0,
      "doc_values_memory" : "9.1mb",
      "doc_values_memory_in_bytes" : 9589922,
      "index_writer_memory" : "170mb",
      "index_writer_memory_in_bytes" : 178284016,
      "version_map_memory" : "508.2kb",
      "version_map_memory_in_bytes" : 520440,
      "fixed_bit_set" : "43.2mb",
      "fixed_bit_set_memory_in_bytes" : 45382264,
      "max_unsafe_auto_id_timestamp" : 1610409612796,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "alias",
          "count" : 140,
          "index_count" : 4
        },
        {
          "name" : "binary",
          "count" : 15,
          "index_count" : 4
        },
        {
          "name" : "boolean",
          "count" : 478,
          "index_count" : 32
        },
        {
          "name" : "byte",
          "count" : 3,
          "index_count" : 3
        },
        {
          "name" : "constant_keyword",
          "count" : 3,
          "index_count" : 1
        },
        {
          "name" : "date",
          "count" : 679,
          "index_count" : 61
        },
        {
          "name" : "date_nanos",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "date_range",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "double",
          "count" : 292,
          "index_count" : 15
        },
        {
          "name" : "double_range",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "flattened",
          "count" : 25,
          "index_count" : 3
        },
        {
          "name" : "float",
          "count" : 305,
          "index_count" : 17
        },
        {
          "name" : "float_range",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "geo_point",
          "count" : 61,
          "index_count" : 10
        },
        {
          "name" : "geo_shape",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "half_float",
          "count" : 54,
          "index_count" : 14
        },
        {
          "name" : "integer",
          "count" : 230,
          "index_count" : 17
        },
        {
          "name" : "integer_range",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "ip",
          "count" : 345,
          "index_count" : 10
        },
        {
          "name" : "ip_range",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "keyword",
          "count" : 12702,
          "index_count" : 61
        },
        {
          "name" : "long",
          "count" : 6772,
          "index_count" : 41
        },
        {
          "name" : "long_range",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "nested",
          "count" : 43,
          "index_count" : 15
        },
        {
          "name" : "object",
          "count" : 6165,
          "index_count" : 48
        },
        {
          "name" : "scaled_float",
          "count" : 121,
          "index_count" : 1
        },
        {
          "name" : "shape",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "short",
          "count" : 205,
          "index_count" : 4
        },
        {
          "name" : "text",
          "count" : 815,
          "index_count" : 43
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [
        {
          "name" : "pattern_capture",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "analyzer_types" : [
        {
          "name" : "custom",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [
        {
          "name" : "uax_url_email",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_filters" : [
        {
          "name" : "lowercase",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "unique",
          "count" : 1,
          "index_count" : 1
        }
      ],
      "built_in_analyzers" : [
        {
          "name" : "whitespace",
          "count" : 2,
          "index_count" : 2
        }
      ]
    }
  },
  "nodes" : {
    "count" : {
      "total" : 7,
      "coordinating_only" : 0,
      "data" : 0,
      "data_cold" : 0,
      "data_content" : 3,
      "data_hot" : 2,
      "data_warm" : 1,
      "ingest" : 3,
      "master" : 3,
      "ml" : 1,
      "remote_cluster_client" : 0,
      "transform" : 3,
      "voting_only" : 0
    },
    "versions" : [
      "7.10.1"
    ],
    "os" : {
      "available_processors" : 30,
      "allocated_processors" : 30,
      "names" : [
        {
          "name" : "Linux",
          "count" : 7
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Debian GNU/Linux 10 (buster)",
          "count" : 7
        }
      ],
      "mem" : {
        "total" : "40.7gb",
        "total_in_bytes" : 43760119808,
        "free" : "2.6gb",
        "free_in_bytes" : 2841653248,
        "used" : "38.1gb",
        "used_in_bytes" : 40918466560,
        "free_percent" : 6,
        "used_percent" : 94
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 158
      },
      "open_file_descriptors" : {
        "min" : 416,
        "max" : 888,
        "avg" : 617
      }
    },
    "jvm" : {
      "max_uptime" : "13.6d",
      "max_uptime_in_millis" : 1177758426,
      "versions" : [
        {
          "version" : "15.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "15.0.1+9",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 7
        }
      ],
      "mem" : {
        "heap_used" : "8.4gb",
        "heap_used_in_bytes" : 9099716536,
        "heap_max" : "18gb",
        "heap_max_in_bytes" : 19327352832
      },
      "threads" : 708
    },
    "fs" : {
      "total" : "4.1tb",
      "total_in_bytes" : 4564784603136,
      "free" : "4tb",
      "free_in_bytes" : 4422070272000,
      "available" : "3.8tb",
      "available_in_bytes" : 4189774000128
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 7
      },
      "http_types" : {
        "security4" : 7
      }
    },
    "discovery_types" : {
      "zen" : 7
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "deb",
        "count" : 7
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 28,
      "processor_stats" : {
        "conditional" : {
          "count" : 379,
          "failed" : 0,
          "current" : 0,
          "time" : "58ms",
          "time_in_millis" : 58
        },
        "convert" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "date" : {
          "count" : 93,
          "failed" : 0,
          "current" : 0,
          "time" : "13ms",
          "time_in_millis" : 13
        },
        "dot_expander" : {
          "count" : 1023,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "geoip" : {
          "count" : 15795425,
          "failed" : 0,
          "current" : 0,
          "time" : "1.2m",
          "time_in_millis" : 73540
        },
        "grok" : {
          "count" : 186,
          "failed" : 0,
          "current" : 0,
          "time" : "74ms",
          "time_in_millis" : 74
        },
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "json" : {
          "count" : 93,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "remove" : {
          "count" : 279,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "rename" : {
          "count" : 1023,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 93,
          "failed" : 0,
          "current" : 0,
          "time" : "5ms",
          "time_in_millis" : 5
        },
        "set" : {
          "count" : 465,
          "failed" : 0,
          "current" : 0,
          "time" : "6ms",
          "time_in_millis" : 6
        },
        "split" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        }
      }
    }
  }
}

Thanks.

You don't seem to be overloaded with data or shards. You do have a lot of pipelines, though, which might be adding pressure to the system.
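
As a rough illustration of the "not overloaded" reading, a few ratios can be derived from the stats posted above. This is only a sketch: the helper name is made up, the data node count of 3 is taken from the node list in the first post, and the heap figure is a cluster-wide total across all 7 nodes, so it can hide pressure on an individual data node.

```python
def ratio(numerator, denominator):
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def summarize_cluster_stats(stats, data_node_count):
    """Derive a few ratios from a parsed _cluster/stats response."""
    indices = stats["indices"]
    cache = indices["query_cache"]
    heap = stats["nodes"]["jvm"]["mem"]
    total_shards = indices["shards"]["total"]
    return {
        "shards_per_data_node": ratio(total_shards, data_node_count),
        "avg_shard_size_bytes": ratio(indices["store"]["size_in_bytes"], total_shards),
        "query_cache_hit_percent": 100.0 * ratio(cache["hit_count"], cache["total_count"]),
        "heap_used_percent": 100.0 * ratio(heap["heap_used_in_bytes"], heap["heap_max_in_bytes"]),
    }


# Subset of the values from the _cluster/stats output posted above.
posted = {
    "indices": {
        "shards": {"total": 130},
        "store": {"size_in_bytes": 115126496193},
        "query_cache": {"total_count": 7479341610, "hit_count": 23410800},
    },
    "nodes": {
        "jvm": {"mem": {"heap_used_in_bytes": 9099716536,
                        "heap_max_in_bytes": 19327352832}},
    },
}

summary = summarize_cluster_stats(posted, data_node_count=3)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With these numbers the average shard is under 1 GB and there are about 43 shards per data node, which fits the observation above. The query cache hit rate comes out at roughly 0.3%, with many evictions, which may or may not be related to the CPU load but could be worth a look.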

Is there anything that is standing out in the Monitoring section, other than high heap/GC?

Here is an example of what I am seeing for one of the data nodes in the Monitoring section: