Performance problem

I'm trying to figure out why our cluster seems to be perpetually maxed out. Node 3 does the majority of the ingesting and is consistently around 90-95% CPU utilization. The other two nodes vary from 30-85% CPU utilization. Within the last month or so, we have started regularly experiencing a few unassigned shards. The cause is always allocation failure, and manual allocation solves the problem. Our Elastic cluster is self-managed and runs on VMs in a dedicated VMware cluster. We use it as a SIEM as well as for certificate and uptime monitoring. We have a Fleet server managing about 250 Elastic Agents: mostly Windows hosts using the System, Windows, and Sysmon integrations; some Linux agents using the System integration; a host that uses the System, Cisco Meraki, and Microsoft Defender for Endpoint integrations; a host that uses the System, TCP Custom Logs, and vSphere integrations; and a couple of DHCP servers that use the MS DHCP integration.

I've included the output of GET _cluster/stats below. I can post the output of other commands if needed. Any suggestions to point me in the right direction would be very helpful.

Thank you in advance.

{
  "_nodes": {
    "total": 3,
    "successful": 3,
    "failed": 0
  },
  "cluster_name": "xxxxxxxxxxx",
  "cluster_uuid": "xxxxxxxxxxxxxxxxxxxxx",
  "timestamp": 1714487275838,
  "status": "green",
  "indices": {
    "count": 280,
    "shards": {
      "total": 561,
      "primaries": 280,
      "replication": 1.0035714285714286,
      "index": {
        "shards": {
          "min": 2,
          "max": 3,
          "avg": 2.0035714285714286
        },
        "primaries": {
          "min": 1,
          "max": 1,
          "avg": 1
        },
        "replication": {
          "min": 1,
          "max": 2,
          "avg": 1.0035714285714286
        }
      }
    },
    "docs": {
      "count": 11128208032,
      "deleted": 141714
    },
    "store": {
      "size": "6.5tb",
      "size_in_bytes": 7162796933088,
      "total_data_set_size": "6.5tb",
      "total_data_set_size_in_bytes": 7162796933088,
      "reserved": "0b",
      "reserved_in_bytes": 0
    },
    "fielddata": {
      "memory_size": "638.3kb",
      "memory_size_in_bytes": 653648,
      "evictions": 9619,
      "global_ordinals": {
        "build_time": "3.8h",
        "build_time_in_millis": 13934138
      }
    },
    "query_cache": {
      "memory_size": "2.3gb",
      "memory_size_in_bytes": 2518797659,
      "total_count": 581852207,
      "hit_count": 63986884,
      "miss_count": 517865323,
      "cache_size": 173123,
      "cache_count": 2252620,
      "evictions": 2079497
    },
    "completion": {
      "size": "0b",
      "size_in_bytes": 0
    },
    "segments": {
      "count": 14151,
      "memory": "0b",
      "memory_in_bytes": 0,
      "terms_memory": "0b",
      "terms_memory_in_bytes": 0,
      "stored_fields_memory": "0b",
      "stored_fields_memory_in_bytes": 0,
      "term_vectors_memory": "0b",
      "term_vectors_memory_in_bytes": 0,
      "norms_memory": "0b",
      "norms_memory_in_bytes": 0,
      "points_memory": "0b",
      "points_memory_in_bytes": 0,
      "doc_values_memory": "0b",
      "doc_values_memory_in_bytes": 0,
      "index_writer_memory": "110.9mb",
      "index_writer_memory_in_bytes": 116341154,
      "version_map_memory": "910.5kb",
      "version_map_memory_in_bytes": 932353,
      "fixed_bit_set": "5mb",
      "fixed_bit_set_memory_in_bytes": 5325704,
      "max_unsafe_auto_id_timestamp": 1714479426485,
      "file_sizes": {}
    },
   ...
truncated
    "versions": [
      "8.12.2"
    ],
    "os": {
      "available_processors": 40,
      "allocated_processors": 40,
      "names": [
        {
          "name": "Linux",
          "count": 3
        }
      ],
      "pretty_names": [
        {
          "pretty_name": "Ubuntu 22.04.4 LTS",
          "count": 3
        }
      ],
      "architectures": [
        {
          "arch": "amd64",
          "count": 3
        }
      ],
      "mem": {
        "total": "58.7gb",
        "total_in_bytes": 63050027008,
        "adjusted_total": "58.7gb",
        "adjusted_total_in_bytes": 63050027008,
        "free": "1.9gb",
        "free_in_bytes": 2135818240,
        "used": "56.7gb",
        "used_in_bytes": 60914208768,
        "free_percent": 3,
        "used_percent": 97
      }
    },
    "process": {
      "cpu": {
        "percent": 265
      },
      "open_file_descriptors": {
        "min": 2766,
        "max": 2974,
        "avg": 2884
      }
    },
    "jvm": {
      "max_uptime": "40d",
      "max_uptime_in_millis": 3463242910,
      "versions": [
        {
          "version": "21.0.2",
          "vm_name": "OpenJDK 64-Bit Server VM",
          "vm_version": "21.0.2+13-58",
          "vm_vendor": "Oracle Corporation",
          "bundled_jdk": true,
          "using_bundled_jdk": true,
          "count": 3
        }
      ],
      "mem": {
        "heap_used": "21.3gb",
        "heap_used_in_bytes": 22924166096,
        "heap_max": "30gb",
        "heap_max_in_bytes": 32212254720
      },
      "threads": 960
    },
    "fs": {
      "total": "8.8tb",
      "total_in_bytes": 9736877236224,
      "free": "2.3tb",
      "free_in_bytes": 2572628639744,
      "available": "1.8tb",
      "available_in_bytes": 2077798400000
    },
    "plugins": [],
    "network_types": {
      "transport_types": {
        "security4": 3
      },
      "http_types": {
        "security4": 3
      }
    },
    "discovery_types": {
      "multi-node": 3
    },
    "packaging_types": [
      {
        "flavor": "default",
        "type": "deb",
        "count": 3
      }
    ],
    "ingest": {
      "number_of_pipelines": 199,
      "processor_stats": {
        "append": {
          "count": 11979076674,
          "failed": 0,
          "current": 0,
          "time": "1.5d",
          "time_in_millis": 133445345
        },
...
truncated

You truncated the output, so it is not clear whether all nodes are identical in roles and specification. Why does one specific node do most of the indexing? Does that node have a specific role, or is it the one clients connect to?

The nodes are all the same as shown in the output of /_cat/nodes below.

ip         heap.percent ram.percent cpu load_1m load_5m load_15m node.role   master name
xxxxxxxxx           50          99  90   15.56   17.10    16.46 cdfhilmrstw *      xxxxxxxxxx
xxxxxxxxx           67          95  91   20.15   20.49    20.60 cdfhilmrstw -      xxxxxxxxxx
xxxxxxxxx           70          99  51    7.98    6.82     8.08 cdfhilmrstw -      xxxxxxxxx

In our internal DNS, the domain name the agents connect to resolves to node 3. This accounts for about 85% of the elastic agent traffic.

That sounds like a bad setup. What happens if that node goes down? What is the point of having a highly available cluster if only one node is exposed to the clients?

I would recommend configuring all clients with a list of all nodes in the cluster so they can distribute requests across the cluster. That should even out the load and resolve the imbalance. This is how Elasticsearch is generally deployed.

What are the specs for your nodes, like CPU, RAM, Heap configured and disk type?

Did you make any changes to the settings of the index templates, or are you using the defaults? Mostly the number of shards and refresh_interval.


I would recommend configuring all clients with a list of all nodes

Is this done in the Elastic Agent config? If so, where?

Each node has 20GB RAM. Nodes 1 & 2 have 14 CPUs and node 3 has 12 CPUs. Heap settings are shown in my original post. The disk type is Compellent shared storage.

Have you looked at the official documentation?

For optimal performance Elasticsearch requires fast storage, ideally local SSDs. I am not familiar with this type of storage, but I would recommend you look at await and disk utilisation on the nodes that are under the heaviest load. Storage performance is one of the most common limiting factors for Elasticsearch.
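For example, with the sysstat package installed, extended device statistics show await and utilisation per device (the 5-second sample interval here is arbitrary):

```
# Extended per-device stats every 5 seconds; watch r_await/w_await (ms)
# and %util. Sustained high await on the data disk points at storage
# being the bottleneck rather than CPU.
iostat -dx 5
```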

You need to configure it in Fleet Settings in Kibana: in the Outputs section you will find your Elasticsearch output, where you need to add your two other nodes so the Agents will load balance requests between them.
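In the policy that the Agents receive, the output then ends up looking something like this (the hostnames below are placeholders for your actual node addresses):

```
outputs:
  default:
    type: elasticsearch
    hosts:
      - "https://es-node1.example.internal:9200"
      - "https://es-node2.example.internal:9200"
      - "https://es-node3.example.internal:9200"
```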

But I do not think this is the issue, or at least not only this.

You said that your node 3 is constantly at 90-95% and the other 2 nodes vary from 30-85% CPU usage, but in your _cat/nodes output it is not possible to tell which node is which, and you have two nodes with CPU issues.

Can you run _cat/nodes again and share the results but identify which node is which?
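For reference, you can request an explicit set of columns so the output is self-describing (these are all standard _cat/nodes headers):

```
GET _cat/nodes?v&h=name,ip,master,node.role,cpu,heap.percent,load_1m,load_5m,load_15m&s=name
```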

But what is this storage? Is it HDD or SSD? HDDs are pretty bad for performance.

And what about this? I'm assuming you are using the default settings for everything in the Agent, right?

Did you make any changes to the settings of the templates or are using the default? Mostly the number of shards and refresh_interval.

You may be having some issues related to hot spotting.

Just the fact that your nodes are not identical can lead to this.

By default the Elastic Agent will use 1 shard and 1 replica for its data streams, which can lead to an uneven distribution of data, and the default refresh_interval of 1s can also impact performance.

I had some issues that required me to change the default number of shards to match the number of my hot nodes, and I also increased the refresh_interval to 20s.
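As a sketch, such overrides can be applied without touching the managed index template by putting the settings in the integration's @custom component template; the dataset name below is just an example, and each integration data stream has its own @custom template:

```
PUT _component_template/logs-system.security@custom
{
  "template": {
    "settings": {
      "index.number_of_shards": 3,
      "index.refresh_interval": "20s"
    }
  }
}
```

Note that these settings only take effect on new backing indices, so they apply after the data stream's next rollover.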

leandrojmp, I think you may be on to something.

output of _cat/nodes

name        master node.role   heap.percent disk.used_percent cpu
node1       *      cdfhilmrstw           56             91.93  52
node2       -      cdfhilmrstw           71             70.02  93
node3       -      cdfhilmrstw           68             74.84  93

output of _cat/thread_pool, which seems very bad.

name   node_name  queue active rejected  completed
search node1          0      0        0   61979635
search node2          0      0        0   20523393
search node3          0      0        0   14787678
write  node1         29     14        0  456452661
write  node2          1     14        0  482851374
write  node3       2457     12        0  420922570

I saw your 12/23 post asking the following

First question is, where is the number_of_shards and refresh_interval being set? Is it hardcoded somewhere?

But it didn't get an answer. Can this be done globally, and where did you make these setting changes?

Thanks