Very slow on-prem Elasticsearch 8.6.0 cluster

Hi everyone,

My ES cluster has become very slow, even though I have not changed the data volume or the number of shards.
The cluster runs version 8.6.0 with these nodes:

  • es01: "data_content","data_hot","ingest","master","transform"
  • es02: "data_content","data_hot","ingest","master","transform"
  • es03: "master", "voting_only"
  • es04: "data_cold"

On es04 I see these warning messages repeatedly:

[2023-02-08T14:20:21,909][WARN ][o.e.t.OutboundHandler    ] [es04] sending transport message [Response{316186378}{false}{false}{false}{class org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$NodeResponse}] of size [857237] on [Netty4TcpChannel{localAddress=/<es04_ip>:9300, remoteAddress=/<es01_ip>:29994, profile=default}] took [5231ms] which is above the warn threshold of [5000ms] with success [true] 
[2023-02-08T14:20:32,165][WARN ][o.e.t.OutboundHandler    ] [es04] sending transport message [Response{316189298}{false}{false}{false}{class org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$NodeResponse}] of size [857237] on [Netty4TcpChannel{localAddress=/<es04_ip>:9300, remoteAddress=/<es01_ip>:29996, profile=default}] took [5507ms] which is above the warn threshold of [5000ms] with success [true]
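
To put these warnings in perspective, the effective rate of each send can be worked out from the logged size and duration. The following is a minimal sketch, assuming the logged size is in bytes; `parse_outbound_warning` and `rate_kib_per_s` are hypothetical helper names, and the sample line is abbreviated from the first warning above.

```python
import re

# Matches the size and duration fields of an OutboundHandler warning line.
PATTERN = re.compile(r"of size \[(\d+)\].*?took \[(\d+)ms\]")

def parse_outbound_warning(line):
    """Return (size, took_ms) from a warning line, or None if it does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))

def rate_kib_per_s(size_bytes, took_ms):
    """Effective send rate in KiB/s, assuming the size is in bytes."""
    return size_bytes / 1024 / (took_ms / 1000)

# Abbreviated copy of the first es04 warning (elided parts shown as "...").
line = ("sending transport message [Response{316186378}...] of size [857237] "
        "on [Netty4TcpChannel{...}] took [5231ms] which is above the warn "
        "threshold of [5000ms] with success [true]")

size, took = parse_outbound_warning(line)
print(f"{size} bytes in {took} ms = {rate_kib_per_s(size, took):.0f} KiB/s")
```

Under that assumption the first warning works out to roughly 160 KiB/s for a response of about 837 KiB, which would be very low for a local network link and may point to a stall on the sender or the network path rather than a bandwidth limit.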

On es01 I get these warnings:

[2023-02-08T13:16:30,590][WARN ][o.e.c.InternalClusterInfoService] [es01] failed to retrieve stats for node [Nmd_-r9DRFSheCUqLn8jPw] org.elasticsearch.transport.ReceiveTimeoutTransportException: [es04][<es04_ip>:9300][cluster:monitor/nodes/stats[n]] request_id [315291918] timed out after [15008ms] 
[2023-02-08T13:16:42,222][WARN ][o.e.t.TransportService   ] [es01] Received response for a request that has timed out, sent [26.8s/26815ms] ago, timed out [11.8s/11807ms] ago, action [cluster:monitor/nodes/stats[n]], node [{es04}{Nmd_-r9DRFSheCUqLn8jPw}{80PziYyxTHeGsjEtC-DDtg}{es04}{<es04_ip>}{<es04_ip>:9300}{c}{xpack.installed=true}], id [315291918]
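
Both es01 messages carry the same request id [315291918], so they can be read together as one timeline. A small sketch of that arithmetic, using only the millisecond values from the second message (the variable names are illustrative):

```python
# Values copied from the TransportService warning on es01 (milliseconds).
sent_ago_ms = 26815       # "sent [26.8s/26815ms] ago"
timed_out_ago_ms = 11807  # "timed out [11.8s/11807ms] ago"

# How long es01 waited before giving up on the request.
timeout_ms = sent_ago_ms - timed_out_ago_ms
# How long es04 took in total to deliver its stats response.
response_ms = sent_ago_ms

print(f"timed out after {timeout_ms} ms, response arrived after {response_ms} ms")
```

The difference is 15008 ms, which matches the "timed out after [15008ms]" in the first message, so the node stats response from es04 arrived about 11.8 seconds too late.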

The output of the _cluster/stats?pretty&human API:

{
  "_nodes": {
    "total": 4,
    "successful": 4,
    "failed": 0
  },
  "cluster_name": "<my_cluster>",
  "cluster_uuid": "KlOFqtglR6a8BDUVMN3_Dw",
  "timestamp": 1676101153599,
  "status": "green",
  "indices": {
    "count": 389,
    "shards": {
      "total": 537,
      "primaries": 389,
      "replication": 0.38046272493573263,
      "index": {
        "shards": {
          "min": 1,
          "max": 2,
          "avg": 1.3804627249357326
        },
        "primaries": {
          "min": 1,
          "max": 1,
          "avg": 1
        },
        "replication": {
          "min": 0,
          "max": 1,
          "avg": 0.38046272493573263
        }
      }
    },
    "docs": {
      "count": 19565100046,
      "deleted": 314738
    },
    "store": {
      "size": "7.9tb",
      "size_in_bytes": 8713808466017,
      "total_data_set_size": "7.9tb",
      "total_data_set_size_in_bytes": 8713808466017,
      "reserved": "0b",
      "reserved_in_bytes": 0
    },
    "fielddata": {
      "memory_size": "53.2mb",
      "memory_size_in_bytes": 55833664,
      "evictions": 0
    },
    "query_cache": {
      "memory_size": "29.8mb",
      "memory_size_in_bytes": 31261813,
      "total_count": 116574573,
      "hit_count": 6427948,
      "miss_count": 110146625,
      "cache_size": 23034,
      "cache_count": 47878,
      "evictions": 24844
    },
    "completion": {
      "size": "0b",
      "size_in_bytes": 0
    },
    "segments": {
      "count": 8322,
      "memory": "0b",
      "memory_in_bytes": 0,
      "terms_memory": "0b",
      "terms_memory_in_bytes": 0,
      "stored_fields_memory": "0b",
      "stored_fields_memory_in_bytes": 0,
      "term_vectors_memory": "0b",
      "term_vectors_memory_in_bytes": 0,
      "norms_memory": "0b",
      "norms_memory_in_bytes": 0,
      "points_memory": "0b",
      "points_memory_in_bytes": 0,
      "doc_values_memory": "0b",
      "doc_values_memory_in_bytes": 0,
      "index_writer_memory": "86.7mb",
      "index_writer_memory_in_bytes": 90975452,
      "version_map_memory": "123.5kb",
      "version_map_memory_in_bytes": 126534,
      "fixed_bit_set": "1.5gb",
      "fixed_bit_set_memory_in_bytes": 1616717456,
      "max_unsafe_auto_id_timestamp": 1676096426797,
      "file_sizes": {}
    },
    "mappings": {
      "total_field_count": 278869,
      "total_deduplicated_field_count": 128808,
      "total_deduplicated_mapping_size": "614.4kb",
      "total_deduplicated_mapping_size_in_bytes": 629220,
      "field_types": [
        {
          "name": "alias",
          "count": 2423,
          "index_count": 27,
          "script_count": 0
        },
        {
          "name": "binary",
          "count": 1,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "boolean",
          "count": 3882,
          "index_count": 303,
          "script_count": 0
        },
        {
          "name": "byte",
          "count": 3,
          "index_count": 3,
          "script_count": 0
        },
        {
          "name": "constant_keyword",
          "count": 1135,
          "index_count": 298,
          "script_count": 0
        },
        {
          "name": "date",
          "count": 8926,
          "index_count": 337,
          "script_count": 0
        },
        {
          "name": "date_nanos",
          "count": 1,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "date_range",
          "count": 21,
          "index_count": 21,
          "script_count": 0
        },
        {
          "name": "double",
          "count": 818,
          "index_count": 22,
          "script_count": 0
        },
        {
          "name": "double_range",
          "count": 1,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "flattened",
          "count": 1487,
          "index_count": 149,
          "script_count": 0
        },
        {
          "name": "float",
          "count": 1915,
          "index_count": 172,
          "script_count": 0
        },
        {
          "name": "float_range",
          "count": 1,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "geo_point",
          "count": 1206,
          "index_count": 223,
          "script_count": 0
        },
        {
          "name": "geo_shape",
          "count": 1,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "half_float",
          "count": 33,
          "index_count": 12,
          "script_count": 0
        },
        {
          "name": "histogram",
          "count": 3,
          "index_count": 3,
          "script_count": 0
        },
        {
          "name": "integer",
          "count": 15,
          "index_count": 13,
          "script_count": 0
        },
        {
          "name": "integer_range",
          "count": 1,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "ip",
          "count": 2827,
          "index_count": 321,
          "script_count": 0
        },
        {
          "name": "ip_range",
          "count": 1,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "keyword",
          "count": 163665,
          "index_count": 338,
          "script_count": 0
        },
        {
          "name": "long",
          "count": 29241,
          "index_count": 290,
          "script_count": 0
        },
        {
          "name": "long_range",
          "count": 1,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "match_only_text",
          "count": 7311,
          "index_count": 238,
          "script_count": 0
        },
        {
          "name": "nested",
          "count": 1698,
          "index_count": 154,
          "script_count": 0
        },
        {
          "name": "object",
          "count": 48130,
          "index_count": 334,
          "script_count": 0
        },
        {
          "name": "scaled_float",
          "count": 617,
          "index_count": 155,
          "script_count": 0
        },
        {
          "name": "shape",
          "count": 1,
          "index_count": 1,
          "script_count": 0
        },
        {
          "name": "short",
          "count": 425,
          "index_count": 10,
          "script_count": 0
        },
        {
          "name": "text",
          "count": 616,
          "index_count": 160,
          "script_count": 0
        },
        {
          "name": "unsigned_long",
          "count": 31,
          "index_count": 7,
          "script_count": 0
        },
        {
          "name": "version",
          "count": 22,
          "index_count": 22,
          "script_count": 0
        },
        {
          "name": "wildcard",
          "count": 2410,
          "index_count": 191,
          "script_count": 0
        }
      ],
      "runtime_field_types": []
    },
    "analysis": {
      "char_filter_types": [],
      "tokenizer_types": [],
      "filter_types": [],
      "analyzer_types": [],
      "built_in_char_filters": [],
      "built_in_tokenizers": [],
      "built_in_filters": [],
      "built_in_analyzers": []
    },
    "versions": [
      {
        "version": "8.0.0",
        "index_count": 22,
        "primary_shard_count": 22,
        "total_primary_size": "18.2mb",
        "total_primary_bytes": 19104002
      },
      {
        "version": "8.1.2",
        "index_count": 192,
        "primary_shard_count": 192,
        "total_primary_size": "2.2tb",
        "total_primary_bytes": 2496656326799
      },
      {
        "version": "8.6.0",
        "index_count": 175,
        "primary_shard_count": 175,
        "total_primary_size": "5.2tb",
        "total_primary_bytes": 5783902462356
      }
    ],
    "search": {
      "total": 807643,
      "queries": {
        "regexp": 4254,
        "bool": 687574,
        "prefix": 3599,
        "match": 300864,
        "range": 380379,
        "nested": 33,
        "wildcard": 3,
        "multi_match": 200,
        "match_phrase": 146783,
        "terms": 161705,
        "constant_score": 859,
        "match_phrase_prefix": 45,
        "ids": 4028,
        "match_all": 109621,
        "exists": 357960,
        "term": 515557,
        "simple_query_string": 76982,
        "query_string": 8713
      },
      "sections": {
        "highlight": 270,
        "search_after": 7330,
        "stored_fields": 608,
        "runtime_mappings": 144929,
        "query": 718033,
        "script_fields": 608,
        "_source": 13261,
        "pit": 19416,
        "terminate_after": 156,
        "fields": 116887,
        "collapse": 50607,
        "aggs": 191626
      }
    }
  },
  "nodes": {
    "count": {
      "total": 4,
      "coordinating_only": 0,
      "data": 0,
      "data_cold": 1,
      "data_content": 2,
      "data_frozen": 0,
      "data_hot": 2,
      "data_warm": 0,
      "index": 0,
      "ingest": 2,
      "master": 3,
      "ml": 0,
      "remote_cluster_client": 0,
      "search": 0,
      "transform": 2,
      "voting_only": 1
    },
    "versions": [
      "8.6.0"
    ],
    "os": {
      "available_processors": 48,
      "allocated_processors": 48,
      "names": [
        {
          "name": "Linux",
          "count": 4
        }
      ],
      "pretty_names": [
        {
          "pretty_name": "Ubuntu 18.04.6 LTS",
          "count": 4
        }
      ],
      "architectures": [
        {
          "arch": "amd64",
          "count": 4
        }
      ],
      "mem": {
        "total": "92gb",
        "total_in_bytes": 98881466368,
        "adjusted_total": "92gb",
        "adjusted_total_in_bytes": 98881466368,
        "free": "4.7gb",
        "free_in_bytes": 5093277696,
        "used": "87.3gb",
        "used_in_bytes": 93788188672,
        "free_percent": 5,
        "used_percent": 95
      }
    },
    "process": {
      "cpu": {
        "percent": 46
      },
      "open_file_descriptors": {
        "min": 560,
        "max": 4248,
        "avg": 2143
      }
    },
    "jvm": {
      "max_uptime": "3d",
      "max_uptime_in_millis": 261641669,
      "versions": [
        {
          "version": "19.0.1",
          "vm_name": "OpenJDK 64-Bit Server VM",
          "vm_version": "19.0.1+10-21",
          "vm_vendor": "Oracle Corporation",
          "bundled_jdk": true,
          "using_bundled_jdk": true,
          "count": 4
        }
      ],
      "mem": {
        "heap_used": "13.8gb",
        "heap_used_in_bytes": 14855281888,
        "heap_max": "44.1gb",
        "heap_max_in_bytes": 47404023808
      },
      "threads": 502
    },
    "fs": {
      "total": "16.7tb",
      "total_in_bytes": 18373235757056,
      "free": "8.7tb",
      "free_in_bytes": 9591671062528,
      "available": "8tb",
      "available_in_bytes": 8810034135040
    },
    "plugins": [],
    "network_types": {
      "transport_types": {
        "security4": 4
      },
      "http_types": {
        "security4": 4
      }
    },
    "discovery_types": {
      "multi-node": 4
    },
    "packaging_types": [
      {
        "flavor": "default",
        "type": "deb",
        "count": 4
      }
    ],
    "ingest": {
      "number_of_pipelines": 319,
      "processor_stats": {
        "append": {
          "count": 140343861,
          "failed": 0,
          "current": 0,
          "time": "2.1m",
          "time_in_millis": 126061
        },
        "community_id": {
          "count": 410689266,
          "failed": 518684,
          "current": 0,
          "time": "54.9m",
          "time_in_millis": 3294763
        },
        "conditional": {
          "count": 8518573873,
          "failed": 108781,
          "current": 0,
          "time": "1.1d",
          "time_in_millis": 95615210
        },
        "convert": {
          "count": 3536505778,
          "failed": 252735416,
          "current": 0,
          "time": "1.2h",
          "time_in_millis": 4529251
        },
        "csv": {
          "count": 4851,
          "failed": 0,
          "current": 0,
          "time": "126ms",
          "time_in_millis": 126
        },
        "date": {
          "count": 1970806100,
          "failed": 66603464,
          "current": 0,
          "time": "4.5h",
          "time_in_millis": 16390025
        },
        "dissect": {
          "count": 365017597,
          "failed": 334171895,
          "current": 0,
          "time": "59.2m",
          "time_in_millis": 3552113
        },
        "dot_expander": {
          "count": 331475023,
          "failed": 0,
          "current": 0,
          "time": "17.7m",
          "time_in_millis": 1066733
        },
        "enrich": {
          "count": 3138570,
          "failed": 0,
          "current": 0,
          "time": "2m",
          "time_in_millis": 120119
        },
        "fingerprint": {
          "count": 10065915,
          "failed": 0,
          "current": 0,
          "time": "2.5m",
          "time_in_millis": 154302
        },
        "foreach": {
          "count": 214567,
          "failed": 0,
          "current": 0,
          "time": "626ms",
          "time_in_millis": 626
        },
        "geoip": {
          "count": 2094619950,
          "failed": 380,
          "current": 0,
          "time": "6.9h",
          "time_in_millis": 25072611
        },
        "grok": {
          "count": 1277350232,
          "failed": 265091616,
          "current": 1,
          "time": "14.9d",
          "time_in_millis": 1289944840
        },
        "gsub": {
          "count": 17972404,
          "failed": 0,
          "current": 0,
          "time": "6.2m",
          "time_in_millis": 372964
        },
        "json": {
          "count": 32745164,
          "failed": 8009,
          "current": 0,
          "time": "22.9m",
          "time_in_millis": 1375130
        },
        "kv": {
          "count": 491418,
          "failed": 52107,
          "current": 0,
          "time": "18.3s",
          "time_in_millis": 18318
        },
        "lowercase": {
          "count": 589830566,
          "failed": 371402793,
          "current": 0,
          "time": "29.9m",
          "time_in_millis": 1799212
        },
        "pipeline": {
          "count": 647692291,
          "failed": 0,
          "current": 0,
          "time": "4.6m",
          "time_in_millis": 276820
        },
        "registered_domain": {
          "count": 4911936,
          "failed": 0,
          "current": 0,
          "time": "43.9s",
          "time_in_millis": 43929
        },
        "remove": {
          "count": 2663944271,
          "failed": 0,
          "current": 0,
          "time": "1h",
          "time_in_millis": 3928697
        },
        "rename": {
          "count": 4958622563,
          "failed": 59068390,
          "current": 0,
          "time": "2.3h",
          "time_in_millis": 8527009
        },
        "script": {
          "count": 1190769152,
          "failed": 5163,
          "current": 0,
          "time": "2.2h",
          "time_in_millis": 8114448
        },
        "set": {
          "count": 4968648433,
          "failed": 0,
          "current": 0,
          "time": "4.4h",
          "time_in_millis": 16039357
        },
        "set_security_user": {
          "count": 638794053,
          "failed": 0,
          "current": 0,
          "time": "1h",
          "time_in_millis": 3939516
        },
        "split": {
          "count": 585292767,
          "failed": 0,
          "current": 0,
          "time": "15m",
          "time_in_millis": 902501
        },
        "trim": {
          "count": 421192,
          "failed": 1126,
          "current": 0,
          "time": "506ms",
          "time_in_millis": 506
        },
        "uppercase": {
          "count": 69231,
          "failed": 0,
          "current": 0,
          "time": "64ms",
          "time_in_millis": 64
        },
        "uri_parts": {
          "count": 42710844,
          "failed": 186081,
          "current": 0,
          "time": "2.3m",
          "time_in_millis": 140765
        },
        "urldecode": {
          "count": 37123722,
          "failed": 0,
          "current": 0,
          "time": "1.4m",
          "time_in_millis": 86386
        },
        "user_agent": {
          "count": 72010796,
          "failed": 5192,
          "current": 0,
          "time": "36.4m",
          "time_in_millis": 2186592
        }
      }
    },
    "indexing_pressure": {
      "memory": {
        "current": {
          "combined_coordinating_and_primary": "0b",
          "combined_coordinating_and_primary_in_bytes": 0,
          "coordinating": "0b",
          "coordinating_in_bytes": 0,
          "primary": "0b",
          "primary_in_bytes": 0,
          "replica": "0b",
          "replica_in_bytes": 0,
          "all": "0b",
          "all_in_bytes": 0
        },
        "total": {
          "combined_coordinating_and_primary": "0b",
          "combined_coordinating_and_primary_in_bytes": 0,
          "coordinating": "0b",
          "coordinating_in_bytes": 0,
          "primary": "0b",
          "primary_in_bytes": 0,
          "replica": "0b",
          "replica_in_bytes": 0,
          "all": "0b",
          "all_in_bytes": 0,
          "coordinating_rejections": 0,
          "primary_rejections": 0,
          "replica_rejections": 0
        },
        "limit": "0b",
        "limit_in_bytes": 0
      }
    }
  }
}
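
The ingest section of this output is easier to read when sorted. The sketch below ranks a subset of the `processor_stats` entries above by cumulative time and shows the share of failed executions; `summarize` is a hypothetical helper, and the numbers are copied from the output as posted.

```python
# Subset of nodes.ingest.processor_stats from the cluster stats output above.
processor_stats = {
    "grok":        {"count": 1277350232, "failed": 265091616, "time_in_millis": 1289944840},
    "conditional": {"count": 8518573873, "failed": 108781,    "time_in_millis": 95615210},
    "geoip":       {"count": 2094619950, "failed": 380,       "time_in_millis": 25072611},
    "date":        {"count": 1970806100, "failed": 66603464,  "time_in_millis": 16390025},
    "dissect":     {"count": 365017597,  "failed": 334171895, "time_in_millis": 3552113},
    "lowercase":   {"count": 589830566,  "failed": 371402793, "time_in_millis": 1799212},
}

def summarize(stats):
    """Return (name, time_in_millis, failure_ratio) rows, slowest processor first."""
    rows = []
    for name, s in stats.items():
        fail_ratio = s["failed"] / s["count"] if s["count"] else 0.0
        rows.append((name, s["time_in_millis"], fail_ratio))
    return sorted(rows, key=lambda row: row[1], reverse=True)

for name, millis, fail_ratio in summarize(processor_stats):
    print(f"{name:<12} {millis / 3_600_000:8.1f} h  {fail_ratio:6.1%} failed")
```

In this subset grok accounts for by far the most processing time, and dissect and lowercase fail on most of the documents they process. Whether that is related to the slowness is not established in this thread, but it may be worth checking.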

CPU, disk, and heap usage all look normal, and the cluster does not have many shards.

Can you help me find out the reason why my cluster is so slow?
Thanks.

You mentioned that the cluster is slow, but did not specify how or for what operations apart from showing the warning messages around stats collection. Could you please elaborate?

What type of hardware is this cluster deployed on? What is the hardware specification of the different nodes in terms of CPU and RAM allocation? What type of storage do the different nodes use?

Any task in Kibana that goes through this ES cluster is very slow, for example opening Discover to search logs.
All cluster nodes are VMs deployed on an ESXi server. Storage is on a SAN (I don't know whether it is HDD or SSD, but the cluster performed well on it before):

  • es01, es02: 16 CPU cores, 32 GB memory, 700 GB disk.
  • es03: 4 CPU cores, 4 GB memory, 120 GB disk.
  • es04: 16 CPU cores, 24 GB memory, 14 TB disk.

How are you searching for logs?

What time period do you usually query when searching logs?

At what point do you move data to the cold tier?

Is it slow to search logs if you specify a time window so you only target data not on the cold nodes?

How does this compare to querying a larger window that also includes data on the cold tier?

Can you run iostat -x on the nodes while you are querying and it is slow to see what disk I/O looks like? Elasticsearch performance is often limited by the storage used.

It looks like there may not be enough disk space. The default watermark settings are:

cluster.routing.allocation.disk.watermark.low: (Default 85%)
cluster.routing.allocation.disk.watermark.high: (Default 90%)
cluster.routing.allocation.disk.watermark.flood_stage: (Default 95%)

Searching is slow with any time window, whether it covers the hot tier, the cold tier, or both.
Opening Discover is also very slow (it stays stuck on the loading screen for a long time before I can search), and Kibana feels unresponsive when switching Data Views.

Here is the iostat -x output on the cold node while querying a large window that covers data on both the hot and cold tiers:

Linux 4.15.0-175-generic (es04) 	02/13/2023 	_x86_64_	(12 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.71    0.00    0.29    1.06    0.00   97.93

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
loop0            0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     1.86     0.00   0.00   0.00
fd0              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00   56.00    0.00   0.00     4.00     0.00  56.00   0.00
sdb              0.09    0.33     10.46     15.67     0.00     0.83   0.59  71.63    3.48    1.16   0.00   118.26    47.87   0.69   0.03
sdc              0.20    0.22     10.72     77.72     0.00     0.02   1.96   7.32    3.68    5.45   0.00    52.57   347.83   1.02   0.04
sdd              0.16    0.45     12.75    207.90     0.00     0.02   0.91   4.30    4.53    7.86   0.00    80.16   459.47   0.86   0.05
sde              0.22    0.95     25.14    514.74     0.00     0.02   0.58   1.58    5.34   14.59   0.02   113.75   539.99   0.57   0.07
sdf              0.23    0.83     24.15    514.24     0.00     0.02   0.69   1.90    6.63   18.93   0.02   107.08   621.49   0.58   0.06
sdg              0.07    0.90      7.31    514.34     0.00     0.01   0.05   0.91    6.33   18.72   0.02   100.22   571.73   0.41   0.04
sdh              0.09    0.90      9.38    496.33     0.00     0.01   1.31   0.83    8.01   18.79   0.02   102.69   553.05   0.46   0.05
sdi              0.07    0.65      7.63    342.59     0.00     0.01   0.32   0.85    3.95   18.81   0.01   107.89   527.24   0.41   0.03
sdj              0.09    0.82      9.16    421.05     0.00     0.01   0.18   1.14    4.92   18.04   0.02    98.75   516.54   0.44   0.04
sdk              0.07    0.00      8.64      0.03     0.00     0.01   0.09  86.27    3.82    2.50   0.00   116.49    29.14   1.76   0.01
sda              1.16    4.80     36.95     85.79     0.06     2.53   4.84  34.51    4.43    1.62   0.01    31.82    17.88   0.88   0.52
dm-0             1.31    6.98    125.33   6702.65     0.00     0.00   0.00   0.00    5.58   13.85   0.10    95.63   960.33   0.46   0.38
dm-1             1.22    7.32     36.91    103.12     0.00     0.00   0.00   0.00    4.74    1.31   0.02    30.24    14.08   0.61   0.53
dm-2             0.00    0.00      0.01      0.00     0.00     0.00   0.00   0.00    2.06    0.00   0.00    22.66     0.00   1.36   0.00

The iostat -x output on the hot node:

Linux 4.15.0-175-generic (sdc01-es01-vp) 	02/13/2023 	_x86_64_	(16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          28.04    0.00    2.63   12.17    0.00   57.16

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
loop0            0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     1.86     0.00   0.00   0.00
fd0              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00   56.00    0.00   0.00     4.00     0.00  56.00   0.00
sda            122.03  313.82  12444.41  35764.24    12.49   954.49   9.29  75.26    4.40    2.29   1.24   101.98   113.96   1.76  76.76
dm-0           134.66 1268.33  12444.33  58160.20     0.00     0.00   0.00   0.00    9.27    0.77   0.83    92.41    45.86   0.55  76.79
dm-1             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    4.13    0.00   0.00    21.04     0.00   3.24   0.00
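
When comparing nodes it can help to pull out only the devices that look busy. The sketch below parses a device table in the format shown above and returns the devices whose %util is at or above a threshold; `busy_devices` and the 50% default are illustrative choices, not values from this thread. Keep in mind that a single `iostat -x` run without an interval reports averages since boot, so sampling with an interval (for example `iostat -x 5`) while a slow query is running gives a more representative picture.

```python
def busy_devices(iostat_text, util_threshold=50.0):
    """Return {device: %util} for devices at or above the threshold."""
    busy = {}
    util_col = None  # index of the %util column once the header is seen
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            util_col = None  # a blank line ends the device table
        elif fields[0] == "Device":
            util_col = fields.index("%util")
        elif util_col is not None:
            util = float(fields[util_col])
            if util >= util_threshold:
                busy[fields[0]] = util
    return busy

# Device rows copied from the hot node output above.
sample = """\
Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda            122.03  313.82  12444.41  35764.24    12.49   954.49   9.29  75.26    4.40    2.29   1.24   101.98   113.96   1.76  76.76
dm-0           134.66 1268.33  12444.33  58160.20     0.00     0.00   0.00   0.00    9.27    0.77   0.83    92.41    45.86   0.55  76.79
dm-1             0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    4.13    0.00   0.00    21.04     0.00   3.24   0.00
"""
print(busy_devices(sample))
```

On the hot node this flags sda and dm-0 at roughly 77% utilization, while no device in the cold node output comes close to the threshold.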

As for free disk space, it should be sufficient on every node and does not reach the disk watermark limits (~200 GB of 700 GB free on the hot node, 7 TB of 14 TB free on the cold node).
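
The free-space figures can be checked against the default watermarks quoted earlier in the thread. This is a rough sketch that treats the watermarks as percentages of disk used and uses the approximate totals from the post; `exceeded_watermarks` is a hypothetical helper.

```python
# Default watermarks quoted earlier in the thread, as percent of disk used.
WATERMARKS = {"low": 85.0, "high": 90.0, "flood_stage": 95.0}

def exceeded_watermarks(total, free):
    """Return (used_percent, names of watermarks at or below that usage)."""
    used_pct = 100.0 * (total - free) / total
    hit = [name for name, limit in WATERMARKS.items() if used_pct >= limit]
    return used_pct, hit

# Approximate (total, free) figures from the post, in GB.
nodes = {"hot": (700, 200), "cold": (14000, 7000)}
for name, (total, free) in nodes.items():
    used_pct, hit = exceeded_watermarks(total, free)
    print(f"{name}: {used_pct:.1f}% used, watermarks exceeded: {hit or 'none'}")
```

With these figures the hot node is at about 71% used and the cold node at 50%, both below the 85% low watermark, which supports the point that disk space is not the limiting factor here.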

One thing that stands out to me from the screenshot in your initial post: es02 has a load average of 17.52. Is this still the case? You mention in your follow-up post that es02 has 16 CPU cores. If so, the load average should ideally stay below 16. A load average greater than the core count generally indicates that the node is being overloaded at times, which could cause slowness.

Depending on which node, es01 or es02, is the elected master at the time, that node may need more CPU than it has available.

Some recommendations:

  1. Check the load averages of your nodes. If they are regularly above the number of CPU cores on the node, consider increasing the number of CPU cores available to it.
  2. See what is using the CPU on these machines and how much time is spent on things like I/O wait (disk I/O).
  3. I see you have the transform role enabled on your nodes. Did someone add new transforms or change existing ones in a way that could have added load?
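
The first recommendation can be checked with a few lines run on each node. This is a minimal sketch, assuming a Linux or other Unix host where `os.getloadavg()` is available; `load_per_core` is a hypothetical helper, and the es02 figures are the ones mentioned above.

```python
import os

def load_per_core(load_avg, cores):
    """Load average divided by core count; values above 1.0 suggest overload."""
    return load_avg / cores

# Figures mentioned in the thread: es02 load average 17.52 on 16 cores.
print(f"es02: {load_per_core(17.52, 16):.2f} load per core")

# On a live node, the same check against the 1, 5, and 15 minute averages.
cores = os.cpu_count() or 1
for window, load in zip(("1m", "5m", "15m"), os.getloadavg()):
    flag = "over" if load > cores else "ok"
    print(f"{window}: load {load:.2f} on {cores} cores ({flag})")
```

On Linux the load average also counts tasks in uninterruptible I/O wait, so a value above the core count can reflect slow storage as well as CPU saturation, which is why the second recommendation is worth checking alongside the first.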
