Search and bulk transport tasks stuck running for days on ES 7.17.5

elasticsearch version: 7.17.5
lucene_version: 8.11.1

Hi, we have two Elasticsearch clusters that have had degraded search and ingestion performance for the past two-plus weeks. Search/scroll/count/bulk client requests that usually take less than 2s are running for > 17 minutes. We found that both clusters had tasks running for days: Cluster A had thousands of tasks running for > 10 days, and Cluster B had tasks running for four days. We've scaled out both clusters (doubling the number of data nodes in Cluster B) but haven't seen any improvement. We only see improved performance after restarting the nodes with long-running tasks, but after a few hours long-running tasks start building up again and we eventually see degraded performance once more.
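For reference, this is roughly how the long-running tasks can be listed via the tasks API (host/port are placeholders; the jq step is just an optional client-side sort):

# In-flight search/bulk tasks and how long they've been running
curl -s "localhost:9200/_tasks?detailed=true&actions=*search*,*bulk*&pretty"

# Optional: top 5 longest-running tasks across all nodes (assumes jq is installed)
curl -s "localhost:9200/_tasks?detailed=true" \
  | jq '[.nodes[].tasks[] | {node, action, running_time_in_nanos}] | sort_by(-.running_time_in_nanos) | .[:5]'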

Most of the long-running tasks are concentrated on a small number of nodes (different nodes each time), but we've seen up to 15 nodes with tasks running for longer than an hour. The cluster isn't showing increased CPU or heap usage and isn't returning 503s or 429s.

We've noticed that our cluster data size decreases once these nodes have been restarted, so we suspect the cluster doesn't have enough resources to merge segments and clean up deleted documents. I haven't seen much correlation between specific shards and the long-running tasks, but the tasks tend to belong to our highest-traffic indices.
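In case it helps narrow things down, a quick way to see where deleted documents are piling up (host/port are placeholders):

# Indices with the most not-yet-merged-away deleted docs
curl -s "localhost:9200/_cat/indices?v&h=index,docs.count,docs.deleted,store.size&s=docs.deleted:desc" | head -20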

Cluster A:
AWS EC2 m5.2xlarge
data size: 14.8 TB

{
    "cluster_name": "",
    "status": "green",
    "timed_out": false,
    "number_of_nodes": 160,
    "number_of_data_nodes": 157,
    "active_primary_shards": 2442,
    "active_shards": 4952,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 0,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 0,
    "active_shards_percent_as_number": 100.0
}

Cluster B:
AWS EC2 r5.xlarge
data size: 3.33 TB

{
    "cluster_name": "",
    "status": "green",
    "timed_out": false,
    "number_of_nodes": 66,
    "number_of_data_nodes": 63,
    "active_primary_shards": 880,
    "active_shards": 1781,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 0,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 0,
    "active_shards_percent_as_number": 100.0
}

Do you know of any reasons why these tasks would be running for so long?

Any help would be greatly appreciated. Thanks in advance.

Elasticsearch is often limited by storage performance. What type of storage are you using? How much storage does each node have? Have you looked at storage performance, e.g. using iostat?

What does the hot threads API show on the nodes that are slow?

What is the full output of the cluster stats API?
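For reference, something along these lines would collect all three (host, port, and node names are placeholders):

# Hot threads on the nodes holding the long-running tasks
curl -s "localhost:9200/_nodes/<node-name>/hot_threads?threads=5"

# Full cluster stats
curl -s "localhost:9200/_cluster/stats?human&pretty"

# Extended per-device I/O statistics, sampled every 5 seconds
iostat -x 5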

Is this correct? You have 157 m5.2xlarge data nodes for 15 TB of data? Or is this per node?

Each node has 30 GiB of EBS storage and a 15 GB heap.

We haven't taken a close look at our storage performance, but this is the iostat output for one of the nodes:

Linux 3.10.0-1160.80.1.el7.x86_64 (data-152) 	12/19/2022 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.33    0.00    0.92    0.14    0.00   94.60

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme1n1          53.04      1044.55       379.24  504915932  183318052
nvme0n1           6.84        24.73        43.46   11955215   21006882
dm-0             53.72      1044.55       379.24  504914849  183318052

Here's the hot_threads output for a few of the nodes with 3-day-old tasks:

::: {data-150}{}{}{}{}{}{ml.machine_memory=32727298048, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=15032385536, transform.node=true}
   Hot threads at 2022-12-19T17:32:28.977Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {data-207}{}{}{}{}{}{ml.machine_memory=33079619584, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=15032385536, transform.node=true}
   Hot threads at 2022-12-19T17:34:35.618Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {data-152}{}{}{}{}{}{ml.machine_memory=32727298048, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=15032385536, transform.node=true}
   Hot threads at 2022-12-19T17:35:56.294Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

Cluster stats output:

{
    "_nodes": {
        "total": 160,
        "successful": 160,
        "failed": 0
    },
    "cluster_name": "",
    "cluster_uuid": "",
    "timestamp": 1671471473934,
    "status": "green",
    "indices": {
        "count": 371,
        "shards": {
            "total": 4992,
            "primaries": 2462,
            "replication": 1.0276198212835093,
            "index": {
                "shards": {
                    "min": 2,
                    "max": 142,
                    "avg": 13.455525606469003
                },
                "primaries": {
                    "min": 1,
                    "max": 71,
                    "avg": 6.636118598382749
                },
                "replication": {
                    "min": 1.0,
                    "max": 68.0,
                    "avg": 1.1832884097035041
                }
            }
        },
        "docs": {
            "count": 107318112563,
            "deleted": 22390249222
        },
        "store": {
            "size_in_bytes": 16569470119330,
            "total_data_set_size_in_bytes": 16569470119330,
            "reserved_in_bytes": 0
        },
        "fielddata": {
            "memory_size_in_bytes": 9425147456,
            "evictions": 0
        },
        "query_cache": {
            "memory_size_in_bytes": 20508901766,
            "total_count": 12615585675,
            "hit_count": 233399010,
            "miss_count": 12382186665,
            "cache_size": 4791806,
            "cache_count": 6632333,
            "evictions": 1840527
        },
        "completion": {
            "size_in_bytes": 0
        },
        "segments": {
            "count": 96858,
            "memory_in_bytes": 1410578530,
            "terms_memory_in_bytes": 462039416,
            "stored_fields_memory_in_bytes": 649930480,
            "term_vectors_memory_in_bytes": 0,
            "norms_memory_in_bytes": 4330624,
            "points_memory_in_bytes": 0,
            "doc_values_memory_in_bytes": 294278010,
            "index_writer_memory_in_bytes": 739513238,
            "version_map_memory_in_bytes": 15387294,
            "fixed_bit_set_memory_in_bytes": 32153610840,
            "max_unsafe_auto_id_timestamp": 1671408000479,
            "file_sizes": {}
        },
        "mappings": {
            "field_types": [
                {
                    "name": "boolean",
                    "count": 1245,
                    "index_count": 359,
                    "script_count": 0
                },
                {
                    "name": "constant_keyword",
                    "count": 6,
                    "index_count": 2,
                    "script_count": 0
                },
                {
                    "name": "date",
                    "count": 6273,
                    "index_count": 376,
                    "script_count": 0
                },
                {
                    "name": "double",
                    "count": 1029,
                    "index_count": 342,
                    "script_count": 0
                },
                {
                    "name": "float",
                    "count": 70,
                    "index_count": 15,
                    "script_count": 0
                },
                {
                    "name": "half_float",
                    "count": 50,
                    "index_count": 12,
                    "script_count": 0
                },
                {
                    "name": "integer",
                    "count": 2200,
                    "index_count": 351,
                    "script_count": 0
                },
                {
                    "name": "ip",
                    "count": 2,
                    "index_count": 2,
                    "script_count": 0
                },
                {
                    "name": "join",
                    "count": 342,
                    "index_count": 342,
                    "script_count": 0
                },
                {
                    "name": "keyword",
                    "count": 17945,
                    "index_count": 377,
                    "script_count": 0
                },
                {
                    "name": "long",
                    "count": 2680,
                    "index_count": 368,
                    "script_count": 0
                },
                {
                    "name": "nested",
                    "count": 1062,
                    "index_count": 359,
                    "script_count": 0
                },
                {
                    "name": "object",
                    "count": 921,
                    "index_count": 35,
                    "script_count": 0
                },
                {
                    "name": "scaled_float",
                    "count": 340,
                    "index_count": 340,
                    "script_count": 0
                },
                {
                    "name": "text",
                    "count": 545,
                    "index_count": 269,
                    "script_count": 0
                },
                {
                    "name": "version",
                    "count": 10,
                    "index_count": 10,
                    "script_count": 0
                }
            ],
            "runtime_field_types": []
        },
        "analysis": {
            "char_filter_types": [],
            "tokenizer_types": [],
            "filter_types": [],
            "analyzer_types": [
                {
                    "name": "keyword",
                    "count": 345,
                    "index_count": 345
                }
            ],
            "built_in_char_filters": [],
            "built_in_tokenizers": [],
            "built_in_filters": [],
            "built_in_analyzers": []
        },
        "versions": [
            {
                "version": "6.5.4",
                "index_count": 1,
                "primary_shard_count": 1,
                "total_primary_bytes": 61501
            },
            {
                "version": "6.8.7",
                "index_count": 8,
                "primary_shard_count": 116,
                "total_primary_bytes": 2189325034023
            },
            {
                "version": "6.8.9",
                "index_count": 93,
                "primary_shard_count": 590,
                "total_primary_bytes": 1131590486429
            },
            {
                "version": "6.8.10",
                "index_count": 173,
                "primary_shard_count": 1077,
                "total_primary_bytes": 2042697418122
            },
            {
                "version": "6.8.21",
                "index_count": 34,
                "primary_shard_count": 321,
                "total_primary_bytes": 1000801120268
            },
            {
                "version": "7.17.1",
                "index_count": 23,
                "primary_shard_count": 100,
                "total_primary_bytes": 338963465090
            },
            {
                "version": "7.17.5",
                "index_count": 59,
                "primary_shard_count": 471,
                "total_primary_bytes": 1572515521648
            }
        ]
    },
    "nodes": {
        "count": {
            "total": 160,
            "coordinating_only": 0,
            "data": 157,
            "data_cold": 157,
            "data_content": 157,
            "data_frozen": 157,
            "data_hot": 157,
            "data_warm": 157,
            "ingest": 157,
            "master": 3,
            "ml": 160,
            "remote_cluster_client": 160,
            "transform": 157,
            "voting_only": 0
        },
        "versions": [
            "7.17.5"
        ],
        "os": {
            "available_processors": 1280,
            "allocated_processors": 1280,
            "names": [
                {
                    "name": "Linux",
                    "count": 160
                }
            ],
            "pretty_names": [
                {
                    "pretty_name": "Ubuntu 20.04.5 LTS",
                    "count": 1
                },
                {
                    "pretty_name": "Ubuntu 20.04.4 LTS",
                    "count": 159
                }
            ],
            "architectures": [
                {
                    "arch": "amd64",
                    "count": 160
                }
            ],
            "mem": {
                "total_in_bytes": 5274458591232,
                "free_in_bytes": 91226087424,
                "used_in_bytes": 5183232503808,
                "free_percent": 2,
                "used_percent": 98
            }
        },
        "process": {
            "cpu": {
                "percent": 908
            },
            "open_file_descriptors": {
                "min": 4061,
                "max": 4935,
                "avg": 4778
            }
        },
        "jvm": {
            "max_uptime_in_millis": 2785802558,
            "versions": [
                {
                    "version": "18.0.1.1",
                    "vm_name": "OpenJDK 64-Bit Server VM",
                    "vm_version": "18.0.1.1+2-6",
                    "vm_vendor": "Oracle Corporation",
                    "bundled_jdk": true,
                    "using_bundled_jdk": true,
                    "count": 160
                }
            ],
            "mem": {
                "heap_used_in_bytes": 1066557260432,
                "heap_max_in_bytes": 2406255427584
            },
            "threads": 17558
        },
        "fs": {
            "total_in_bytes": 72495000170496,
            "free_in_bytes": 55334772748288,
            "available_in_bytes": 55334772748288
        },
        "plugins": [
            {
                "name": "repository-s3",
                "version": "7.17.5",
                "elasticsearch_version": "7.17.5",
                "java_version": "1.8",
                "description": "The S3 repository plugin adds S3 repositories",
                "classname": "org.elasticsearch.repositories.s3.S3RepositoryPlugin",
                "extended_plugins": [],
                "has_native_controller": false,
                "licensed": false,
                "type": "isolated"
            }
        ],
        "network_types": {
            "transport_types": {
                "security4": 160
            },
            "http_types": {
                "security4": 160
            }
        },
        "discovery_types": {
            "zen": 160
        },
        "packaging_types": [
            {
                "flavor": "default",
                "type": "docker",
                "count": 160
            }
        ],
        "ingest": {
            "number_of_pipelines": 3,
            "processor_stats": {
                "gsub": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 0
                },
                "rename": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 0
                },
                "script": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 0
                },
                "set": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 0
                }
            }
        }
    }
}

15 TB total. We're running a real-time search system with constant reads and writes (our traffic is ~50% writes). We join multiple data sets into single documents that are updated frequently, use a fairly complex mapping (lots of child and nested docs with scripted updates), and run expensive queries and aggregations, so we have to keep our shard sizes and data-per-node ratio pretty small. We aim for 4-10 GB per shard.
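If it's useful, this is roughly how we can spot-check shard sizes against that 4-10 GB target (host/port are placeholders):

# Largest shards across the cluster, with the node they live on
curl -s "localhost:9200/_cat/shards?v&h=index,shard,prirep,store,node&s=store:desc" | head -20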

What type of EBS storage?

Can you run iostat -x on a node that is having issues?

I believe they are SSD, but I need to double-check.

Linux 3.10.0-1160.80.1.el7.x86_64 (ta-cop-es-data-aws152-app.iad1.qprod.net) 	12/19/2022 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.35    0.00    0.92    0.14    0.00   94.58

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme1n1           0.00     0.68   33.29   20.16  1051.50   381.95    53.63     0.06    1.22    1.29    1.10   0.39   2.09
nvme0n1           0.00     0.06    0.78    6.08    25.25    43.45    20.03     0.01    0.92    0.89    0.93   0.66   0.46
dm-0              0.00     0.00   33.29   20.85  1051.50   381.95    52.95     0.10    1.75    1.36    2.38   0.24   1.32

If those are gp2 EBS volumes, as far as I recall they only get 3 IOPS per GB of provisioned size, so 30 GiB volumes would only have 90 IOPS each, which is very low and likely a limiting factor.
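One way to confirm the volume type and provisioned IOPS/throughput (a sketch; the volume ID is a placeholder and this assumes the AWS CLI is configured):

aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 \
  --query 'Volumes[].{Type:VolumeType,SizeGiB:Size,Iops:Iops,Throughput:Throughput}'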

We have gp3 volumes, which should have a baseline of 3,000 IOPS. Could limited IOPS cause these requests to run indefinitely?

Our entire cluster is serving ~4.5k indexing + search requests/s. I'm assuming our thread pool queues aren't filling up or rejecting requests, because only 1% of our requests are timing out and only a small percentage of those never complete.
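For what it's worth, queue depth and rejections per node can be spot-checked with something like this (host/port are placeholders):

# Per-node search/write thread pool activity, queues, and rejections
curl -s "localhost:9200/_cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected,completed&s=rejected:desc"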
