Search and bulk transport tasks stuck running for days on ES 7.17.5

elasticsearch version: 7.17.5
lucene_version: 8.11.1

Hi, we have two Elasticsearch clusters that have had degraded search and ingestion performance for the past two-plus weeks. Search/scroll/count/bulk client requests that usually take less than 2s are running for > 17 minutes. We found that both clusters had tasks running for days: Cluster A had thousands of tasks running for > 10 days, and Cluster B had tasks running for four days. We've scaled out both clusters (doubling the number of data nodes in Cluster B) but haven't seen any improvement. We only see improved performance after restarting the nodes with long-running tasks, but after a few hours long-running tasks start building up again and we eventually see degraded performance once more.
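For reference, this is roughly how the long-running tasks can be listed via the tasks API (host/port are placeholders; the jq step is just an optional client-side sort):

# In-flight search/bulk tasks and how long they've been running
curl -s "localhost:9200/_tasks?detailed=true&actions=*search*,*bulk*&pretty"

# Optional: top 5 longest-running tasks across all nodes (assumes jq is installed)
curl -s "localhost:9200/_tasks?detailed=true" \
  | jq '[.nodes[].tasks[] | {node, action, running_time_in_nanos}] | sort_by(-.running_time_in_nanos) | .[:5]'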

Most of the long-running tasks are concentrated on a small number of nodes (different nodes each time), but we've seen up to 15 nodes with tasks running for longer than an hour. The cluster isn't showing increased CPU or heap usage and isn't returning 503s or 429s.

We've noticed that our cluster data size decreases once these nodes have been restarted, so we suspect the cluster doesn't have enough resources to merge segments and clean up deleted documents. I haven't seen much correlation between specific shards and the long-running tasks, but the tasks tend to belong to our highest-traffic indices.
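In case it helps narrow things down, a quick way to see where deleted documents are piling up (host/port are placeholders):

# Indices with the most not-yet-merged-away deleted docs
curl -s "localhost:9200/_cat/indices?v&h=index,docs.count,docs.deleted,store.size&s=docs.deleted:desc" | head -20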

Cluster A:
AWS EC2 m5.2xlarge
data size: 14.8 TB

{
    "cluster_name": "",
    "status": "green",
    "timed_out": false,
    "number_of_nodes": 160,
    "number_of_data_nodes": 157,
    "active_primary_shards": 2442,
    "active_shards": 4952,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 0,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 0,
    "active_shards_percent_as_number": 100.0
}

Cluster B:
AWS EC2 r5.xlarge
data size: 3.33 TB

{
    "cluster_name": "",
    "status": "green",
    "timed_out": false,
    "number_of_nodes": 66,
    "number_of_data_nodes": 63,
    "active_primary_shards": 880,
    "active_shards": 1781,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 0,
    "delayed_unassigned_shards": 0,
    "number_of_pending_tasks": 0,
    "number_of_in_flight_fetch": 0,
    "task_max_waiting_in_queue_millis": 0,
    "active_shards_percent_as_number": 100.0
}

Do you know of any reasons why these tasks would be running for so long?

Any help would be greatly appreciated. Thanks in advance.

Elasticsearch is often limited by storage performance. What type of storage are you using? How much storage does each node have? Have you looked at storage performance, e.g. using iostat?

What does the hot threads API show on the nodes that are slow?

What is the full output of the cluster stats API?
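For reference, something along these lines would collect all three (host, port, and node names are placeholders):

# Hot threads on the nodes holding the long-running tasks
curl -s "localhost:9200/_nodes/<node-name>/hot_threads?threads=5"

# Full cluster stats
curl -s "localhost:9200/_cluster/stats?human&pretty"

# Extended per-device I/O statistics, sampled every 5 seconds
iostat -x 5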

Is this correct? You have 157 m5.2xlarge data nodes for 15 TB of data? Or is this per node?

Each node has 30 GiB of EBS storage and a 15 GB heap.

We haven't taken a close look at our storage performance, but this is the iostat output for one of the nodes:

Linux 3.10.0-1160.80.1.el7.x86_64 (data-152) 	12/19/2022 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.33    0.00    0.92    0.14    0.00   94.60

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
nvme1n1          53.04      1044.55       379.24  504915932  183318052
nvme0n1           6.84        24.73        43.46   11955215   21006882
dm-0             53.72      1044.55       379.24  504914849  183318052

Here's the hot_threads output for a few of the nodes with 3-day-old tasks:

::: {data-150}{}{}{}{}{}{ml.machine_memory=32727298048, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=15032385536, transform.node=true}
   Hot threads at 2022-12-19T17:32:28.977Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {data-207}{}{}{}{}{}{ml.machine_memory=33079619584, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=15032385536, transform.node=true}
   Hot threads at 2022-12-19T17:34:35.618Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

::: {data-152}{}{}{}{}{}{ml.machine_memory=32727298048, ml.max_open_jobs=512, xpack.installed=true, ml.max_jvm_size=15032385536, transform.node=true}
   Hot threads at 2022-12-19T17:35:56.294Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

Cluster stats output:

{
    "_nodes": {
        "total": 160,
        "successful": 160,
        "failed": 0
    },
    "cluster_name": "",
    "cluster_uuid": "",
    "timestamp": 1671471473934,
    "status": "green",
    "indices": {
        "count": 371,
        "shards": {
            "total": 4992,
            "primaries": 2462,
            "replication": 1.0276198212835093,
            "index": {
                "shards": {
                    "min": 2,
                    "max": 142,
                    "avg": 13.455525606469003
                },
                "primaries": {
                    "min": 1,
                    "max": 71,
                    "avg": 6.636118598382749
                },
                "replication": {
                    "min": 1.0,
                    "max": 68.0,
                    "avg": 1.1832884097035041
                }
            }
        },
        "docs": {
            "count": 107318112563,
            "deleted": 22390249222
        },
        "store": {
            "size_in_bytes": 16569470119330,
            "total_data_set_size_in_bytes": 16569470119330,
            "reserved_in_bytes": 0
        },
        "fielddata": {
            "memory_size_in_bytes": 9425147456,
            "evictions": 0
        },
        "query_cache": {
            "memory_size_in_bytes": 20508901766,
            "total_count": 12615585675,
            "hit_count": 233399010,
            "miss_count": 12382186665,
            "cache_size": 4791806,
            "cache_count": 6632333,
            "evictions": 1840527
        },
        "completion": {
            "size_in_bytes": 0
        },
        "segments": {
            "count": 96858,
            "memory_in_bytes": 1410578530,
            "terms_memory_in_bytes": 462039416,
            "stored_fields_memory_in_bytes": 649930480,
            "term_vectors_memory_in_bytes": 0,
            "norms_memory_in_bytes": 4330624,
            "points_memory_in_bytes": 0,
            "doc_values_memory_in_bytes": 294278010,
            "index_writer_memory_in_bytes": 739513238,
            "version_map_memory_in_bytes": 15387294,
            "fixed_bit_set_memory_in_bytes": 32153610840,
            "max_unsafe_auto_id_timestamp": 1671408000479,
            "file_sizes": {}
        },
        "mappings": {
            "field_types": [
                {
                    "name": "boolean",
                    "count": 1245,
                    "index_count": 359,
                    "script_count": 0
                },
                {
                    "name": "constant_keyword",
                    "count": 6,
                    "index_count": 2,
                    "script_count": 0
                },
                {
                    "name": "date",
                    "count": 6273,
                    "index_count": 376,
                    "script_count": 0
                },
                {
                    "name": "double",
                    "count": 1029,
                    "index_count": 342,
                    "script_count": 0
                },
                {
                    "name": "float",
                    "count": 70,
                    "index_count": 15,
                    "script_count": 0
                },
                {
                    "name": "half_float",
                    "count": 50,
                    "index_count": 12,
                    "script_count": 0
                },
                {
                    "name": "integer",
                    "count": 2200,
                    "index_count": 351,
                    "script_count": 0
                },
                {
                    "name": "ip",
                    "count": 2,
                    "index_count": 2,
                    "script_count": 0
                },
                {
                    "name": "join",
                    "count": 342,
                    "index_count": 342,
                    "script_count": 0
                },
                {
                    "name": "keyword",
                    "count": 17945,
                    "index_count": 377,
                    "script_count": 0
                },
                {
                    "name": "long",
                    "count": 2680,
                    "index_count": 368,
                    "script_count": 0
                },
                {
                    "name": "nested",
                    "count": 1062,
                    "index_count": 359,
                    "script_count": 0
                },
                {
                    "name": "object",
                    "count": 921,
                    "index_count": 35,
                    "script_count": 0
                },
                {
                    "name": "scaled_float",
                    "count": 340,
                    "index_count": 340,
                    "script_count": 0
                },
                {
                    "name": "text",
                    "count": 545,
                    "index_count": 269,
                    "script_count": 0
                },
                {
                    "name": "version",
                    "count": 10,
                    "index_count": 10,
                    "script_count": 0
                }
            ],
            "runtime_field_types": []
        },
        "analysis": {
            "char_filter_types": [],
            "tokenizer_types": [],
            "filter_types": [],
            "analyzer_types": [
                {
                    "name": "keyword",
                    "count": 345,
                    "index_count": 345
                }
            ],
            "built_in_char_filters": [],
            "built_in_tokenizers": [],
            "built_in_filters": [],
            "built_in_analyzers": []
        },
        "versions": [
            {
                "version": "6.5.4",
                "index_count": 1,
                "primary_shard_count": 1,
                "total_primary_bytes": 61501
            },
            {
                "version": "6.8.7",
                "index_count": 8,
                "primary_shard_count": 116,
                "total_primary_bytes": 2189325034023
            },
            {
                "version": "6.8.9",
                "index_count": 93,
                "primary_shard_count": 590,
                "total_primary_bytes": 1131590486429
            },
            {
                "version": "6.8.10",
                "index_count": 173,
                "primary_shard_count": 1077,
                "total_primary_bytes": 2042697418122
            },
            {
                "version": "6.8.21",
                "index_count": 34,
                "primary_shard_count": 321,
                "total_primary_bytes": 1000801120268
            },
            {
                "version": "7.17.1",
                "index_count": 23,
                "primary_shard_count": 100,
                "total_primary_bytes": 338963465090
            },
            {
                "version": "7.17.5",
                "index_count": 59,
                "primary_shard_count": 471,
                "total_primary_bytes": 1572515521648
            }
        ]
    },
    "nodes": {
        "count": {
            "total": 160,
            "coordinating_only": 0,
            "data": 157,
            "data_cold": 157,
            "data_content": 157,
            "data_frozen": 157,
            "data_hot": 157,
            "data_warm": 157,
            "ingest": 157,
            "master": 3,
            "ml": 160,
            "remote_cluster_client": 160,
            "transform": 157,
            "voting_only": 0
        },
        "versions": [
            "7.17.5"
        ],
        "os": {
            "available_processors": 1280,
            "allocated_processors": 1280,
            "names": [
                {
                    "name": "Linux",
                    "count": 160
                }
            ],
            "pretty_names": [
                {
                    "pretty_name": "Ubuntu 20.04.5 LTS",
                    "count": 1
                },
                {
                    "pretty_name": "Ubuntu 20.04.4 LTS",
                    "count": 159
                }
            ],
            "architectures": [
                {
                    "arch": "amd64",
                    "count": 160
                }
            ],
            "mem": {
                "total_in_bytes": 5274458591232,
                "free_in_bytes": 91226087424,
                "used_in_bytes": 5183232503808,
                "free_percent": 2,
                "used_percent": 98
            }
        },
        "process": {
            "cpu": {
                "percent": 908
            },
            "open_file_descriptors": {
                "min": 4061,
                "max": 4935,
                "avg": 4778
            }
        },
        "jvm": {
            "max_uptime_in_millis": 2785802558,
            "versions": [
                {
                    "version": "18.0.1.1",
                    "vm_name": "OpenJDK 64-Bit Server VM",
                    "vm_version": "18.0.1.1+2-6",
                    "vm_vendor": "Oracle Corporation",
                    "bundled_jdk": true,
                    "using_bundled_jdk": true,
                    "count": 160
                }
            ],
            "mem": {
                "heap_used_in_bytes": 1066557260432,
                "heap_max_in_bytes": 2406255427584
            },
            "threads": 17558
        },
        "fs": {
            "total_in_bytes": 72495000170496,
            "free_in_bytes": 55334772748288,
            "available_in_bytes": 55334772748288
        },
        "plugins": [
            {
                "name": "repository-s3",
                "version": "7.17.5",
                "elasticsearch_version": "7.17.5",
                "java_version": "1.8",
                "description": "The S3 repository plugin adds S3 repositories",
                "classname": "org.elasticsearch.repositories.s3.S3RepositoryPlugin",
                "extended_plugins": [],
                "has_native_controller": false,
                "licensed": false,
                "type": "isolated"
            }
        ],
        "network_types": {
            "transport_types": {
                "security4": 160
            },
            "http_types": {
                "security4": 160
            }
        },
        "discovery_types": {
            "zen": 160
        },
        "packaging_types": [
            {
                "flavor": "default",
                "type": "docker",
                "count": 160
            }
        ],
        "ingest": {
            "number_of_pipelines": 3,
            "processor_stats": {
                "gsub": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 0
                },
                "rename": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 0
                },
                "script": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 0
                },
                "set": {
                    "count": 0,
                    "failed": 0,
                    "current": 0,
                    "time_in_millis": 0
                }
            }
        }
    }
}

15 TB total. We're running a real-time search system with constant reads and writes (our traffic is ~50% writes). We join multiple data sets into single documents that are updated frequently, use a fairly complex mapping (lots of child and nested docs with scripted updates), and run expensive queries and aggregations, so we have to keep our shard sizes and data-per-node ratio pretty small. We aim for 4-10 GB per shard.
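If it's useful, this is roughly how we can spot-check shard sizes against that 4-10 GB target (host/port are placeholders):

# Largest shards across the cluster, with the node they live on
curl -s "localhost:9200/_cat/shards?v&h=index,shard,prirep,store,node&s=store:desc" | head -20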

What type of EBS storage?

Can you run iostat -x on a node that is having issues?

I believe they are SSD, but I need to double-check.

Linux 3.10.0-1160.80.1.el7.x86_64 (ta-cop-es-data-aws152-app.iad1.qprod.net) 	12/19/2022 	_x86_64_	(8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.35    0.00    0.92    0.14    0.00   94.58

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme1n1           0.00     0.68   33.29   20.16  1051.50   381.95    53.63     0.06    1.22    1.29    1.10   0.39   2.09
nvme0n1           0.00     0.06    0.78    6.08    25.25    43.45    20.03     0.01    0.92    0.89    0.93   0.66   0.46
dm-0              0.00     0.00   33.29   20.85  1051.50   381.95    52.95     0.10    1.75    1.36    2.38   0.24   1.32

If those are gp2 EBS volumes, as far as I recall they only get 3 IOPS per GB of provisioned size, so 30 GiB volumes would only have 90 IOPS each, which is very low and likely a limiting factor.
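One way to confirm the volume type and provisioned IOPS/throughput (a sketch; the volume ID is a placeholder and this assumes the AWS CLI is configured):

aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 \
  --query 'Volumes[].{Type:VolumeType,SizeGiB:Size,Iops:Iops,Throughput:Throughput}'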

We have gp3 volumes, which should have a baseline of 3,000 IOPS. Could limited IOPS cause these requests to run indefinitely?

Our entire cluster is serving ~4.5k indexing + search requests/s. I'm assuming our thread pool queues aren't filling up or rejecting requests, because only 1% of our requests are timing out and only a small percentage of those never complete.
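For what it's worth, queue depth and rejections per node can be spot-checked with something like this (host/port are placeholders):

# Per-node search/write thread pool activity, queues, and rejections
curl -s "localhost:9200/_cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected,completed&s=rejected:desc"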
