And another interesting data point:
The output of _cat/thread_pool?v&h=id,name,queue,rejected,completed (which does eventually return while the cluster is in this goofy state, though it takes a long time) looks like this; a filtered version of the same call is sketched just after the output:
id name queue rejected completed
KxB7zYzZSRyxkcknPWc5Pg analyze 0 0 0
KxB7zYzZSRyxkcknPWc5Pg ccr 0 0 0
KxB7zYzZSRyxkcknPWc5Pg fetch_shard_started 0 0 0
KxB7zYzZSRyxkcknPWc5Pg fetch_shard_store 0 0 0
KxB7zYzZSRyxkcknPWc5Pg flush 0 0 0
KxB7zYzZSRyxkcknPWc5Pg force_merge 0 0 0
KxB7zYzZSRyxkcknPWc5Pg generic 0 0 7796
KxB7zYzZSRyxkcknPWc5Pg get 0 0 0
KxB7zYzZSRyxkcknPWc5Pg listener 0 0 0
KxB7zYzZSRyxkcknPWc5Pg management 0 0 4955
KxB7zYzZSRyxkcknPWc5Pg ml_datafeed 0 0 0
KxB7zYzZSRyxkcknPWc5Pg ml_job_comms 0 0 0
KxB7zYzZSRyxkcknPWc5Pg ml_utility 0 0 845
KxB7zYzZSRyxkcknPWc5Pg refresh 0 0 0
KxB7zYzZSRyxkcknPWc5Pg rollup_indexing 0 0 0
KxB7zYzZSRyxkcknPWc5Pg search 0 0 0
KxB7zYzZSRyxkcknPWc5Pg search_throttled 0 0 0
KxB7zYzZSRyxkcknPWc5Pg searchable_snapshots_cache_fetch_async 0 0 0
KxB7zYzZSRyxkcknPWc5Pg searchable_snapshots_cache_prewarming 0 0 0
KxB7zYzZSRyxkcknPWc5Pg snapshot 0 0 0
KxB7zYzZSRyxkcknPWc5Pg system_read 0 0 126
KxB7zYzZSRyxkcknPWc5Pg system_write 0 0 0
KxB7zYzZSRyxkcknPWc5Pg transform_indexing 0 0 0
KxB7zYzZSRyxkcknPWc5Pg warmer 0 0 0
KxB7zYzZSRyxkcknPWc5Pg watcher 0 0 0
KxB7zYzZSRyxkcknPWc5Pg write 0 0 0
C1SP7qAUSCyFGWx8NpCd9w analyze 0 0 0
C1SP7qAUSCyFGWx8NpCd9w ccr 0 0 0
C1SP7qAUSCyFGWx8NpCd9w fetch_shard_started 0 0 0
C1SP7qAUSCyFGWx8NpCd9w fetch_shard_store 0 0 0
C1SP7qAUSCyFGWx8NpCd9w flush 0 0 0
C1SP7qAUSCyFGWx8NpCd9w force_merge 0 0 0
C1SP7qAUSCyFGWx8NpCd9w generic 0 0 1894
C1SP7qAUSCyFGWx8NpCd9w get 0 0 0
C1SP7qAUSCyFGWx8NpCd9w listener 0 0 0
C1SP7qAUSCyFGWx8NpCd9w management 0 0 2519
C1SP7qAUSCyFGWx8NpCd9w ml_datafeed 0 0 0
C1SP7qAUSCyFGWx8NpCd9w ml_job_comms 0 0 0
C1SP7qAUSCyFGWx8NpCd9w ml_utility 0 0 841
C1SP7qAUSCyFGWx8NpCd9w refresh 0 0 0
C1SP7qAUSCyFGWx8NpCd9w rollup_indexing 0 0 0
C1SP7qAUSCyFGWx8NpCd9w search 0 0 0
C1SP7qAUSCyFGWx8NpCd9w search_throttled 0 0 0
C1SP7qAUSCyFGWx8NpCd9w searchable_snapshots_cache_fetch_async 0 0 0
C1SP7qAUSCyFGWx8NpCd9w searchable_snapshots_cache_prewarming 0 0 0
C1SP7qAUSCyFGWx8NpCd9w snapshot 0 0 0
C1SP7qAUSCyFGWx8NpCd9w system_read 0 0 0
C1SP7qAUSCyFGWx8NpCd9w system_write 0 0 0
C1SP7qAUSCyFGWx8NpCd9w transform_indexing 0 0 0
C1SP7qAUSCyFGWx8NpCd9w warmer 0 0 0
C1SP7qAUSCyFGWx8NpCd9w watcher 0 0 0
C1SP7qAUSCyFGWx8NpCd9w write 0 0 0
Syx8kjYLTmOIgam0gsgdPA analyze 0 0 0
Syx8kjYLTmOIgam0gsgdPA ccr 0 0 0
Syx8kjYLTmOIgam0gsgdPA fetch_shard_started 0 0 0
Syx8kjYLTmOIgam0gsgdPA fetch_shard_store 0 0 0
Syx8kjYLTmOIgam0gsgdPA flush 0 0 0
Syx8kjYLTmOIgam0gsgdPA force_merge 0 0 0
Syx8kjYLTmOIgam0gsgdPA generic 0 0 1907
Syx8kjYLTmOIgam0gsgdPA get 0 0 0
Syx8kjYLTmOIgam0gsgdPA listener 0 0 0
Syx8kjYLTmOIgam0gsgdPA management 0 0 2520
Syx8kjYLTmOIgam0gsgdPA ml_datafeed 0 0 0
Syx8kjYLTmOIgam0gsgdPA ml_job_comms 0 0 0
Syx8kjYLTmOIgam0gsgdPA ml_utility 0 0 843
Syx8kjYLTmOIgam0gsgdPA refresh 0 0 0
Syx8kjYLTmOIgam0gsgdPA rollup_indexing 0 0 0
Syx8kjYLTmOIgam0gsgdPA search 0 0 0
Syx8kjYLTmOIgam0gsgdPA search_throttled 0 0 0
Syx8kjYLTmOIgam0gsgdPA searchable_snapshots_cache_fetch_async 0 0 0
Syx8kjYLTmOIgam0gsgdPA searchable_snapshots_cache_prewarming 0 0 0
Syx8kjYLTmOIgam0gsgdPA snapshot 0 0 0
Syx8kjYLTmOIgam0gsgdPA system_read 0 0 0
Syx8kjYLTmOIgam0gsgdPA system_write 0 0 0
Syx8kjYLTmOIgam0gsgdPA transform_indexing 0 0 0
Syx8kjYLTmOIgam0gsgdPA warmer 0 0 0
Syx8kjYLTmOIgam0gsgdPA watcher 0 0 0
Syx8kjYLTmOIgam0gsgdPA write 0 0 0
zMUxvgq5RL-K5xFP_svEFw analyze 0 0 0
zMUxvgq5RL-K5xFP_svEFw ccr 0 0 0
zMUxvgq5RL-K5xFP_svEFw fetch_shard_started 0 0 212
zMUxvgq5RL-K5xFP_svEFw fetch_shard_store 0 0 24
zMUxvgq5RL-K5xFP_svEFw flush 0 0 32
zMUxvgq5RL-K5xFP_svEFw force_merge 0 0 0
zMUxvgq5RL-K5xFP_svEFw generic 0 0 23453
zMUxvgq5RL-K5xFP_svEFw get 0 0 0
zMUxvgq5RL-K5xFP_svEFw listener 0 0 0
zMUxvgq5RL-K5xFP_svEFw management 0 0 5875
zMUxvgq5RL-K5xFP_svEFw ml_datafeed 0 0 0
zMUxvgq5RL-K5xFP_svEFw ml_job_comms 0 0 0
zMUxvgq5RL-K5xFP_svEFw ml_utility 0 0 622
zMUxvgq5RL-K5xFP_svEFw refresh 0 0 24187
zMUxvgq5RL-K5xFP_svEFw rollup_indexing 0 0 0
zMUxvgq5RL-K5xFP_svEFw search 0 0 0
... cutting some for post length ...
Lq-k-ysUSC-9aGDDs9nE7g transform_indexing 0 0 0
Lq-k-ysUSC-9aGDDs9nE7g warmer 0 0 0
Lq-k-ysUSC-9aGDDs9nE7g watcher 0 0 0
Lq-k-ysUSC-9aGDDs9nE7g write 0 0 0
GPhxT1vaSJe28koqMC1lrA analyze 0 0 0
GPhxT1vaSJe28koqMC1lrA ccr 0 0 0
GPhxT1vaSJe28koqMC1lrA fetch_shard_started 0 0 212
GPhxT1vaSJe28koqMC1lrA fetch_shard_store 0 0 24
GPhxT1vaSJe28koqMC1lrA flush 0 0 30
GPhxT1vaSJe28koqMC1lrA force_merge 0 0 0
GPhxT1vaSJe28koqMC1lrA generic 0 0 53414
GPhxT1vaSJe28koqMC1lrA get 0 0 0
GPhxT1vaSJe28koqMC1lrA listener 0 0 0
GPhxT1vaSJe28koqMC1lrA management 14442 0 23908
GPhxT1vaSJe28koqMC1lrA ml_datafeed 0 0 0
GPhxT1vaSJe28koqMC1lrA ml_job_comms 0 0 0
GPhxT1vaSJe28koqMC1lrA ml_utility 0 0 712
GPhxT1vaSJe28koqMC1lrA refresh 0 0 72130
GPhxT1vaSJe28koqMC1lrA rollup_indexing 0 0 0
GPhxT1vaSJe28koqMC1lrA search 0 0 0
GPhxT1vaSJe28koqMC1lrA search_throttled 0 0 0
GPhxT1vaSJe28koqMC1lrA searchable_snapshots_cache_fetch_async 0 0 0
GPhxT1vaSJe28koqMC1lrA searchable_snapshots_cache_prewarming 0 0 0
GPhxT1vaSJe28koqMC1lrA snapshot 0 0 0
GPhxT1vaSJe28koqMC1lrA system_read 0 0 177
GPhxT1vaSJe28koqMC1lrA system_write 0 0 69
GPhxT1vaSJe28koqMC1lrA transform_indexing 0 0 0
GPhxT1vaSJe28koqMC1lrA warmer 0 0 358
GPhxT1vaSJe28koqMC1lrA watcher 0 0 0
GPhxT1vaSJe28koqMC1lrA write 113 0 125151
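(For anyone following along, something like this should reproduce the dump above, plus a variant that polls just the management pool, assuming the default HTTP port 9200 on localhost and no auth; adjust host and credentials to taste.)

```bash
# Full dump, same columns as the output above
curl -s 'http://localhost:9200/_cat/thread_pool?v&h=id,name,queue,rejected,completed'

# Only the management pool, sorted so the backed-up node floats to the top
curl -s 'http://localhost:9200/_cat/thread_pool/management?v&h=node_name,name,active,queue,rejected,completed&s=queue:desc'
```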
Notably, every time I re-run the call, the management queue keeps increasing. In my case the master nodes are separate from the data nodes, so I'd expect management tasks not to get backed up like this; perhaps I've configured something wrong?
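(To double-check that the roles really are split the way I think they are, something along these lines should show which nodes are master-eligible versus data-only; node.role and master are standard _cat/nodes columns, same port/auth assumptions as above.)

```bash
# node.role shows the role letters per node; master marks the elected master with *
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,node.role,master,heap.percent,cpu'
```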
Today I've tried adjusting memory amounts, pinning my containers to CPU ranges so I know nothing overlaps there, adding a second data node, and a few other tweaks, but nothing has improved the situation: when the data nodes are actively indexing, their stats calls never respond. If I turn off Logstash, all calls return instantly, so something is definitely getting starved. Non-stats data requests still go through just fine.
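(To put numbers on "never respond", a check along these lines shows the contrast; the index pattern below is just a placeholder for one of our real data indices, and the port/auth assumptions are the same as above.)

```bash
# Node stats: hangs while indexing is running, so cap it client-side with a timeout
curl -s --max-time 30 -o /dev/null -w 'nodes stats: %{http_code} in %{time_total}s\n' \
  'http://localhost:9200/_nodes/stats'

# An ordinary search against a data index comes back immediately
curl -s --max-time 30 -o /dev/null -w 'search: %{http_code} in %{time_total}s\n' \
  'http://localhost:9200/my-index-*/_search?size=0'
```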
It actually sounds very similar to this issue; however, I don't see any completionStats requests showing up in our hot threads, or any management tasks at all, really (which makes sense if they're being starved outright).
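(For reference, these are the standard ways to look for that kind of activity, with the same port/auth assumptions as above; hot threads is where completionStats would show up if it were the culprit here.)

```bash
# Hot threads across all nodes; bump the thread count above the default of 3
curl -s 'http://localhost:9200/_nodes/hot_threads?threads=10'

# Running tasks grouped by parent, to see what the management pool is (or isn't) doing
curl -s 'http://localhost:9200/_tasks?detailed=true&group_by=parents'
```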