I recently "inherited" an ES cluster at work. It ran fine (at least for as long as I have owned it), but lately it has had severe performance problems around indexing.

We have 20 "processor" containers in our pipeline, each of which indexes data to the cluster. Previously each container could index at least 50 docs per second (all single requests, no bulk API). The cluster was originally sized at 8 nodes. Over time we saw performance degrade dramatically (<= 2 docs per second).

I then realized the cluster was managing ~24000 shards! The original developers made some poor choices in their indexing strategy, and many of the shards are tiny, on the order of MBs. Reindexing was painfully slow at the time, so my plan was to scale out until the cluster was under the recommended 20 shards per GB of heap (each data node has ~30GB of heap), which would give me headroom to reindex. I scaled out to 48 nodes total: ~24000 / 48 = ~500 shards per node. It made very little difference to my indexing speed (<= 2-3 docs per second). Reindexing got better but still felt slow.

The node logs show a bunch of transport exceptions (I am not sure what these are about) and frequent GC. I find the GC situation odd because the heap is sized according to the recommendations (half of RAM and no more than 30GB). Below is a log snippet from one node that is fairly "active"; other nodes just have a ton of GC messages in their logs (they seem idle?). Here is also a link to my current node stats.

I am at a total loss as to what to try next or even where to look. Any help would be greatly appreciated.
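For reference, the sizing arithmetic above can be checked with a short sketch. It uses only the approximate figures from this post (~24000 shards, ~30GB heap per data node, the 20-shards-per-GB guideline) and assumes, as my division above does, that shards are spread evenly and that every node counted is a data node. The function names are just illustrative.

```python
import math

# Approximate figures quoted in the post.
TOTAL_SHARDS = 24_000
HEAP_GB_PER_NODE = 30
MAX_SHARDS_PER_GB_HEAP = 20  # the rule-of-thumb guideline cited in the post


def shards_per_node(total_shards: int, nodes: int) -> float:
    """Average shards per node, assuming an even spread across nodes."""
    return total_shards / nodes


def shard_budget_per_node(heap_gb: int, per_gb: int = MAX_SHARDS_PER_GB_HEAP) -> int:
    """Maximum shards per node allowed by the shards-per-GB-of-heap guideline."""
    return heap_gb * per_gb


def min_nodes_for_budget(total_shards: int, heap_gb: int) -> int:
    """Smallest node count that keeps every node within the guideline."""
    return math.ceil(total_shards / shard_budget_per_node(heap_gb))


if __name__ == "__main__":
    budget = shard_budget_per_node(HEAP_GB_PER_NODE)
    for nodes in (8, 48):
        avg = shards_per_node(TOTAL_SHARDS, nodes)
        print(f"{nodes} nodes: ~{avg:.0f} shards/node, "
              f"budget {budget}, within budget: {avg <= budget}")
    print("minimum nodes to meet the guideline:",
          min_nodes_for_budget(TOTAL_SHARDS, HEAP_GB_PER_NODE))
```

With these numbers the original 8 nodes carried about 3000 shards each against a budget of 600, and 48 nodes brings the average to about 500, so the cluster is now only just inside the guideline.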
[2020-12-12T19:58:37,845][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][86198] overhead, spent [339ms] collecting in the last [1s]
[2020-12-12T20:02:52,132][INFO ][o.e.c.s.ClusterApplierService] [elasticsearch-elasticsearch-data-1b-4] added {{elasticsearch-elasticsearch-data-1a-1}{b0RKcobaS7SeTIYPv2xZZQ}{bfXYE8QMTDC6CRwidyEWdg}{100.103.95.130}{100.103.95.130:9300}{xpack.installed=true},}, reason: apply cluster state (from master [master {elasticsearch-elasticsearch-master-2}{9Us1bbUnQWuak-hZ_Cq19w}{Zldki3ouSmGNcSpf0zUU_Q}{100.111.227.67}{100.111.227.67:9300}{xpack.installed=true} committed version [5198]])
[2020-12-12T20:38:30,253][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][88585] overhead, spent [312ms] collecting in the last [1s]
[2020-12-12T20:38:31,677][INFO ][o.e.c.s.ClusterApplierService] [elasticsearch-elasticsearch-data-1b-4] removed {{elasticsearch-elasticsearch-data-1b-13}{_GWTqqXVRnmKr9r5fE2-zA}{I4Zdj1yfThGV7MeSh0_ahg}{100.118.195.67}{100.118.195.67:9300}{xpack.installed=true},}, reason: apply cluster state (from master [master {elasticsearch-elasticsearch-master-2}{9Us1bbUnQWuak-hZ_Cq19w}{Zldki3ouSmGNcSpf0zUU_Q}{100.111.227.67}{100.111.227.67:9300}{xpack.installed=true} committed version [5747]])
[2020-12-12T20:38:31,720][INFO ][o.e.i.s.IndexShard ] [elasticsearch-elasticsearch-data-1b-4] [78827-ca-cosumnes_community_services_district_fire_department-apparatus-fire-incident-2018-10-es6apparatusfireincident][1] primary-replica resync completed with 0 operations
[2020-12-12T20:39:35,412][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][88650] overhead, spent [319ms] collecting in the last [1s]
[2020-12-12T20:39:41,508][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][88656] overhead, spent [349ms] collecting in the last [1s]
[2020-12-12T20:47:06,548][INFO ][o.e.c.s.ClusterApplierService] [elasticsearch-elasticsearch-data-1b-4] added {{elasticsearch-elasticsearch-data-1b-13}{_GWTqqXVRnmKr9r5fE2-zA}{1HjoV0UoS0Wr0x1EJM8L9w}{100.106.163.66}{100.106.163.66:9300}{xpack.installed=true},}, reason: apply cluster state (from master [master {elasticsearch-elasticsearch-master-2}{9Us1bbUnQWuak-hZ_Cq19w}{Zldki3ouSmGNcSpf0zUU_Q}{100.111.227.67}{100.111.227.67:9300}{xpack.installed=true} committed version [5766]])
[2020-12-12T21:42:32,764][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][92422] overhead, spent [313ms] collecting in the last [1.1s]
[2020-12-12T21:42:34,495][INFO ][o.e.c.s.ClusterApplierService] [elasticsearch-elasticsearch-data-1b-4] removed {{elasticsearch-elasticsearch-data-1b-12}{rZx5hg0RQ36XIc1K0PhYDg}{IpRLByWFTSyU2hAWkPwwgg}{100.105.197.132}{100.105.197.132:9300}{xpack.installed=true},}, reason: apply cluster state (from master [master {elasticsearch-elasticsearch-master-2}{9Us1bbUnQWuak-hZ_Cq19w}{Zldki3ouSmGNcSpf0zUU_Q}{100.111.227.67}{100.111.227.67:9300}{xpack.installed=true} committed version [6264]])
[2020-12-12T21:43:04,028][WARN ][o.e.c.NodeConnectionsService] [elasticsearch-elasticsearch-data-1b-4] failed to connect to node {elasticsearch-elasticsearch-data-1b-12}{rZx5hg0RQ36XIc1K0PhYDg}{IpRLByWFTSyU2hAWkPwwgg}{100.105.197.132}{100.105.197.132:9300}{xpack.installed=true} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [elasticsearch-elasticsearch-data-1b-12][100.105.197.132:9300] connect_timeout[30s]
at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:163) ~[elasticsearch-6.4.1.jar:6.4.1]
at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:643) ~[elasticsearch-6.4.1.jar:6.4.1]
at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:542) ~[elasticsearch-6.4.1.jar:6.4.1]
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:329) ~[elasticsearch-6.4.1.jar:6.4.1]
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:316) ~[elasticsearch-6.4.1.jar:6.4.1]
at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:153) [elasticsearch-6.4.1.jar:6.4.1]
at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:180) [elasticsearch-6.4.1.jar:6.4.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [elasticsearch-6.4.1.jar:6.4.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.4.1.jar:6.4.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
at java.lang.Thread.run(Thread.java:844) [?:?]
[2020-12-12T21:43:40,972][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][92490] overhead, spent [349ms] collecting in the last [1s]
[2020-12-12T21:43:45,060][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][92494] overhead, spent [353ms] collecting in the last [1s]
[2020-12-12T21:49:54,693][INFO ][o.e.c.s.ClusterApplierService] [elasticsearch-elasticsearch-data-1b-4] added {{elasticsearch-elasticsearch-data-1b-12}{rZx5hg0RQ36XIc1K0PhYDg}{lJBy6aOPTu66j6br-Y2pMA}{100.106.181.2}{100.106.181.2:9300}{xpack.installed=true},}, reason: apply cluster state (from master [master {elasticsearch-elasticsearch-master-2}{9Us1bbUnQWuak-hZ_Cq19w}{Zldki3ouSmGNcSpf0zUU_Q}{100.111.227.67}{100.111.227.67:9300}{xpack.installed=true} committed version [6284]])
[2020-12-12T21:59:58,904][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][93466] overhead, spent [301ms] collecting in the last [1s]
[2020-12-12T22:00:01,905][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][93469] overhead, spent [333ms] collecting in the last [1s]
[2020-12-12T22:06:45,539][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][93872] overhead, spent [334ms] collecting in the last [1s]
[2020-12-12T22:06:46,539][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][93873] overhead, spent [331ms] collecting in the last [1s]
[2020-12-12T22:06:48,610][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][93875] overhead, spent [309ms] collecting in the last [1s]
[2020-12-12T22:06:52,612][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][93879] overhead, spent [302ms] collecting in the last [1s]
[2020-12-12T22:09:57,714][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][94064] overhead, spent [333ms] collecting in the last [1s]
[2020-12-12T22:13:56,347][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][94302] overhead, spent [319ms] collecting in the last [1.1s]
[2020-12-12T22:13:58,348][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][94304] overhead, spent [288ms] collecting in the last [1s]
[2020-12-12T22:14:57,393][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][94363] overhead, spent [340ms] collecting in the last [1s]
[2020-12-12T22:15:00,510][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][94366] overhead, spent [325ms] collecting in the last [1s]
[2020-12-12T22:15:01,510][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][94367] overhead, spent [337ms] collecting in the last [1s]
[2020-12-12T22:27:56,060][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][95140] overhead, spent [320ms] collecting in the last [1s]
[2020-12-12T22:27:57,061][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][95141] overhead, spent [317ms] collecting in the last [1s]
[2020-12-12T22:41:39,556][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][95963] overhead, spent [316ms] collecting in the last [1s]
[2020-12-12T22:41:41,556][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][95965] overhead, spent [309ms] collecting in the last [1s]
[2020-12-12T22:41:42,557][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][95966] overhead, spent [311ms] collecting in the last [1s]
[2020-12-12T22:44:55,637][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][96159] overhead, spent [325ms] collecting in the last [1s]
[2020-12-12T22:47:46,693][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][96330] overhead, spent [318ms] collecting in the last [1s]
[2020-12-12T22:47:47,693][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][96331] overhead, spent [329ms] collecting in the last [1s]
[2020-12-12T22:58:04,072][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][96947] overhead, spent [325ms] collecting in the last [1s]
[2020-12-12T23:11:34,539][INFO ][o.e.m.j.JvmGcMonitorService] [elasticsearch-elasticsearch-data-1b-4] [gc][97757] overhead, spent [332ms] collecting in the last [1s]
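To put a number on the GC messages above, here is a minimal sketch that extracts the "spent [X] collecting in the last [Y]" pair from each `JvmGcMonitorService` line and turns it into a fraction of wall-clock time. The regular expression is my own guess based on the lines in this snippet and only handles `ms` and `s` units; the two sample lines are copied from the log above.

```python
import re

# Pattern inferred from the log lines in this post; handles only "ms" and "s" units.
GC_LINE = re.compile(
    r"\[gc\]\[(?P<seq>\d+)\] overhead, "
    r"spent \[(?P<spent>[\d.]+)(?P<spent_unit>ms|s)\] "
    r"collecting in the last \[(?P<window>[\d.]+)(?P<window_unit>ms|s)\]"
)


def to_ms(value: str, unit: str) -> float:
    """Convert a logged duration such as ('1.1', 's') to milliseconds."""
    return float(value) * (1000.0 if unit == "s" else 1.0)


def gc_overheads(lines):
    """Yield (gc sequence number, fraction of the window spent in GC)."""
    for line in lines:
        m = GC_LINE.search(line)
        if m:
            spent = to_ms(m["spent"], m["spent_unit"])
            window = to_ms(m["window"], m["window_unit"])
            yield int(m["seq"]), spent / window


SAMPLE = [
    "[2020-12-12T19:58:37,845][INFO ][o.e.m.j.JvmGcMonitorService] "
    "[elasticsearch-elasticsearch-data-1b-4] [gc][86198] overhead, "
    "spent [339ms] collecting in the last [1s]",
    "[2020-12-12T21:42:32,764][INFO ][o.e.m.j.JvmGcMonitorService] "
    "[elasticsearch-elasticsearch-data-1b-4] [gc][92422] overhead, "
    "spent [313ms] collecting in the last [1.1s]",
]

if __name__ == "__main__":
    for seq, fraction in gc_overheads(SAMPLE):
        print(f"gc #{seq}: {fraction:.1%} of the window spent collecting")
```

Every GC line in the snippet falls in the same range: roughly 290 to 350 ms of collection per 1 to 1.1 second window, so about a third of each logged window on this node went to GC.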