We have a 184-node Elasticsearch 6.4 cluster, with each node having 16 cores and 64 GB of memory. We have around 270 shards in total, holding 17 billion indexed documents (including replicas). We have 3 replicas per shard.
The issue is that when we run ETL processes on other databases to update the Elasticsearch cluster, we tend to struggle a lot with search latency and response times for our site users. When that happens, only 5-10% of the nodes are CPU-constrained (CPU usage + iowait), while the rest stay calm.
My question is: with so many cores per machine, should we increase the number of shards per machine? We only have 3 indices in the cluster, and the largest one holds more than 85% of the total documents.
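To put rough numbers on the shards-per-machine question, here is a minimal Python sketch of the arithmetic. It assumes the 270 figure already includes replicas and that shards are balanced evenly across all 184 nodes; neither has been verified against the actual allocation, and `shard_spread` is just an illustrative helper name.

```python
# Rough arithmetic only. Node, shard and core counts come from the post;
# the even balance across nodes is an assumption, not a measurement.
NODES = 184
TOTAL_SHARDS = 270      # assumed to include replicas
CORES_PER_NODE = 16


def shard_spread(total_shards: int, nodes: int) -> dict:
    """Return how shards would be spread if balanced as evenly as possible."""
    base, extra = divmod(total_shards, nodes)
    return {
        "min_shards_per_node": base,
        "max_shards_per_node": base + (1 if extra else 0),
        "nodes_with_extra_shard": extra,
        "avg_shards_per_node": round(total_shards / nodes, 2),
    }


spread = shard_spread(TOTAL_SHARDS, NODES)
print(spread)
print("cores per node:", CORES_PER_NODE)
```

Under those assumptions each node would hold only one or two shards against its 16 cores, which is what prompts the question.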
The concern is that problems on a small number of nodes affect the whole site; for instance, some weeks ago a single node going down almost took our site down.
What suggestions do you have for us?