We recently saw in one of our clusters that the indexing rate dropped (almost to zero) due to high garbage collection on a few nodes; these nodes account for only 1-2% of the whole cluster. We have ~200 data nodes in the cluster, with separate master & client nodes.
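For context, this is roughly how we spotted the GC-heavy nodes (the client-node hostname is a placeholder, and the 85% heap cut-off is just an arbitrary flag, not a recommendation):

```
import requests

ES = "http://client-node:9200"  # placeholder for one of our client nodes

# Per-node JVM stats: heap usage and old-gen GC counters.
stats = requests.get(ES + "/_nodes/stats/jvm").json()
for node_id, node in stats["nodes"].items():
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    old_gc = node["jvm"]["gc"]["collectors"]["old"]
    if heap_pct > 85:  # arbitrary cut-off for "GC-heavy" nodes
        print(node["name"], heap_pct,
              old_gc["collection_count"], old_gc["collection_time_in_millis"])
```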
I don't think all the bulk indexing requests were directed at these 2-3 nodes only. We use bulk indexing with a batch size of ~5 MB, with the bulk queue size set to -1 and sniffing enabled. During this event search traffic was initially high, but it eventually dropped to almost nothing while the cluster still had high GC. Hours after the search requests had dropped off, I could still see slow logs for indexing and query. The odd part is that the number of search queries at this point was very low, yet the number of slow-log entries printed for fetch_query was high compared to that number. Are these queued requests?
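For reference, our indexer is set up roughly like this (hosts and index name are placeholders and the real code is more involved), using the elasticsearch-py bulk helper with sniffing enabled and ~5 MB batches; the queue size of -1 is set on the nodes themselves, not in the client:

```
from elasticsearch import Elasticsearch, helpers

# Placeholder hosts; sniffing discovers the rest of the ~200 data nodes.
es = Elasticsearch(
    ["http://client-node-1:9200", "http://client-node-2:9200"],
    sniff_on_start=True,
    sniff_on_connection_fail=True,
)

docs = [{"message": "example event %d" % i} for i in range(10000)]

def actions():
    for doc in docs:
        yield {"_index": "events", "_source": doc}  # placeholder index

# Batches capped at roughly 5 MB, matching our production batch size.
# The unbounded bulk queue (queue_size: -1) is configured in elasticsearch.yml.
helpers.bulk(es, actions(), max_chunk_bytes=5 * 1024 * 1024)
```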
Is there any age threshold configured for indexing and search requests that are already queued, i.e., something like: if a request has been sitting in the queue for more than x seconds, it is discarded instead of executed? And if a node has an overwhelming number of search and indexing requests in its queues, which will be given priority?
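In case it matters, this is roughly how I've been watching the per-node queues while this happens (host is a placeholder); the active/queue/rejected counts for the bulk and search pools are what I'm looking at:

```
import requests

ES = "http://client-node:9200"  # placeholder

# Shows active/queue/rejected counts for the thread pools on every node.
resp = requests.get(ES + "/_cat/thread_pool", params={"v": "true"})
print(resp.text)
```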
How can such a small percentage of slow hosts impact the performance of the whole cluster? Is there something that makes a large percentage of hosts wait for the slow hosts to respond, such as some cluster-wide activity? I searched online for a while and found some reports of the same observation, but none of them gave a reason for it.
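If it helps narrow things down, this is roughly what I've been checking to look for that kind of cluster-wide activity (host is a placeholder):

```
import requests

ES = "http://client-node:9200"  # placeholder

# Cluster-state tasks (mapping updates, shard allocations, ...) queued on the master.
pending = requests.get(ES + "/_cluster/pending_tasks").json()
print(len(pending.get("tasks", [])), "pending cluster tasks")

# Stack samples of the busiest threads on each node, to see what they are waiting on.
hot = requests.get(ES + "/_nodes/hot_threads", params={"threads": "3"})
print(hot.text)
```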