For the last few weeks I have had performance problems with write queries (the load is higher than it used to be). I don't think there is anything left to optimize on the application side: all queries are sent to _bulk with about 300 operations per request. Most of them are updates, and most of those use update scripts written in Painless (the logic is non-trivial). The average duration of one bulk request is about 8-12 s, which is terrible. It should be around 1 s (and it used to be).
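To make the request shape concrete, here is a minimal sketch of how one of these bulk bodies is put together. The index name, document ID, and script are made-up placeholders (the real scripts are more involved), and `build_bulk_body` is just an illustrative helper, not our actual code.

```python
import json

def build_bulk_body(operations):
    """Serialize scripted updates into the NDJSON body expected by _bulk."""
    lines = []
    for op in operations:
        # Action line: which document to update.
        lines.append(json.dumps({"update": {"_index": op["index"], "_id": op["id"]}}))
        # Payload line: the Painless script and its parameters.
        lines.append(json.dumps({
            "script": {
                "lang": "painless",
                "source": op["script"],
                "params": op.get("params", {}),
            }
        }))
    # The bulk API requires a newline after the last line as well.
    return "\n".join(lines) + "\n"

# Hypothetical operation; a real request carries about 300 of these.
example_ops = [
    {
        "index": "orders",
        "id": "42",
        "script": "ctx._source.counter += params.delta",
        "params": {"delta": 1},
    }
]
body = build_bulk_body(example_ops)
```

Each update takes two lines in the body, so a 300-operation request is about 600 lines of NDJSON.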
All requests go through RabbitMQ, so I can collect statistics on the requests, throttle the rate, and so on.
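As a rough illustration of what I mean by stats and throttling, a worker could keep a rolling window of bulk durations and pause when the average drifts above a target. This is only a sketch: `BulkTimer`, the window size, the 1 s target, and the 5 s maximum pause are all assumptions chosen for the example.

```python
from collections import deque

class BulkTimer:
    """Keeps the durations of the most recent bulk requests."""

    def __init__(self, window=100):
        # Only the last `window` durations are retained.
        self.durations = deque(maxlen=window)

    def record(self, seconds):
        self.durations.append(seconds)

    def average(self):
        return sum(self.durations) / len(self.durations) if self.durations else 0.0

    def pause_before_next(self, target_seconds=1.0, max_pause=5.0):
        """Back off by the amount the average exceeds the target, capped at max_pause."""
        excess = self.average() - target_seconds
        return min(max(excess, 0.0), max_pause)

timer = BulkTimer()
timer.record(8.0)
timer.record(12.0)
pause = timer.pause_before_next()
```

The worker would call `record` after every bulk response and sleep for `pause_before_next()` seconds before consuming the next message.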
Hardware: we had 3 nodes with enough disk space, CPU and RAM (they are virtual servers, but with fast storage). Everything looks healthy. We tried adding a fourth node, but it had no effect on performance. By the way, each node has all roles (master, data, ingest, ...).
Do you have any suggestions on which metrics to watch and how to solve this problem?
I have one idea, but I'm not sure whether it would help. I could divide the indices into 2 groups that receive a roughly equal number of writes, then allocate each group to a different set of nodes (2+2 nodes, each set holding only one group of indices). Then I would update the workers (which listen on the RabbitMQ queues) to send requests only to the appropriate nodes of the cluster.
Splitting into 2 clusters is not possible, because some read requests need data from both groups of indices (read requests are fast enough).
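The index-grouping idea could be sketched roughly like this. It assumes each node is tagged with a custom attribute in elasticsearch.yml (I call it `node.attr.write_group` here, which is a name I made up) and uses the `index.routing.allocation.require.<attribute>` index setting to pin each index to one group. The index names and write counts are invented for the example, and the greedy split is just one simple way to balance the groups.

```python
import json

def split_by_writes(write_counts):
    """Greedily split indices into two groups with similar total writes."""
    groups = {"a": [], "b": []}
    totals = {"a": 0, "b": 0}
    # Place the busiest indices first, always into the currently lighter group.
    for index, count in sorted(write_counts.items(), key=lambda kv: kv[1], reverse=True):
        target = "a" if totals["a"] <= totals["b"] else "b"
        groups[target].append(index)
        totals[target] += count
    return groups, totals

def allocation_settings(group):
    """Settings body for PUT /<index>/_settings pinning an index to one node group."""
    # Assumes nodes are started with node.attr.write_group set to "a" or "b".
    return {"index.routing.allocation.require.write_group": group}

# Hypothetical index names and write counts.
write_counts = {"orders": 500, "events": 450, "users": 300, "logs": 250}
groups, totals = split_by_writes(write_counts)
for group, indices in groups.items():
    for index in indices:
        print(f"PUT /{index}/_settings", json.dumps(allocation_settings(group)))
```

The workers could reuse the same `groups` mapping to decide which set of nodes to send each bulk request to.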