We have an ES cluster with 12 percolator nodes. Today, data ingestion slowed down, and we noticed that 2 of the percolator nodes were using more CPU than the other 10. In this image, the user CPU usage of the problematic percolators is shown in green and yellow:
We tried restarting ES on the green node at 11:30am and 11:34am. Both times, ingestion improved immediately, but once ES was back up, CPU usage climbed again and ingestion slowed down again.
For this reason, we decided to terminate the percolator machine at 11:42am. Ingestion sped up immediately, and we haven't had any more issues since.
Why would one percolator node have such high CPU usage while the others sit between 20% and 40%?
Note that every percolator index is configured with 1 primary shard and 11 replicas, so that each of the 12 percolator nodes holds one copy of each shard (in an attempt to spread the load across the machines).
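For reference, here is a minimal sketch of that layout in Python. The `number_of_shards` and `number_of_replicas` keys are the standard Elasticsearch index settings. The helper name `total_shard_copies` and the constant `PERCOLATOR_NODES` are made up for illustration, and the sketch assumes all 12 nodes are eligible to hold these shards.

```python
# Sketch of the shard layout described above. The values come from the post;
# PERCOLATOR_NODES and total_shard_copies are hypothetical names.
PERCOLATOR_NODES = 12

# Settings body as it would be sent when creating a percolator index.
index_settings = {
    "settings": {
        "number_of_shards": 1,     # 1 primary shard
        "number_of_replicas": 11,  # 11 replica copies of that primary
    }
}


def total_shard_copies(settings: dict) -> int:
    """Return the number of shard copies (primaries plus replicas) for one index."""
    s = settings["settings"]
    return s["number_of_shards"] * (1 + s["number_of_replicas"])


copies = total_shard_copies(index_settings)

# Elasticsearch does not allocate two copies of the same shard to one node,
# so 12 copies on 12 eligible nodes works out to one copy per node.
print(copies, copies == PERCOLATOR_NODES)  # prints: 12 True
```

With 1 primary and 11 replicas, each index has 12 shard copies, which matches the 12 percolator nodes one to one.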
We are thinking about terminating the yellow node now, but is there anything we can do beforehand to troubleshoot why ES is using so much CPU?
(The node we terminated only had replicas, no primaries, not sure if that matters.)