1 of 10 nodes CPU bound

We have a cluster of 10 machines running 10 data nodes and 3 master nodes, with 10 replicated shards.
Sometimes, when we run many percolations and searches, CPU usage rises on all nodes, but not evenly: one node goes to 100% CPU while all the others stay at 50-80%. This makes most searches really slow (~20-25 s).

  • Can I determine why this node takes 100% CPU time but not the others?
  • Can I do something to distribute the load more evenly?

Do you use custom routing of docs? That may account for the unevenness.
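
For context: custom routing sends every document with the same routing value to the same shard, so a hot routing key can concentrate traffic on one node. A minimal sketch of indexing with a routing value; the host, index name, and routing key are made-up examples, and the /_doc path assumes a recent Elasticsearch:

    # Illustrative only: index a document with an explicit routing value.
    # Documents sharing the same routing value always land on the same shard.
    import requests

    requests.put(
        "http://localhost:9200/my_index/_doc/1",  # hypothetical index and doc id
        params={"routing": "user_42"},            # routing key chosen by the application
        json={"message": "example document"},
    )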

The percolations stand out as a possible culprit; if you have a particularly nasty percolator query, that might be the cause of the imbalance.
The best way to start is to use the hot threads API to see what's going on while a node is under CPU pressure:
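
A minimal sketch of calling it, assuming a node reachable on localhost:9200 (any node will do, since the API reports on every node in the cluster):

    # Fetch the hot threads report while the cluster is under load.
    import requests

    resp = requests.get("http://localhost:9200/_nodes/hot_threads")
    print(resp.text)  # plain-text stack snapshots of the busiest threads on each node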

No.

It fluctuates a lot. Most of the time it shows only 3 hot threads (on hardware with 4 hyper-threaded CPUs, with ES taking 800% of CPU time).
Shouldn't it show me 8 hot threads?

OK, I read the doc...

threads: number of hot threads to provide, defaults to 3.
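
So raising the threads parameter should cover all 8 hardware threads; a minimal sketch, again assuming a node on localhost:9200:

    # Request up to 8 hot threads per node instead of the default 3.
    import requests

    resp = requests.get(
        "http://localhost:9200/_nodes/hot_threads",
        params={"threads": 8},
    )
    print(resp.text)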

Now I can see that all threads are doing searches.
How can I find from the call stacks what is expensive in the searches?
