Thread selection and locking

Hi all,
We have encountered a slowdown of our Elasticsearch services, and profiling showed that most of the time is spent in java.nio.channels.selector.SelectImpl.select(), which I think means ES is waiting for the next available thread. We also looked at hot threads, and in many cases we get something like this:

79.5% (397.6ms out of 500ms) cpu usage by thread 'elasticsearch[ip-192-168-102-226-gloo][get][T#1]'
 10/10 snapshots sharing following 10 elements
   sun.misc.Unsafe.park(Native Method)
   java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
   java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:735)
   java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:644)
   java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1137)
   org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:162)
   java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   java.lang.Thread.run(Thread.java:745)

What is the best way to optimize around something like this? We have a throughput of a few million queries a day, but they are not uniformly distributed.
Thanks.

Add more power to the cluster by adding more nodes?

The thing is that we have 2 nodes which are quite powerful. Would a configuration with more, less powerful nodes be better, since more threads would be available?

Sometimes it is difficult to judge when you should add more nodes. From personal experience, my advice is to measure everything on all the nodes and monitor as much as possible. Anything can go wrong, and with a metric history you can quickly decide on the spot what to do.

You should (or must) have an overall view of the entire interconnected system before fixing the problem. The get thread pool in the snippet above may be a direct indicator of why the system became slow, but sometimes fixing what is in direct sight does not fix the root cause.

I know this is a more general answer than the specific problem you hit above, but I hope it helps you prevent such things from happening again.

hth

That stack trace means that the thread is waiting for work. This describes what is happening. You can get a better guess as to what is actually taking up the time by taking a few jstack snapshots. You'll see lots of threads just sitting there like this, waiting.
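
If it helps to see this outside of Elasticsearch, here is a minimal sketch in plain Java that reproduces the same kind of trace. It assumes nothing about ES internals: the class name IdleWorkerCheck and the "demo-get-" thread name prefix are made up for illustration, and the pool uses the JDK's default queue rather than the SizeBlockingQueue from the trace posted above. It starts a small pool, lets the workers run out of work, and then counts the threads whose stack contains ThreadPoolExecutor.getTask, which is the frame an idle worker parks in.

```java
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class IdleWorkerCheck {

    // True if the stack shows a pool worker blocked in getTask(),
    // i.e. the thread has no task and is waiting on the work queue.
    static boolean isIdlePoolWorker(StackTraceElement[] stack) {
        for (StackTraceElement frame : stack) {
            if (frame.getClassName().equals("java.util.concurrent.ThreadPoolExecutor")
                    && frame.getMethodName().equals("getTask")) {
                return true;
            }
        }
        return false;
    }

    // Counts live threads with the given name prefix that are idle pool workers.
    static int countIdleWorkers(String namePrefix) {
        int idle = 0;
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            if (e.getKey().getName().startsWith(namePrefix) && isIdlePoolWorker(e.getValue())) {
                idle++;
            }
        }
        return idle;
    }

    // Starts a 2-thread pool, runs two trivial tasks so both workers exist,
    // waits for them to go idle, and returns how many are parked in getTask().
    static int demoIdleCount() {
        final String prefix = "demo-get-"; // hypothetical name, for illustration only
        final AtomicInteger counter = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(2, r -> {
            Thread t = new Thread(r, prefix + counter.incrementAndGet());
            t.setDaemon(true);
            return t;
        });
        try {
            pool.submit(() -> { });
            pool.submit(() -> { });
            // Poll for up to about two seconds until both workers are idle.
            for (int i = 0; i < 40 && countIdleWorkers(prefix) < 2; i++) {
                Thread.sleep(50);
            }
            return countIdleWorkers(prefix);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            return -1;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("idle workers: " + demoIdleCount());
    }
}
```

The same check can be applied by eye to a handful of jstack dumps: threads whose stack runs through getTask are idle and can be ignored, and whatever else keeps showing up across snapshots is the more likely place where the time is going.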