I have an 8-node cluster (3 master and 5 data nodes). There are 5 shards and 4 replicas, so each of the data nodes has an identical set of shards. One particular node always shows high CPU usage. I have looked at other node stats such as disk space used and queries processed, and they seem identical across all nodes. I tried the hot_threads API and jstack, but the output appears similar across nodes. How can I debug why this node is misbehaving?
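For reference, this is roughly how I have been pulling the hot threads output (assuming the default HTTP port 9200 on localhost; the node name is just a placeholder):

    # hot threads across the whole cluster
    curl -s 'localhost:9200/_nodes/hot_threads?threads=5'

    # hot threads for just the suspect node
    curl -s 'localhost:9200/_nodes/<node_name>/hot_threads?threads=5'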
Are you sure it is Elasticsearch causing the load?
Is the OS swapping?
Have you run top, sorted by CPU, and pressed c to show the full process information?
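Something along these lines is what I would start with, plain Linux tooling, nothing Elasticsearch-specific (top -o needs a reasonably recent procps top):

    # is the box swapping? watch the si/so columns
    free -m
    vmstat 1 5

    # top sorted by CPU; press c inside top to toggle the full command line
    top -o %CPU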
Pretty sure it is ES. Nothing else runs on the machine and top shows high
CPU usage by ES.
Do any of the shards have an unusually large number of documents on them?
Are you using any custom routing?
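The cat shards API makes that easy to eyeball; something like this (host is a placeholder):

    # doc count and on-disk size per shard copy, per node
    curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,docs,store,node'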
I am not using custom routing. I haven't checked the number of docs in
each shard, but all data is replicated across all 5 data nodes. Each node
holds all 5 shards, ensuring it has a complete copy of the entire data.
I always use jstack. I usually run it a few times and dump the output to a file. I write a little bash script that tries to classify each stack trace with grep. Because I have a thing for silly bash scripts, I guess.
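Roughly the sort of thing I mean, as a minimal sketch; the way I find the PID and the grep patterns below are just examples, tune them to whatever actually shows up in your dumps:

    #!/usr/bin/env bash
    # Take a handful of jstack samples from the Elasticsearch process
    # and count which packages show up most in the sampled stacks.
    PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch | head -n1)
    OUT="/tmp/jstack-$(hostname)-$(date +%s).txt"

    for i in 1 2 3 4 5; do
        jstack "$PID" >> "$OUT"
        sleep 2
    done

    # crude classification: how many sampled stack lines mention each area
    for pattern in org.apache.lucene.search org.apache.lucene.index \
                   org.elasticsearch.index.engine java.net.Socket epollWait; do
        printf '%-40s %s\n' "$pattern" "$(grep -c "$pattern" "$OUT")"
    done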
jstack really is the best way. If it doesn't say anything, I check things like GC rates. It is probably also worth making sure that your problem node is running with the same configuration and that clients are pushing requests to the cluster randomly/round robin/whatever, just so long as they aren't hammering that node in particular.
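For the GC rates and the per-node configuration, the node stats and node info APIs should be enough; something like this (host is a placeholder):

    # GC collection counts/times and heap usage per node
    curl -s 'localhost:9200/_nodes/stats/jvm?pretty'

    # the settings each node actually started with, to spot config drift
    curl -s 'localhost:9200/_nodes/settings?pretty'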
I have 3 master nodes in the cluster which are behind a load balancer. I am
assuming the load balancer round-robins the requests to distribute load in a
reasonable manner. I took multiple dumps with jstack, but there don't seem to
be any differences between the loaded node and the other nodes.
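One way to sanity-check that assumption would be to snapshot the per-node request counters, wait a minute, snapshot again and compare, rather than trusting the load balancer; roughly (host is a placeholder):

    # search/indexing counters and HTTP connection counts per node
    curl -s 'localhost:9200/_nodes/stats/indices/search,indexing?pretty'
    curl -s 'localhost:9200/_nodes/stats/http?pretty'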