How to tackle performance issues?

Hi List,

I am seeing pretty long response times on some of my _cat queries and wanted to run them by you to see if this is expected:

time curl 'localhost:9200/_cat/nodes'
real    3m42.792s
user    0m0.024s
sys     0m0.000s

time curl 'localhost:9200/_cat/indices'
real    0m16.399s
user    0m0.004s
sys     0m0.004s

These are run on two 'client' nodes that are not currently servicing any queries. Is this expected? I took a data node out about 8 hours ago because it was constantly triggering messages like these on the master, and topbeat showed it had been quite overloaded for the past several days:

[2016-01-07 23:58:33,202][WARN ][transport                ] [bxb-sln-vm97] Received response for a request that has timed out, sent [161674ms] ago, timed out [146674ms] ago, action [cluster:monitor/nodes/stats[n]], node [{bxb-sln-srv-4}{8nI6Rm7vT5-vlit27tAwDA}{10.86.205.57}{10.86.205.57:9300}{master=false}], id [4241130]
[2016-01-07 23:59:05,559][WARN ][transport                ] [bxb-sln-vm97] Received response for a request that has timed out, sent [134029ms] ago, timed out [119029ms] ago, action [cluster:monitor/nodes/stats[n]], node [{bxb-sln-srv-4}{8nI6Rm7vT5-vlit27tAwDA}{10.86.205.57}{10.86.205.57:9300}{master=false}], id [4241904]
[2016-01-07 23:59:06,532][DEBUG][action.admin.cluster.node.stats] [bxb-sln-vm97] failed to execute on node [8nI6Rm7vT5-vlit27tAwDA]
ReceiveTimeoutTransportException[[bxb-sln-srv-4][10.86.205.57:9300][cluster:monitor/nodes/stats[n]] request_id [4243442] timed out after [15000ms]]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:645)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

The indices are still yellow, so perhaps shards and replicas are still being moved around after I took that data node offline.
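
I'll keep an eye on the recovery with something like this (exact columns vary by ES version; the relo and init columns in _cat/health are the relocating/initializing shard counts, and the grep is just a crude filter for recoveries whose stage isn't yet "done"):

curl 'localhost:9200/_cat/health?v'
curl 'localhost:9200/_cat/recovery?v' | grep -v done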

Another fundamental question: my cluster consists of heterogeneous machines:

  1. Data nodes: several blades with 8G RAM and spinning disks
  2. Data nodes: several VMs with 16G RAM and spinning disks
  3. Master nodes: VMs with 8G RAM and spinning disks
  4. Client nodes: VMs with 32G RAM and spinning disks

Since ES is flexible, I don't think it should suffer from heterogeneity, but I still wanted to run this by the list for thoughts on anything to keep an eye out for or optimize.

Thanks

No, that's not expected.

Possibly. How large were the relocating shards? What's your network infrastructure?
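
Something like this should show whether any shards are still on the move; the state column reads RELOCATING or INITIALIZING while they are:

curl 'localhost:9200/_cat/shards' | grep -E 'RELOCATING|INITIALIZING'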

How many cores do these boxes have?
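
If it's easier, something like this should report the processor count the JVM sees on each node (look for available_processors under the os section):

curl 'localhost:9200/_nodes/os?pretty'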

Are the VMs sharing hosts? Do the hosts have any other busy guests? What's the spindle setup on the host? What are the network interfaces on the host? Are they dedicated to guests, or shared?

How much heap do you have allocated to the data nodes?
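
Assuming your version supports these _cat headers, a quick way to check per-node heap is:

curl 'localhost:9200/_cat/nodes?v&h=name,heap.max,heap.percent,ram.max,master'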

Do you see any messages in the logs about long garbage collection pauses?
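
On the 1.x/2.x series those come from the monitor.jvm logger, so something along these lines should surface them (adjust the log path for your install):

grep 'monitor.jvm' /var/log/elasticsearch/*.log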

What else was occurring on the cluster at the time that you made this request?

Are you able to reproduce this now, after the aforementioned recovery is complete? If so, while a long-running _cat/nodes request is being serviced, can you use the hot threads API and share it here?
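
Something like this, captured while the slow request is in flight, would be ideal (the threads parameter caps how many hot threads are reported per node):

curl 'localhost:9200/_nodes/hot_threads?threads=5'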

No; you'll just see differing response times across the nodes, but I would not expect anything as egregious as this.