After the upgrade to 5.3, in a relatively simple setup, the master node runs into timeouts (org.elasticsearch.transport.ReceiveTimeoutTransportException) when contacting the data nodes whenever we try to add a new index pattern in Kibana.
The setup is as follows:
- 1 master node (monesmaster) with 28 GByte of RAM, ES starts with -Xms10g -Xmx10g
- 3 data nodes (mones1, 2 and 3) each with 14 GByte of RAM, ES starts with -Xms7g -Xmx7g
- ES has 154 indices, ~1500 shards, and ~18 million documents
- ~145 of the indices are logstash indices (with default settings)
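(For reference, the index and shard counts above can be reproduced with the _cat APIs; the node address is the one from the logs below:)

```shell
# Count indices and shards (any node will do):
curl -s 'http://10.20.1.21:9200/_cat/indices' | wc -l   # number of indices
curl -s 'http://10.20.1.21:9200/_cat/shards'  | wc -l   # number of shards
```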
Once Kibana starts looking for the Time-field name, the following happens:
- CPU load increases on all data nodes
- all data nodes report a heavy memory load (lots of messages like: [gc][4593] overhead, spent [257ms] collecting in the last [1s])
- master node reports request timeouts (org.elasticsearch.transport.ReceiveTimeoutTransportException: [mones1][10.20.1.21:9300][cluster:monitor/nodes/stats[n]] request_id [38670] timed out after [15001ms])
- Kibana reports a timeout in the UI (default after 30 seconds)
- eventually (couple of minutes) the master node reports that it has received responses for requests that have timed out ( [monesmaster] Received response for a request that has timed out, sent [30570ms] ago, timed out [15569ms] ago, action [cluster:monitor/nodes/stats[n]], node [{mones1})
- the data nodes recover (load gets back to normal, queries work normally)
Even when trying to add a new index pattern for an index that contains only 1 document, the same problem occurs.
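To narrow this down I was planning to capture what the data nodes are busy with while the load spikes. If I understand the docs correctly, something like this should work:

```shell
# While the index-pattern creation is running and CPU/GC load is high,
# capture what the data nodes are actually doing:
curl -s 'http://10.20.1.21:9200/_nodes/hot_threads?threads=5'

# The pending cluster tasks may also show what the master is waiting on:
curl -s 'http://10.20.1.21:9200/_cluster/pending_tasks?pretty'
```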
Is there a way to trace what kind of query Kibana issues that leads to such a heavy load?
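In case it helps: would enabling the search slow log with a zero threshold capture the query Kibana sends? My understanding (which may be wrong for 5.3) is that it is a per-index dynamic setting, roughly:

```shell
# Log every search on the logstash indices to see the query Kibana issues
# (a 0s threshold logs everything -- remember to reset it afterwards):
curl -s -XPUT 'http://10.20.1.21:9200/logstash-*/_settings' \
  -H 'Content-Type: application/json' -d '{
  "index.search.slowlog.threshold.query.warn": "0s",
  "index.search.slowlog.level": "info"
}'
```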
Or is there a workaround (timeout increase, etc.) to solve this?
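For the timeout part, this is the knob I found so far; I am not sure it is the right one, so corrections are welcome:

```yaml
# kibana.yml -- raise Kibana's request timeout (default 30000 ms):
elasticsearch.requestTimeout: 120000
```

That would presumably only hide the symptom on the Kibana side, though, not fix the load on the data nodes.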
Thanks for any help. I really appreciate it.