After the upgrade to 5.3, in a relatively simple setup, the master node runs into timeouts (org.elasticsearch.transport.ReceiveTimeoutTransportException) when contacting the data nodes whenever we try to add a new index pattern in Kibana.
The setup is as follows:
- 1 master node (monesmaster) with 28 GByte of RAM, ES starts with -Xms10g -Xmx10g
- 3 data nodes (mones1, 2 and 3) each with 14 GByte of RAM, ES starts with -Xms7g -Xmx7g
- ES has 154 indices, ~1500 shards, and ~18 million documents
- ~145 of the indices are logstash indices (with default settings)
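(For reference, the index and shard counts above can be reproduced with the _cat APIs; the node address is the one from the logs below:)

```shell
# Count indices and shards (any node will do):
curl -s 'http://10.20.1.21:9200/_cat/indices' | wc -l   # number of indices
curl -s 'http://10.20.1.21:9200/_cat/shards'  | wc -l   # number of shards
```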
Once Kibana starts looking for the Time-field name, the following happens:
- CPU load increases on all data nodes
- all data nodes report a heavy memory load (lots of messages like: [gc][4593] overhead, spent [257ms] collecting in the last [1s])
- master node reports request timeouts (org.elasticsearch.transport.ReceiveTimeoutTransportException: [mones1][10.20.1.21:9300][cluster:monitor/nodes/stats[n]] request_id [38670] timed out after [15001ms])
- Kibana reports a timeout in the UI (default after 30 seconds)
- eventually (couple of minutes) the master node reports that it has received responses for requests that have timed out ( [monesmaster] Received response for a request that has timed out, sent [30570ms] ago, timed out [15569ms] ago, action [cluster:monitor/nodes/stats[n]], node [{mones1})
- the data nodes recover (load gets back to normal, queries work normally)
Even when trying to add a new index pattern for an index that contains only 1 document, the same problem occurs.
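To narrow this down I was planning to capture what the data nodes are busy with while the load spikes. If I understand the docs correctly, something like this should work:

```shell
# While the index-pattern creation is running and CPU/GC load is high,
# capture what the data nodes are actually doing:
curl -s 'http://10.20.1.21:9200/_nodes/hot_threads?threads=5'

# The pending cluster tasks may also show what the master is waiting on:
curl -s 'http://10.20.1.21:9200/_cluster/pending_tasks?pretty'
```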
Is there a way to trace what kind of query Kibana issues that leads to such a heavy load?
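In case it helps: would enabling the search slow log with a zero threshold capture the query Kibana sends? My understanding (which may be wrong for 5.3) is that it is a per-index dynamic setting, roughly:

```shell
# Log every search on the logstash indices to see the query Kibana issues
# (a 0s threshold logs everything -- remember to reset it afterwards):
curl -s -XPUT 'http://10.20.1.21:9200/logstash-*/_settings' \
  -H 'Content-Type: application/json' -d '{
  "index.search.slowlog.threshold.query.warn": "0s",
  "index.search.slowlog.level": "info"
}'
```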
Or is there a workaround (timeout increase, etc.) to solve this?
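For the timeout part, this is the knob I found so far; I am not sure it is the right one, so corrections are welcome:

```yaml
# kibana.yml -- raise Kibana's request timeout (default 30000 ms):
elasticsearch.requestTimeout: 120000
```

That would presumably only hide the symptom on the Kibana side, though, not fix the load on the data nodes.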
Thanks for any help. I really appreciate it.