Single unresponsive node stalls overall cluster performance

At seemingly random times, one of my data nodes becomes unresponsive. The box's disk I/O gets pegged at about 200 MB/s of reads until the Elasticsearch service is restarted.

(Disk I/O graph omitted; the service was restarted at about 11:30 am.)
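For what it's worth, the read saturation is easy to confirm from the box while it's happening with plain iostat (a sketch, assuming sysstat is installed):

# extended per-device stats, throughput in MB/s, sampled every 5 seconds;
# a data disk sitting near 200 rMB/s with high %util confirms the read saturation
iostat -xm 5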

During this time, searches and indexing requests to the cluster time out, and the search queue backs up on the affected node:

curl -s 'es5-client01:9200/_cat/thread_pool/search' | sort
es5-client01 search  0   0    0
es5-client02 search  0   0    0
es5-client03 search  0   0    0
es5-data01   search  0   0    0
es5-data02   search  0   0    0
es5-data03   search  0   0    0
es5-data04   search  0   0    0
es5-data05   search 13 986 6057
es5-data06   search  0   0    0
es5-data07   search  0   0    0
es5-data08   search  0   0    0
es5-data09   search  0   0    0
es5-data10   search  0   0    0
es5-data11   search  0   0    0
es5-data12   search  0   0    0
es5-master01 search  0   0    0
es5-master02 search  0   0    0
es5-master03 search  0   0    0
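Those columns are the default node_name, name, active, queue, and rejected, so es5-data05 has 13 active searches, 986 queued, and 6057 rejections while every other node sits idle. A variant with explicit headers (a sketch, assuming the client node can still answer cat API requests) is easier to re-check while the node is in this state:

# same check with column headers, refreshed every 5 seconds
watch -n 5 "curl -s 'es5-client01:9200/_cat/thread_pool/search?v&h=node_name,active,queue,rejected,completed'"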

Restarting the service on the affected data node immediately improves performance.

I've captured the hot threads output from the affected node: https://pastebin.com/np698JcX
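For anyone wanting to reproduce that kind of capture, a few snapshots in a loop along these lines should do it; the node filter, thread count, and output file here are only illustrative:

# take a few hot-threads snapshots from the affected node while it is pegged
for i in 1 2 3; do
  curl -s 'es5-client01:9200/_nodes/es5-data05/hot_threads?threads=5' >> hot_threads_es5-data05.txt
  sleep 10
done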

I'm running Elasticsearch v5.5.2 with the Search Guard plugin.

I've run into this issue several times over the last month but am at a loss as to what the cause could be. Any advice?

Is there anything in the logs around the time the node gets pegged? Have you been able to track down what data is being written?

There are node removed and added events in the logs that correlate with the beginning of the high disk I/O, but that might be a coincidence (starting at about 15:15 in the timestamps below):

[2018-01-03T11:36:09,925][INFO ][o.e.m.j.JvmGcMonitorService] [es5-data05] [gc][185906] overhead, spent [430ms] collecting in the last [1.3s]
[2018-01-03T15:14:29,865][INFO ][o.e.c.s.ClusterService   ] [es5-data05] removed {{es5-client01}{JvzC36f5QWCv5wpykOl1gg}{nIZKAKNXSr6IA9k1TZ5Y7g}{10.208.0.137}{10.208.0.137:9300}{ml.max_open_jobs=10, ml.enabled=true},}, reason: zen-disco-receive(from master [master {es5-master01}{n-WLDE5PSC6Z727V_Jx4CQ}{RN6za0HDTjSySu4BcyLaXw}{10.100.4.27}{10.100.4.27:9300}{ml.max_open_jobs=10, ml.enabled=true} committed version [11621]])
[2018-01-03T15:15:04,882][INFO ][o.e.c.s.ClusterService   ] [es5-data05] added {{es5-client01}{JvzC36f5QWCv5wpykOl1gg}{jTMJAX2GTE609ORBkLjMXw}{10.208.0.137}{10.208.0.137:9300}{ml.max_open_jobs=10, ml.enabled=true},}, reason: zen-disco-receive(from master [master {es5-master01}{n-WLDE5PSC6Z727V_Jx4CQ}{RN6za0HDTjSySu4BcyLaXw}{10.100.4.27}{10.100.4.27:9300}{ml.max_open_jobs=10, ml.enabled=true} committed version [11622]])
[2018-01-03T16:30:54,196][INFO ][o.e.x.m.j.p.NativeController] Native controller process has stopped - no new native processes can be started
[2018-01-03T16:30:54,341][INFO ][o.e.n.Node               ] [es5-data05] stopping ...
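To pull those removed/added events out for other occurrences and see whether they always line up with the I/O spike, a quick grep over the window works; the log path below is the default packaged location and may differ:

# pull cluster-state add/remove events for the incident window
grep -e 'removed {' -e 'added {' /var/log/elasticsearch/*.log | grep '2018-01-03T15:1'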

Note that nothing is being written to disk; it's high disk reads. When it happens again, I can try to track down what's being read.
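A rough plan for next time, assuming a standard Linux box with sysstat/iotop installed, a single Elasticsearch process, and the stock ES 5.x main class name:

# per-thread disk read/write rates for the Elasticsearch JVM, sampled every 5 seconds
pidstat -d -t -p "$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)" 5

# or interactively, showing only processes/threads that are actually doing I/O
sudo iotop -o

The per-thread TIDs from pidstat can then be matched against a jstack thread dump (the nid= values are the same IDs in hex) to name the Elasticsearch/Lucene threads doing the reads, provided a full JDK is on the box.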
