Single unresponsive node stalls overall cluster performance


#1

At seemingly random times, one of my data nodes becomes unresponsive. The box's disk I/O gets pegged at about 200 MB/s of reads until the Elasticsearch service is restarted.

(Disk I/O graph omitted; the service was restarted at about 11:30 am.)
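For reference, this is roughly how I'd confirm the read rate on the box (assuming the sysstat tools, iostat and pidstat, are installed):

# extended device stats every 5 seconds; rMB/s is the read throughput per disk
iostat -xm 5

# disk reads attributed to the Elasticsearch process itself (kB_rd/s column)
pidstat -d -p $(pgrep -f org.elasticsearch.bootstrap.Elasticsearch) 5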

During this time, search and index requests to the cluster time out, and the search thread pool queue backs up on the affected node (columns are node_name, name, active, queue, rejected):

GET es5-client01:9200/_cat/thread_pool/search | sort
es5-client01 search  0   0    0
es5-client02 search  0   0    0
es5-client03 search  0   0    0
es5-data01   search  0   0    0
es5-data02   search  0   0    0
es5-data03   search  0   0    0
es5-data04   search  0   0    0
es5-data05   search 13 986 6057
es5-data06   search  0   0    0
es5-data07   search  0   0    0
es5-data08   search  0   0    0
es5-data09   search  0   0    0
es5-data10   search  0   0    0
es5-data11   search  0   0    0
es5-data12   search  0   0    0
es5-master01 search  0   0    0
es5-master02 search  0   0    0
es5-master03 search  0   0    0
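
(The same check can be run with explicit column headers; ?v and h= are standard _cat parameters:)

curl -s 'es5-client01:9200/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected'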

Restarting the service on the affected data node immediately improves performance.

I've captured the hot-threads on the affected node: https://pastebin.com/np698JcX
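
(For anyone who wants to reproduce the capture, the nodes hot_threads API accepts threads and interval parameters, e.g.:)

curl -s 'es5-data05:9200/_nodes/es5-data05/hot_threads?threads=10&interval=500ms' > hot_threads.txt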

I'm running Elasticsearch v5.5.2 with the Search Guard plugin.

I've run into this issue several times over the last month but am at a loss as to what the cause could be. Any advice?


(Christian Dahlqvist) #2

Is there anything in the logs around the time the node gets pegged? Have you been able to track down what data is being written?


#3

There's a node-removed and node-added event in the logs that correlates with the beginning of the high disk I/O, but that might be a coincidence (beginning at about 15:15 in the timestamps below):

[2018-01-03T11:36:09,925][INFO ][o.e.m.j.JvmGcMonitorService] [es5-data05] [gc][185906] overhead, spent [430ms] collecting in the last [1.3s]
[2018-01-03T15:14:29,865][INFO ][o.e.c.s.ClusterService   ] [es5-data05] removed {{es5-client01}{JvzC36f5QWCv5wpykOl1gg}{nIZKAKNXSr6IA9k1TZ5Y7g}{10.208.0.137}{10.208.0.137:9300}{ml.max_open_jobs=10, ml.enabled=true},}, reason: zen-disco-receive(from master [master {es5-master01}{n-WLDE5PSC6Z727V_Jx4CQ}{RN6za0HDTjSySu4BcyLaXw}{10.100.4.27}{10.100.4.27:9300}{ml.max_open_jobs=10, ml.enabled=true} committed version [11621]])
[2018-01-03T15:15:04,882][INFO ][o.e.c.s.ClusterService   ] [es5-data05] added {{es5-client01}{JvzC36f5QWCv5wpykOl1gg}{jTMJAX2GTE609ORBkLjMXw}{10.208.0.137}{10.208.0.137:9300}{ml.max_open_jobs=10, ml.enabled=true},}, reason: zen-disco-receive(from master [master {es5-master01}{n-WLDE5PSC6Z727V_Jx4CQ}{RN6za0HDTjSySu4BcyLaXw}{10.100.4.27}{10.100.4.27:9300}{ml.max_open_jobs=10, ml.enabled=true} committed version [11622]])
[2018-01-03T16:30:54,196][INFO ][o.e.x.m.j.p.NativeController] Native controller process has stopped - no new native processes can be started
[2018-01-03T16:30:54,341][INFO ][o.e.n.Node               ] [es5-data05] stopping ...

Note that nothing is being written to disk; it's high disk reads. When it happens again, I'll try to track down what's being read.
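
A rough plan for next time, assuming iotop and the JDK's jstack are available on the node: log per-thread disk I/O in batch mode, then match the busiest thread IDs to Elasticsearch thread names in a thread dump (jstack reports the native thread ID as a hex nid):

# log active threads' I/O every 5 seconds (batch mode, timestamps, quiet headers)
sudo iotop -botqqq -d 5 > iotop.log

# thread dump of the Elasticsearch JVM (run as the same user Elasticsearch runs as)
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
jstack "$ES_PID" > jstack.txt

# convert a busy TID from iotop.log (e.g. 12345) to the hex nid used in jstack.txt
printf 'nid=0x%x\n' 12345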

