Single unresponsive node stalls overall cluster performance


#1

At seemingly random times, one of my data nodes becomes unresponsive. The box's disk I/O gets pegged at about 200 MB/s of reads until the Elasticsearch service is restarted.

(Disk I/O graph omitted; the service was restarted at about 11:30 am.)
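For reference, this is roughly how I'd confirm the read rate on the box (assuming the sysstat tools, iostat and pidstat, are installed):

# extended device stats every 5 seconds; rMB/s is the read throughput per disk
iostat -xm 5

# disk reads attributed to the Elasticsearch process itself (kB_rd/s column)
pidstat -d -p $(pgrep -f org.elasticsearch.bootstrap.Elasticsearch) 5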

During this time, search and index requests to the cluster time out, and the search thread pool queue backs up on the affected node (columns are node_name, name, active, queue, rejected):

GET es5-client01:9200/_cat/thread_pool/search | sort
es5-client01 search  0   0    0
es5-client02 search  0   0    0
es5-client03 search  0   0    0
es5-data01   search  0   0    0
es5-data02   search  0   0    0
es5-data03   search  0   0    0
es5-data04   search  0   0    0
es5-data05   search 13 986 6057
es5-data06   search  0   0    0
es5-data07   search  0   0    0
es5-data08   search  0   0    0
es5-data09   search  0   0    0
es5-data10   search  0   0    0
es5-data11   search  0   0    0
es5-data12   search  0   0    0
es5-master01 search  0   0    0
es5-master02 search  0   0    0
es5-master03 search  0   0    0
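
(The same check can be run with explicit column headers; ?v and h= are standard _cat parameters:)

curl -s 'es5-client01:9200/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected'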

Restarting the service on the affected data node immediately improves performance.

I've captured the hot-threads on the affected node: https://pastebin.com/np698JcX
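
(For anyone who wants to reproduce the capture, the nodes hot_threads API accepts threads and interval parameters, e.g.:)

curl -s 'es5-data05:9200/_nodes/es5-data05/hot_threads?threads=10&interval=500ms' > hot_threads.txt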

I'm running Elasticsearch v5.5.2 with the Search Guard plugin.

I've run into this issue several times over the last month but am at a loss as to what the cause could be. Any advice?


(Christian Dahlqvist) #2

Is there anything in the logs around the time the node gets pegged? Have you been able to track down what data is being written?


#3

There's a node-removed and node-added event in the logs that correlates with the beginning of the high disk I/O, but that might be a coincidence (beginning at about 15:15 in the timestamps below):

[2018-01-03T11:36:09,925][INFO ][o.e.m.j.JvmGcMonitorService] [es5-data05] [gc][185906] overhead, spent [430ms] collecting in the last [1.3s]
[2018-01-03T15:14:29,865][INFO ][o.e.c.s.ClusterService   ] [es5-data05] removed {{es5-client01}{JvzC36f5QWCv5wpykOl1gg}{nIZKAKNXSr6IA9k1TZ5Y7g}{10.208.0.137}{10.208.0.137:9300}{ml.max_open_jobs=10, ml.enabled=true},}, reason: zen-disco-receive(from master [master {es5-master01}{n-WLDE5PSC6Z727V_Jx4CQ}{RN6za0HDTjSySu4BcyLaXw}{10.100.4.27}{10.100.4.27:9300}{ml.max_open_jobs=10, ml.enabled=true} committed version [11621]])
[2018-01-03T15:15:04,882][INFO ][o.e.c.s.ClusterService   ] [es5-data05] added {{es5-client01}{JvzC36f5QWCv5wpykOl1gg}{jTMJAX2GTE609ORBkLjMXw}{10.208.0.137}{10.208.0.137:9300}{ml.max_open_jobs=10, ml.enabled=true},}, reason: zen-disco-receive(from master [master {es5-master01}{n-WLDE5PSC6Z727V_Jx4CQ}{RN6za0HDTjSySu4BcyLaXw}{10.100.4.27}{10.100.4.27:9300}{ml.max_open_jobs=10, ml.enabled=true} committed version [11622]])
[2018-01-03T16:30:54,196][INFO ][o.e.x.m.j.p.NativeController] Native controller process has stopped - no new native processes can be started
[2018-01-03T16:30:54,341][INFO ][o.e.n.Node               ] [es5-data05] stopping ...

Note that nothing is being written to disk; it's high disk reads. When it happens again, I'll try to track down what's being read.
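
A rough plan for next time, assuming iotop and the JDK's jstack are available on the node: log per-thread disk I/O in batch mode, then match the busiest thread IDs to Elasticsearch thread names in a thread dump (jstack reports the native thread ID as a hex nid):

# log active threads' I/O every 5 seconds (batch mode, timestamps, quiet headers)
sudo iotop -botqqq -d 5 > iotop.log

# thread dump of the Elasticsearch JVM (run as the same user Elasticsearch runs as)
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)
jstack "$ES_PID" > jstack.txt

# convert a busy TID from iotop.log (e.g. 12345) to the hex nid used in jstack.txt
printf 'nid=0x%x\n' 12345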

