At seemingly random times, one of my data nodes becomes unresponsive. The box's disk reads get pegged at about 200 MB/s until the Elasticsearch service is restarted.
There's a node-removed/node-added pair in the logs that correlates with the start of the high disk I/O (around 15:15 in the timestamps below), but that might be a coincidence:
[2018-01-03T11:36:09,925][INFO ][o.e.m.j.JvmGcMonitorService] [es5-data05] [gc][185906] overhead, spent [430ms] collecting in the last [1.3s]
[2018-01-03T15:14:29,865][INFO ][o.e.c.s.ClusterService ] [es5-data05] removed {{es5-client01}{JvzC36f5QWCv5wpykOl1gg}{nIZKAKNXSr6IA9k1TZ5Y7g}{10.208.0.137}{10.208.0.137:9300}{ml.max_open_jobs=10, ml.enabled=true},}, reason: zen-disco-receive(from master [master {es5-master01}{n-WLDE5PSC6Z727V_Jx4CQ}{RN6za0HDTjSySu4BcyLaXw}{10.100.4.27}{10.100.4.27:9300}{ml.max_open_jobs=10, ml.enabled=true} committed version [11621]])
[2018-01-03T15:15:04,882][INFO ][o.e.c.s.ClusterService ] [es5-data05] added {{es5-client01}{JvzC36f5QWCv5wpykOl1gg}{jTMJAX2GTE609ORBkLjMXw}{10.208.0.137}{10.208.0.137:9300}{ml.max_open_jobs=10, ml.enabled=true},}, reason: zen-disco-receive(from master [master {es5-master01}{n-WLDE5PSC6Z727V_Jx4CQ}{RN6za0HDTjSySu4BcyLaXw}{10.100.4.27}{10.100.4.27:9300}{ml.max_open_jobs=10, ml.enabled=true} committed version [11622]])
[2018-01-03T16:30:54,196][INFO ][o.e.x.m.j.p.NativeController] Native controller process has stopped - no new native processes can be started
[2018-01-03T16:30:54,341][INFO ][o.e.n.Node ] [es5-data05] stopping ...
Note: nothing is being written to disk; it's all disk reads. When it happens again, I can try to track down what's being read.
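For reference, something along these lines is what I plan to run on the box next time it happens. It's only a rough sketch for my setup: the node name, HTTP port, and Elasticsearch PID below are placeholders I'd fill in at the time. It samples the Elasticsearch process's cumulative read bytes from /proc/<pid>/io to confirm the read rate, then pulls the node's hot threads so I can see whether the reads line up with merges, searches, or recovery.

#!/usr/bin/env python3
"""Sketch for narrowing down what an Elasticsearch node is reading.

Placeholders for my environment: HTTP endpoint, node name, and the
Elasticsearch process PID. Needs permission to read /proc/<pid>/io
(typically root or the elasticsearch user).
"""
import time
import urllib.request

ES_HOST = "http://localhost:9200"  # placeholder HTTP endpoint
NODE = "es5-data05"                # the node that goes unresponsive
ES_PID = 12345                     # placeholder: actual Elasticsearch PID

def hot_threads():
    """Fetch the node's hot threads; heavy read activity usually shows up here."""
    url = f"{ES_HOST}/_nodes/{NODE}/hot_threads?threads=5"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode()

def read_bytes(pid):
    """Cumulative bytes this process has caused to be read from storage."""
    with open(f"/proc/{pid}/io") as f:
        for line in f:
            if line.startswith("read_bytes:"):
                return int(line.split()[1])
    return 0

if __name__ == "__main__":
    before = read_bytes(ES_PID)
    time.sleep(10)
    after = read_bytes(ES_PID)
    print(f"ES process read rate: {(after - before) / 10 / 1024**2:.1f} MB/s over 10s")
    print(hot_threads())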