We use Elasticsearch 7.2.0.
Yesterday we had an incident where a query with a request size of nearly 4-5 MB was fired by one of our clients. We use our master nodes as coordinating nodes (I know, bad practice).
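For anyone curious, a minimal sketch of what properly separated roles could look like in elasticsearch.yml on 7.x (these are standard 7.x setting names; the values are illustrative, not our actual config):

# dedicated master-eligible node: holds no data, does no ingest
node.master: true
node.data: false
node.ingest: false

# a cap lower than the 100mb default would reject oversized request bodies up front
http.max_content_length: 16mb

A coordinating-only node would instead set all three role settings to false.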
One of our master nodes' load average and CPU spiked up to 200% and 99% respectively. The cluster lost the master node, and it took some time for a new master to be re-elected. Surprisingly, once a master was elected, shards were marked as stale and the cluster went RED (can somebody guess why?). Anyway, 30 minutes later we ran the reroute API with retry_failed to bring the cluster back to yellow. I was surprised that the shards were marked stale!
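For reference, the retry was essentially the call below (host/port are placeholders), and the allocation explain API is the usual way to see why a shard copy is considered stale or unassigned:

curl -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true'
curl -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty'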
Some relevant logs from around that time:
Follower/leader check failures (es7advcl02-02 was the master node that was being crippled):
[es7advcl02-02] Received response for a request that has timed out, sent [13073ms] ago, timed out [3001ms] ago, action [internal:coordination/fault_detection/follower_check], node [{es7advcl02-07}{UOigj6TkQMW2zAJn0WmE1A}{R_W2oe_STqypM5KWfVFyDg}{10.0.7.50}{10.0.7.50:9301}{nodetag=nonissues, ml.machine_memory=64388841472, ml.max_open_jobs=20, xpack.installed=true}], id [6989792142]
[es7advcl02-14] Received response for a request that has timed out, sent [12609ms] ago, timed out [2602ms] ago, action [internal:coordination/fault_detection/leader_check], node [{es7advcl02-02}{vksespSzR9222wbJQPbrVA}{7qcK10LcT7a_5rlzR5d5tA}{10.0.11.79}{10.0.11.79:9301}{ml.machine_memory=8149188608, ml.max_open_jobs=20, xpack.installed=true}], id [446407023]
^ lots of these from different data nodes
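The ~10s gap between "sent" and "timed out" in those messages lines up with the 7.x fault-detection defaults. For reference, these are the relevant settings, shown with their default values (which is what the ~10s timeout in our logs suggests we are running):

cluster.fault_detection.follower_check.interval: 1s
cluster.fault_detection.follower_check.timeout: 10s
cluster.fault_detection.follower_check.retry_count: 3
cluster.fault_detection.leader_check.interval: 1s
cluster.fault_detection.leader_check.timeout: 10s
cluster.fault_detection.leader_check.retry_count: 3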
Then it looked like the cluster had formed:
[es7advcl02-02] node-join[{es7advcl02-08}{yffmKXk1RdGZY7spbUdnqA}{3Zx8EElyQWq9WGTE80_WPw}{10.0.11.141}{10.0.11.141:9301}{nodetag=nonissues, ml.machine_memory=64388841472, ml.max_open_jobs=20, xpack.installed=true} join existing leader, {es7advcl02-10}{D19ebXeBQ2yJKeky3iypUg}{w290_FcWSmeYUD1ijFE6iA}{10.0.7.13}{10.0.7.13:9301}{nodetag=issues, ml.machine_memory=64388841472, ml.max_open_jobs=20, xpack.installed=true} join existing leader], term: 14, version: 220553, reason: added {{es7advcl02-10}{D19ebXeBQ2yJKeky3iypUg}{w290_FcWSmeYUD1ijFE6iA}{10.0.7.13}{10.0.7.13:9301}{nodetag=issues, ml.machine_memory=64388841472, ml.max_open_jobs=20, xpack.installed=true},}
A couple of seconds later:
[es7advcl02-10] master not discovered yet: have discovered [{es7advcl02-01}{zlNZpvnpTZ-0F529R9QwLQ}{4TjfPFXRSue2Wcf_dWutTw}{10.0.11.62}{10.0.11.62:9301}{ml.machine_memory=8149188608, ml.max_open_jobs=20, xpack.installed=true}, {es7advcl02-03}{dBnlR_mvRN-bq7d7oNC1tg}{Mp5_-qXLT_qJmhzZnheaVA}{10.0.7.137}{10.0.7.137:9301}{ml.machine_memory=8061108224, ml.max_open_jobs=20, xpack.installed=true}, {es7advcl02-02}{vksespSzR9222wbJQPbrVA}{7qcK10LcT7a_5rlzR5d5tA}{10.0.11.79}{10.0.11.79:9301}{ml.machine_memory=8149188608, ml.max_open_jobs=20, xpack.installed=true}]; discovery will continue using [10.0.11.197:9301, 10.0.11.4:9301, 10.0.11.205:9301, 10.0.11.79:9301, 10.0.7.159:9301, 10.0.7.50:9301, 10.0.11.27:9301, 10.0.13.117:9301, 10.0.7.137:9301, 10.0.7.95:9301, 10.0.11.62:9301, 10.0.11.165:9301, 10.0.11.141:9301] from hosts providers and [{es7advcl02-15}{n3pxz-nESre4LFpBTx6tdw}{z9k5aqq6Qe-V3aVO7LeQpg}{10.0.11.27}{10.0.11.27:9301}{nodetag=issues, ml.machine_memory=64388841472, ml.max_open_jobs=20, xpack.installed=true}, {es7advcl02-03}{dBnlR_mvRN-bq7d7oNC1tg}{Mp5_-qXLT_qJmhzZnheaVA}{10.0.7.137}{10.0.7.137:9301}{ml.machine_memory=8061108224, ml.max_open_jobs=20, xpack.installed=true}, {es7advcl02-12}{rXuyhi33SYSRqdqeMuOeqw}{k_jtyiD_T_eYzz0P7cDjtQ}{10.0.7.95}{10.0.7.95:9301}{nodetag=issues, ml.machine_memory=64388841472, ml.max_open_jobs=20, xpack.installed=true}, {es7advcl02-13}{Ek04Z48vRNOBXbNS5SQZqw}{wdWXtgGgS-eibaEo1v5nDw}{10.0.11.165}{10.0.11.165:9301}{nodetag=issues, ml.machine_memory=64388841472, ml.max_open_jobs=20, xpack.installed=true}, {es7advcl02-11}{bnXSHTgSRV63OZ-scrCUCA}{9tidwJH1Rt2Q5L5usHcTKA}{10.0.11.205}{10.0.11.205:9301}{nodetag=issues, ml.machine_memory=64388841472, ml.max_open_jobs=20, xpack.installed=true}, {es7advcl02-16}{6vaOs45ISuuLYCqPE68-Rg}{bzlu3HofT56ZMHlEAFJE-A}{10.0.7.159}{10.0.7.159:9301}{nodetag=issues, ml.machine_memory=64388841472, ml.max_open_jobs=20, xpack.installed=true}, {es7advcl02-02}{vksespSzR9222wbJQPbrVA}{7qcK10LcT7a_5rlzR5d5tA}{10.0.11.79}{10.0.11.79:9301}{ml.machine_memory=8149188608, ml.max_open_jobs=20, xpack.installed=true}, {es7advcl02-17}{wyCmlXEmRBiqx3UTMrt9pA}{vMTUsCdwSYah2YuIAd0udA}{10.0.11.4}{10.0.11.4:9301}{nodetag=issues, ml.machine_memory=64388841472, ml.max_open_jobs=20, xpack.installed=true}, {es7advcl02-07}{UOigj6TkQMW2zAJn0WmE1A}{R_W2oe_STqypM5KWfVFyDg}{10.0.7.50}{10.0.7.50:9301}{nodetag=nonissues, ml.machine_memory=64388841472, ml.max_open_jobs=20, xpack.installed=true}, {es7advcl02-01}{zlNZpvnpTZ-0F529R9QwLQ}{4TjfPFXRSue2Wcf_dWutTw}{10.0.11.62}{10.0.11.62:9301}{ml.machine_memory=8149188608, ml.max_open_jobs=20, xpack.installed=true}, {es7advcl02-10}{D19ebXeBQ2yJKeky3iypUg}{w290_FcWSmeYUD1ijFE6iA}{10.0.7.13}{10.0.7.13:9301}{nodetag=issues, ml.machine_memory=64388841472, xpack.installed=true, ml.max_open_jobs=20}, {es7advcl02-14}{O4OzGrypS6i71XGPJqpOKQ}{a4N-GpzVSeClkAnQU-_jiA}{10.0.11.197}{10.0.11.197:9301}{nodetag=issues, ml.machine_memory=64388841472, ml.max_open_jobs=20, xpack.installed=true}, {es7advcl02-09}{5JG2mXpmQ2Ca3C3y4HAUOw}{a45pc4khQyGd_m_4NCZmtA}{10.0.13.117}{10.0.13.117:9301}{nodetag=nonissues, ml.machine_memory=64388841472, ml.max_open_jobs=20, xpack.installed=true}, {es7advcl02-08}{yffmKXk1RdGZY7spbUdnqA}{3Zx8EElyQWq9WGTE80_WPw}{10.0.11.141}{10.0.11.141:9301}{nodetag=nonissues, ml.machine_memory=64388841472, ml.max_open_jobs=20, xpack.installed=true}] from last-known cluster state; node term 14, last-accepted version 220551 in term 14
[es7advcl02-10] Received response for a request that has timed out, sent [14411ms] ago, timed out [11409ms] ago, action [internal:discovery/request_peers], node [{es7advcl02-02}{vksespSzR9222wbJQPbrVA}{7qcK10LcT7a_5rlzR5d5tA}{10.0.11.79}{10.0.11.79:9301}{ml.machine_memory=8149188608, ml.max_open_jobs=20, xpack.installed=true}], id [212488552]
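The addresses it keeps probing ("discovery will continue using ... from hosts providers") come from the discovery configuration; in 7.x that is something like the following (a sketch, using a few addresses from the log above, not our full list):

discovery.seed_hosts: ["10.0.11.79:9301", "10.0.7.13:9301", "10.0.7.137:9301"]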
This is what the hot threads output looked like:
22.6% (113.1ms out of 500ms) cpu usage by thread 'elasticsearch[es7advcl01-03][transport_worker][T#3]'
10/10 snapshots sharing following 77 elements
java.base@12.0.1/java.util.regex.Pattern$Branch.match(Pattern.java:4736)
java.base@12.0.1/java.util.regex.Pattern$Start.match(Pattern.java:3606)
java.base@12.0.1/java.util.regex.Matcher.search(Matcher.java:1729)
java.base@12.0.1/java.util.regex.Matcher.find(Matcher.java:746)
java.base@12.0.1/java.util.regex.Pattern.split(Pattern.java:1264)
java.base@12.0.1/java.lang.String.split(String.java:2326)
java.base@12.0.1/java.lang.String.split(String.java:2373)
app//org.elasticsearch.common.logging.JsonThrowablePatternConverter.formatJson(JsonThrowablePatternConverter.java:87)
app//org.elasticsearch.common.logging.JsonThrowablePatternConverter.format(JsonThrowablePatternConverter.java:72)
app//org.apache.logging.log4j.core.pattern.PatternFormatter.format(PatternFormatter.java:38)
app//org.apache.logging.log4j.core.layout.PatternLayout$PatternSerializer.toSerializable(PatternLayout.java:334)
app//org.apache.logging.log4j.core.layout.PatternLayout.toText(PatternLayout.java:233)
app//org.apache.logging.log4j.core.layout.PatternLayout.encode(PatternLayout.java:218)
app//org.elasticsearch.common.logging.ESJsonLayout.encode(ESJsonLayout.java:108)
app//org.elasticsearch.common.logging.ESJsonLayout.encode(ESJsonLayout.java:42)
app//org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.directEncodeEvent(AbstractOutputStreamAppender.java:177)
app//org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.tryAppend(AbstractOutputStreamAppender.java:170)
app//org.apache.logging.log4j.core.appender.AbstractOutputStreamAppender.append(AbstractOutputStreamAppender.java:161)
app//org.apache.logging.log4j.core.appender.RollingFileAppender.append(RollingFileAppender.java:309)
app//org.apache.logging.log4j.core.config.AppenderControl.tryCallAppender(AppenderControl.java:156)
app//org.apache.logging.log4j.core.config.AppenderControl.callAppender0(AppenderControl.java:129)
app//org.apache.logging.log4j.core.config.AppenderControl.callAppenderPreventRecursion(AppenderControl.java:120)
app//org.apache.logging.log4j.core.config.AppenderControl.callAppender(AppenderControl.java:84)
app//org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:464)
app//org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:448)
app//org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:431)
app//org.apache.logging.log4j.core.config.LoggerConfig.logParent(LoggerConfig.java:455)
app//org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:450)
app//org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:431)
app//org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:406)
app//org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:63)
app//org.apache.logging.log4j.core.Logger.logMessage(Logger.java:146)
app//org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2170)
app//org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2125)
app//org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2108)
app//org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1827)
app//org.apache.logging.log4j.spi.AbstractLogger.debug(AbstractLogger.java:293)
app//org.elasticsearch.action.search.InitialSearchPhase.onShardFailure(InitialSearchPhase.java:119)
app//org.elasticsearch.action.search.InitialSearchPhase.access$200(InitialSearchPhase.java:50)
app//org.elasticsearch.action.search.InitialSearchPhase$2.onFailure(InitialSearchPhase.java:273)
app//org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:73)
app//org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59)
app//org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:441)
app//org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1111)
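For what it's worth, that dump came from the nodes hot threads API; the call was something like this (the 500ms sampling interval matches the "113.1ms out of 500ms" header above):

curl -XGET 'http://localhost:9200/_nodes/hot_threads?threads=3&interval=500ms'

Note where the CPU is going: the transport worker is spending its time inside log4j's JSON layout, formatting shard-failure debug messages (JsonThrowablePatternConverter splitting a throwable's stack trace via String.split), rather than in search execution itself.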