I'm a novice with Elasticsearch, looking into some issues today. The company's subject matter expert is in a different timezone, where it is well past business hours on a Friday night. This doesn't seem critical, so I'm not going to page him.
Starting a couple of hours ago, the hourly snapshots of this cluster (es12, version 7.6.1, with 3 master and 12 data nodes) started having issues. They normally complete in ~5 minutes, but the one from 5 hours ago still has not completed.
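For reference, this is roughly how I've been polling for the stuck snapshot. The cluster address and repository name below are placeholders (ours are redacted); `_snapshot/<repo>/_current` is the 7.x endpoint that lists snapshots still in progress:

```python
import json
import urllib.request

# Placeholder cluster address and repository name; ours are redacted.
ES = "http://localhost:9200"
REPO = "my-repo"

# GET _snapshot/<repo>/_current lists snapshots that are still running.
url = f"{ES}/_snapshot/{REPO}/_current"

def current_snapshots():
    """Fetch the list of snapshots currently running in REPO."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# On the live cluster: print(json.dumps(current_snapshots(), indent=2))
```

That call keeps showing the 5-hour-old snapshot as still in progress.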
Looking at system metrics I see that one node, es12-d-f1ba7b3d91e22c6ca475c477 (one of the data nodes in cluster es12), has had continuously high network out since the start of the snapshot, so I initially suspected that node of having issues. Looking in the logs there, I see WARN-level messages like:
{"type": "server", "timestamp": "2022-01-14T16:36:43,913Z", "level": "WARN", "component": "o.e.a.b.TransportShardBulkAction", "cluster.name": "es12", "node.name": "es12-d-f1ba7b3d91e22c6ca475c477.FQDN", "message": "[[ping-2022.01.14][1]] failed to perform indices:data/write/bulk[s] on replica [ping-2022.01.14][1], node[b7VHYPmtQk6RTrRLkjCTKA], [R], s[STARTED], a[id=FdgBwVVjR_iYmDRL2wLZPA]", "cluster.uuid": "eOem6Y73TvmHw8EtD-IBtw", "node.id": "dkm7td-dR3eCSGtJaBAkhQ" ,
"stacktrace": ["org.elasticsearch.transport.NodeDisconnectedException: [es12-d-d542251dda0867a36e266268.FQDN][10.48.183.179:9300][indices:data/write/bulk[s][r]] disconnected"] }
There are no ERROR-level entries, and no warnings for any other type of exception. All of these warnings name the same remote data node, es12-d-d542251dda0867a36e266268, and all have 2022-01-14T16:36:43 timestamps. Now I have a new suspect for the problem.
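To confirm that every disconnect warning points at the same peer, I tallied the node named in each NodeDisconnectedException. This is a quick sketch assuming the logs are newline-delimited JSON like the excerpt above; the sample line here is abbreviated to just the fields used:

```python
import json
import re
from collections import Counter

# One of the WARN lines from above, abbreviated to the fields used here.
log_lines = [
    '{"level": "WARN", "component": "o.e.a.b.TransportShardBulkAction", '
    '"stacktrace": ["org.elasticsearch.transport.NodeDisconnectedException: '
    '[es12-d-d542251dda0867a36e266268.FQDN][10.48.183.179:9300]'
    '[indices:data/write/bulk[s][r]] disconnected"]}'
]

# The disconnected peer is the first bracketed token after the exception name.
peer = re.compile(r"NodeDisconnectedException: \[([^\]]+)\]")

counts = Counter()
for line in log_lines:
    entry = json.loads(line)
    for frame in entry.get("stacktrace", []):
        m = peer.search(frame)
        if m:
            counts[m.group(1)] += 1

for node, n in counts.most_common():
    print(node, n)
```

Run over the full log, this only ever prints es12-d-d542251dda0867a36e266268.FQDN, which is what made me switch suspects.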
Looking at system metrics there, I see the heap usage started to go wild at about 8:36 Pacific time (= 16:36 UTC), a little before the start of the snapshot that still hasn't completed:
The bright yellow jagged line is heap used by es12-d-d542251dda0867a36e266268, the turquoise line is es12-d-f1ba7b3d91e22c6ca475c477, the green line is another data node for comparison, and the three master nodes are near the bottom. There are threshold lines for the data and master nodes, too.
Looking in the logs on es12-d-d542251dda0867a36e266268, I see a pair of garbage-collection warnings right when the heap goes wild, then an exception (named in the topic title) for which I can find no information:
{"type": "server", "timestamp": "2022-01-14T16:36:45,340Z", "level": "WARN", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "es12", "node.name": "es12-d-d542251dda0867a36e266268.FQDN", "message": "[gc][old][2559366][255519] duration [34.3s], collections [2]/[34.8s], total [34.3s]/[4.3h], memory [22gb]->[19.6gb]/[24gb], all_pools {[young] [494.4mb]->[4.5mb]/[1.1gb]}{[survivor] [149.7mb]->[0b]/[149.7mb]}{[old] [21.3gb]->[19.6gb]/[22.7gb]}", "cluster.uuid": "eOem6Y73TvmHw8EtD-IBtw", "node.id": "b7VHYPmtQk6RTrRLkjCTKA" }
{"type": "server", "timestamp": "2022-01-14T16:36:45,340Z", "level": "WARN", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "es12", "node.name": "es12-d-d542251dda0867a36e266268.FQDN", "message": "[gc][2559366] overhead, spent [34.4s] collecting in the last [34.8s]", "cluster.uuid": "eOem6Y73TvmHw8EtD-IBtw", "node.id": "b7VHYPmtQk6RTrRLkjCTKA" }
{"type": "server", "timestamp": "2022-01-14T16:36:45,358Z", "level": "WARN", "component": "o.e.a.b.TransportShardBulkAction", "cluster.name": "es12", "node.name": "es12-d-d542251dda0867a36e266268.FQDN", "message": "[[samples-2022.01.14][2]] failed to perform indices:data/write/bulk[s] on replica [samples-2022.01.14][2], node[nE9cShvvT3aHrmyx1uhcuw], [R], s[STARTED], a[id=n08RP453TWOXR87KeEle-g]", "cluster.uuid": "eOem6Y73TvmHw8EtD-IBtw", "node.id": "b7VHYPmtQk6RTrRLkjCTKA" ,
"stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [es12-d-81986adb7ccc8d1a1ca093a2.FQDN][10.48.124.56:9300][indices:data/write/bulk[s][r]]",
"Caused by: java.lang.IllegalStateException: [samples-2022.01.14][2] operation primary term [1] is too old (current [2])",
"... 140 + more lines of stack"] }
More of that log, with stacktraces and FQDN edits, is in this gist. The messages about losing and finding a master node after that exception strike me as curious.
I'd like to understand what this all means.