"operation primary term [1] is too old (current [2])"

I'm novice with Elasticsearch, looking at some issues today. The company subject matter expert is in a different timezone where it is well-past business hours on a Friday night. This doesn't seem critical, so I'm not going to page him.

Starting a couple of hours ago the hourly snapshots of this cluster (es12, version 7.6.1 with 3 masters and 12 data nodes) started having issues. They normally complete in ~5 minutes, but the one from 5 hours ago still has not completed.

Looking at system metrics I see one node, named es12-d-f1ba7b3d91e22c6ca475c477 (one of the data nodes in cluster es12), has continuous high network out since the start of the snapshot. So I initially suspected that node as having issues. Looking in the logs there I see WARN level messages in the log like:

{"type": "server", "timestamp": "2022-01-14T16:36:43,913Z", "level": "WARN", "component": "o.e.a.b.TransportShardBulkAction", "cluster.name": "es12", "node.name": "es12-d-f1ba7b3d91e22c6ca475c477.FQDN", "message": "[[ping-2022.01.14][1]] failed to perform indices:data/write/bulk[s] on replica [ping-2022.01.14][1], node[b7VHYPmtQk6RTrRLkjCTKA], [R], s[STARTED], a[id=FdgBwVVjR_iYmDRL2wLZPA]", "cluster.uuid": "eOem6Y73TvmHw8EtD-IBtw", "node.id": "dkm7td-dR3eCSGtJaBAkhQ" ,
"stacktrace": ["org.elasticsearch.transport.NodeDisconnectedException: [es12-d-d542251dda0867a36e266268.FQDN][10.48.183.179:9300][indices:data/write/bulk[s][r]] disconnected"] }

There are no error level entries, and no warnings for other types of exception. All of these warnings are for the same other data node, es12-d-d542251dda0867a36e266268 and all have 2022-01-14T16:36:43 timestamps. Now I have a new suspect for the problem.

Looking at system metrics there I see the heap usage started to go wild about 8:36 Pacific time (= 16:36 UTC), a little prior to the snapshot which hasn't yet completed:

heap

The bright yellow jagged line is heap used by es12-d-d542251dda0867a36e266268, the turquoise line is es12-d-f1ba7b3d91e22c6ca475c477, the green line is another data node for comparison, and three master nodes are near the bottom. There are threshold lines for data and master nodes, too.

Looking in the logs on es12-d-d542251dda0867a36e266268 I see that there are a pair of garbage collection warnings right when that heap goes wild, then an exception (named in topic title) for which I can find no information:

{"type": "server", "timestamp": "2022-01-14T16:36:45,340Z", "level": "WARN", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "es12", "node.name": "es12-d-d542251dda0867a36e266268.FQDN", "message": "[gc][old][2559366][255519] duration [34.3s], collections [2]/[34.8s], total [34.3s]/[4.3h], memory [22gb]->[19.6gb]/[24gb], all_pools {[young] [494.4mb]->[4.5mb]/[1.1gb]}{[survivor] [149.7mb]->[0b]/[149.7mb]}{[old] [21.3gb]->[19.6gb]/[22.7gb]}", "cluster.uuid": "eOem6Y73TvmHw8EtD-IBtw", "node.id": "b7VHYPmtQk6RTrRLkjCTKA"  }
{"type": "server", "timestamp": "2022-01-14T16:36:45,340Z", "level": "WARN", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "es12", "node.name": "es12-d-d542251dda0867a36e266268.FQDN", "message": "[gc][2559366] overhead, spent [34.4s] collecting in the last [34.8s]", "cluster.uuid": "eOem6Y73TvmHw8EtD-IBtw", "node.id": "b7VHYPmtQk6RTrRLkjCTKA"  }
{"type": "server", "timestamp": "2022-01-14T16:36:45,358Z", "level": "WARN", "component": "o.e.a.b.TransportShardBulkAction", "cluster.name": "es12", "node.name": "es12-d-d542251dda0867a36e266268.FQDN", "message": "[[samples-2022.01.14][2]] failed to perform indices:data/write/bulk[s] on replica [samples-2022.01.14][2], node[nE9cShvvT3aHrmyx1uhcuw], [R], s[STARTED], a[id=n08RP453TWOXR87KeEle-g]", "cluster.uuid": "eOem6Y73TvmHw8EtD-IBtw", "node.id": "b7VHYPmtQk6RTrRLkjCTKA" , 
"stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [es12-d-81986adb7ccc8d1a1ca093a2.FQDN][10.48.124.56:9300][indices:data/write/bulk[s][r]]",
"Caused by: java.lang.IllegalStateException: [samples-2022.01.14][2] operation primary term [1] is too old (current [2])",
"... 140 + more lines of stack"] }

More of that log, with stacktraces and FQDN edits, in this gist. The messages about losing and finding a master node after that exception strike me as curious.

I'd like to understand what this all means.

The operation primary term [1] is too old (current [2]) message is a consequence of the node holding the primary shard leaving the cluster and a replica shard being promoted to primary in its place.

A ≥30s GC pause would cause the master to consider a node unhealthy and make it leave the cluster.

The main question is why did it have a ≥30s GC pause. Perhaps this is snapshot-related, but your steady-state heap usage looks very high anyway so maybe the snapshot just pushed it over the edge. In any case you are running a very old version, long past EOL, and there have been significant improvements to memory usage in the ~2 years since 7.6 was released, so an upgrade is overdue and strongly recommended.

3 Likes

The hourly snapshots aren't bumping heap anywhere else. But I can look for other things that may have caused it. This is one of about 30 clusters we have, and most of them are newer. There will be an update for this one, but not in the near future.

Thanks for the explanation.