Elasticsearch 7.16.1 crashing randomly

Hi,
after upgrading to Elasticsearch 7.16.1 (Debian package from elastic.co),
we're seeing serious errors and crashes on some clusters:

Instances are running with Debian Buster (10)

[2021-12-14T00:00:06,036][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [es-cluster1] fatal error in thread [elasticsearch[es-cluster1][clusterApplierService#updateTask][T#1]], exiting
java.lang.NoClassDefFoundError: org/elasticsearch/common/xcontent/XContent
        at org.elasticsearch.xpack.ilm.PolicyStepsRegistry.update(PolicyStepsRegistry.java:133) ~[?:?]
        at org.elasticsearch.xpack.ilm.IndexLifecycleService.applyClusterState(IndexLifecycleService.java:323) ~[?:?]
        at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:553) ~[elasticsearch-7.16.1.jar:7.16.1]
        at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:540) ~[elasticsearch-7.16.1.jar:7.16.1]
        at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:503) ~[elasticsearch-7.16.1.jar:7.16.1]
        at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:428) ~[elasticsearch-7.16.1.jar:7.16.1]
        at org.elasticsearch.cluster.service.ClusterApplierService.access$000(ClusterApplierService.java:56) ~[elasticsearch-7.16.1.jar:7.16.1]
        at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:154) ~[elasticsearch-7.16.1.jar:7.16.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) ~[elasticsearch-7.16.1.jar:7.16.1]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:262) ~[elasticsearch-7.16.1.jar:7.16.1]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:225) ~[elasticsearch-7.16.1.jar:7.16.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.lang.ClassNotFoundException: org.elasticsearch.common.xcontent.XContent
        at java.net.URLClassLoader.findClass(URLClassLoader.java:476) ~[?:?]
        at java.lang.ClassLoader.loadClass(ClassLoader.java:589) ~[?:?]
        at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:904) ~[?:?]
        at java.lang.ClassLoader.loadClass(ClassLoader.java:522) ~[?:?]
        ... 14 more
[2021-12-14T00:00:03,700][ERROR][o.e.ExceptionsHelper     ] [es-cluster1] fatal error
        at org.elasticsearch.ExceptionsHelper.lambda$maybeDieOnAnotherThread$4(ExceptionsHelper.java:287)
        at java.base/java.util.Optional.ifPresent(Optional.java:183)
        at org.elasticsearch.ExceptionsHelper.maybeDieOnAnotherThread(ExceptionsHelper.java:277)
        at org.elasticsearch.http.netty4.Netty4HttpRequestHandler.exceptionCaught(Netty4HttpRequestHandler.java:43)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:302)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:381)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at org.elasticsearch.http.netty4.Netty4HttpPipeliningHandler.channelRead(Netty4HttpPipeliningHandler.java:48)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.handler.codec.MessageToMessageCodec.channelRead(MessageToMessageCodec.java:111)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:620)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:583)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at java.base/java.lang.Thread.run(Thread.java:829)

Java:

java -version
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment Temurin-11.0.13+8 (build 11.0.13+8)
OpenJDK 64-Bit Server VM Temurin-11.0.13+8 (build 11.0.13+8, mixed mode)

Previous version: ES 7.14.1

The ES-nodes are often crashing as a pair, without any further interaction or load.

Can someone enlighten me as to what's happening here?

Any chance you have a custom plugin that was compiled against an older version? The org.elasticsearch.common.xcontent.XContent class was moved to org.elasticsearch.xcontent.XContent in 7.16.0, in "Fix split package org.elasticsearch.common.xcontent" by ChrisHegarty (Pull Request #78831, elastic/elasticsearch on GitHub).

Hi @Keith_Massey ,
there are no plugins at all in these clusters.

These are "pure Elastic.co" Debian images :wink:
The configuration itself is also not special, security is enabled, that is all.

Funny / not funny: clusters/nodes with only marginal load tend to break earlier.

I wonder what is the output for the following request?

curl -X GET "localhost:9200/_nodes/plugins?pretty=true"  | grep \"version\"
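
For a long response, the filter above can be collapsed into a count per distinct value, which makes a stray older version easy to spot. A minimal sketch, assuming the pretty-printed output keeps each "version" key on its own line as in the request above; the function name summarize_versions is made up for illustration:

```shell
#!/bin/sh
# Collapse the "version" lines of a pretty-printed _nodes/plugins response
# into one line per distinct value, followed by its number of occurrences.
# Reads the response on stdin, so no running node is needed to try it out.
# Keys such as "elasticsearch_version" or "java_version" are not matched,
# because the pattern requires the opening quote directly before "version".
summarize_versions() {
    grep -o '"version" *: *"[^"]*"' \
        | sed 's/.*: *"\([^"]*\)"/\1/' \
        | sort \
        | uniq -c \
        | awk '{ print $2 " " $1 }'
}

# Usage against a node (host and port are assumptions):
#   curl -s -X GET "localhost:9200/_nodes/plugins?pretty=true" | summarize_versions
```

If every module and plugin reports the same version, the output is a single line; more than one line would point to mixed versions on that cluster.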

Actually, I have the same problem with Logstash crashing after upgrading to 7.16.1.
The weird part is that if I run exactly the same command from the command line that systemd runs (on Ubuntu), it works, and the whole pipeline works. Started from systemd (systemctl start logstash), it crashes every 10-12 seconds.

/usr/share/logstash/bin/logstash --log.level=debug --path.settings /etc/logstash

If I run the plugins command, I get a very long list of:
"version" : "7.16.1",
"elasticsearch_version" : "7.16.1",
"java_version" : "1.8",
"version" : "7.16.1",
"elasticsearch_version" : "7.16.1",
"java_version" : "1.8",
"version" : "7.16.1",
"elasticsearch_version" : "7.16.1",
"java_version" : "1.8",
"version" : "7.16.1",
"elasticsearch_version" : "7.16.1",
"java_version" : "1.8",
"version" : "7.16.1",
"elasticsearch_version" : "7.16.1",
"java_version" : "1.8",
"version" : "7.16.1",
....

Regards,

Michael

Hey, I work together with Paul. The output for your request only shows the following for all our clusters:

"version" : "7.16.1"

Kind Regards,
Jan

There was still some Kibana 7.14.1 running (when ES broke), but with Kibana 7.16.1 it currently looks more stable...
We'll keep you posted.

Scenario:
ELK-Stack with 10 nodes:

  • 3 dedicated master nodes
  • 3 data-nodes (hot)
  • 3 data-nodes (warm)
  • 1 no-data/no-master node (for Kibana)

or:
Elasticsearch cluster with 3 nodes:

  • 3 master/data nodes

Steps:

  1. Everything is running fine, all is green.
  2. One Elasticsearch instance is restarted (e.g. systemctl restart elasticsearch).
  3. Another Elasticsearch instance crashes immediately.

What is going on...

... digging in the log-messages turned out:

If you restart the master (in a 3-node cluster), other nodes will also crash.

^ this is not repeatable in the ECK-setup, only on "Debian"-package configurations.

We have built some test cases, and it turns out that:

The existing clusters[1] are vulnerable when the master node is restarted, which causes one or more (master) nodes to crash.

A fresh cluster[2] is completely immune to these tests.
This holds with or without the exact data (of the existing cluster), independent of "our JDK" or the Elastic JDK, and independent of how the ES version was installed (fresh install vs. upgraded from 7.14.1).

I fear that something in our cluster state and/or node state is triggering some issue in the master election.

We can't really prove it or pin it down, but we will now start migrating these clusters "away" to ECK.

[1] Clusters running for ~1 year, continuously updated with ES-versions + Java-versions over the time
[2] Cluster setup from scratch today, no older dependencies, versions or configuration applied

I don't think there's a way for anything in your data path to cause this exception. The symptoms you report indicate that your Elasticsearch installation is a strange mix of binaries from different versions. Since there's no extra plugins involved it suggests to me that something has gone wrong in the upgrade processes that you've performed, and the fact that fresh installations work properly is consistent with this guess too. It might be due to a packaging bug or something strange about your specific environment, I don't have any great ideas for how to determine which.

Could you run sudo ls -lR in the location in which Elasticsearch is installed and share the output here? The location in question will contain bin/elasticsearch and lib/elasticsearch-7.16.1.jar (or whatever version you're using). Sorry I don't know where the .deb installers put this.

If you don't want to debug and would rather just fix things, I'd suggest you reinstall Elasticsearch on these nodes making sure that any remnants of old installations are removed first. There's no need to remove the data, just the old installations.
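
One quick way to look for such a mix on disk is to list the version suffixes of Elasticsearch's own jars. This is only a sketch: it assumes those jars are named like elasticsearch-7.16.1.jar or x-pack-ilm-7.16.1.jar, the function name es_jar_versions is hypothetical, and ES_HOME in the usage comment stands for wherever the package put the installation.

```shell
#!/bin/sh
# Read jar paths on stdin and print each distinct x.y.z version suffix found
# on Elasticsearch's own jars, followed by the number of jars carrying it.
# Assumption: those jars start with "elasticsearch" or "x-pack"; third-party
# jars (Lucene, Netty, ...) have their own versions and are skipped.
es_jar_versions() {
    sed 's|.*/||' \
        | grep -E '^(elasticsearch|x-pack)[A-Za-z0-9-]*-[0-9]+\.[0-9]+\.[0-9]+\.jar$' \
        | sed -E 's/^.*-([0-9]+\.[0-9]+\.[0-9]+)\.jar$/\1/' \
        | sort \
        | uniq -c \
        | awk '{ print $2 " " $1 }'
}

# Usage (ES_HOME is a placeholder for the installation directory):
#   find "$ES_HOME" -name '*.jar' | es_jar_versions
```

A clean installation should produce a single line. Note that this only looks at file names; it cannot tell whether a jar with the right name has the wrong content.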

Sure.

In Debian it is /usr/share/elasticsearch.
The output is found here: https://gist.github.com/ppuschmann/a485484d959a9ffb2095d0d02384049e

I also did this before:

cd /tmp
apt download elasticsearch
dpkg -x elasticsearch_7.16.1_amd64.deb .
diff -qr /tmp/usr/share/elasticsearch/ /usr/share/elasticsearch/

Result: no diff

I'm trying to reproduce the issues with one of our remaining clusters and will then try to purge and reinstall the packages (creating a snapshot of the current state first).

@DavidTurner We've been running into the same issue when upgrading an Elasticsearch 7.10.2 node to Elasticsearch 7.16.1 in one of our test environments. In this case it's not even a cluster but only a single node. We're also using the official Elasticsearch DEB packages.

We're using the Java High-Level REST Client to talk to the Elasticsearch node, so I wouldn't expect anything from the client communications triggering the error (which I would if we were using the old transport client/binary transport).

I can trigger the error (see full stack trace below) with curl running an Update By Query with a small Painless script (1 line) to update documents.

I've set the log level of the Elasticsearch node to DEBUG in the hope of getting more details about what's causing the issue, but I wasn't lucky enough to catch anything.

Complete stack trace:

[2021-12-17T13:34:06,581][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [example.com] fatal error in thread [Thread-21], exiting
java.lang.NoClassDefFoundError: org/elasticsearch/common/xcontent/XContent
        at org.elasticsearch.reindex.TransportUpdateByQueryAction.lambda$doExecute$0(TransportUpdateByQueryAction.java:82) ~[?:?]
        at org.elasticsearch.reindex.BulkByScrollParallelizationHelper.executeSlicedAction(BulkByScrollParallelizationHelper.java:95) ~[?:?]
        at org.elasticsearch.reindex.BulkByScrollParallelizationHelper.lambda$startSlicedAction$0(BulkByScrollParallelizationHelper.java:69) ~[?:?]
        at org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:219) ~[elasticsearch-7.16.1.jar:7.16.1]
        at org.elasticsearch.reindex.BulkByScrollParallelizationHelper.initTaskState(BulkByScrollParallelizationHelper.java:125) ~[?:?]
        at org.elasticsearch.reindex.BulkByScrollParallelizationHelper.startSlicedAction(BulkByScrollParallelizationHelper.java:65) ~[?:?]
        at org.elasticsearch.reindex.TransportUpdateByQueryAction.doExecute(TransportUpdateByQueryAction.java:73) ~[?:?]
        at org.elasticsearch.reindex.TransportUpdateByQueryAction.doExecute(TransportUpdateByQueryAction.java:42) ~[?:?]
        at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:179) ~[elasticsearch-7.16.1.jar:7.16.1]
        at org.elasticsearch.action.support.ActionFilter$Simple.apply(ActionFilter.java:53) ~[elasticsearch-7.16.1.jar:7.16.1]
        at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:177) ~[elasticsearch-7.16.1.jar:7.16.1]
        at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.apply(SecurityActionFilter.java:145) ~[?:?]
        at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:177) ~[elasticsearch-7.16.1.jar:7.16.1]
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:154) ~[elasticsearch-7.16.1.jar:7.16.1]
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:82) ~[elasticsearch-7.16.1.jar:7.16.1]
        at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:95) ~[elasticsearch-7.16.1.jar:7.16.1]
        at org.elasticsearch.reindex.AbstractBaseReindexRestHandler.lambda$doPrepareRequest$0(AbstractBaseReindexRestHandler.java:52) ~[?:?]
        at org.elasticsearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:109) ~[elasticsearch-7.16.1.jar:7.16.1]
        at org.elasticsearch.xpack.security.rest.SecurityRestFilter.handleRequest(SecurityRestFilter.java:105) ~[?:?]
        at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:327) ~[elasticsearch-7.16.1.jar:7.16.1]
        at org.elasticsearch.rest.RestController.tryAllHandlers(RestController.java:393) ~[elasticsearch-7.16.1.jar:7.16.1]
        at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:245) ~[elasticsearch-7.16.1.jar:7.16.1]
        at org.elasticsearch.http.AbstractHttpServerTransport.dispatchRequest(AbstractHttpServerTransport.java:382) ~[elasticsearch-7.16.1.jar:7.16.1]
        at org.elasticsearch.http.AbstractHttpServerTransport.handleIncomingRequest(AbstractHttpServerTransport.java:461) ~[elasticsearch-7.16.1.jar:7.16.1]
        at org.elasticsearch.http.AbstractHttpServerTransport.incomingRequest(AbstractHttpServerTransport.java:357) ~[elasticsearch-7.16.1.jar:7.16.1]
        at org.elasticsearch.http.netty4.Netty4HttpRequestHandler.channelRead0(Netty4HttpRequestHandler.java:32) ~[?:?]
        at org.elasticsearch.http.netty4.Netty4HttpRequestHandler.channelRead0(Netty4HttpRequestHandler.java:18) ~[?:?]
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
        at org.elasticsearch.http.netty4.Netty4HttpPipeliningHandler.channelRead(Netty4HttpPipeliningHandler.java:48) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[?:?]
        at io.netty.handler.codec.MessageToMessageCodec.channelRead(MessageToMessageCodec.java:111) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324) ~[?:?]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
        at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:620) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:583) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[?:?]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: java.lang.ClassNotFoundException: org.elasticsearch.common.xcontent.XContent
        at jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581) ~[?:?]
        at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178) ~[?:?]
        at java.lang.ClassLoader.loadClass(ClassLoader.java:522) ~[?:?]
        ... 77 more

Verification of installed DEB package:

a44e37b7f66c99078cba5d87b86559f7cab4e9775197920f1add52b8b84602429d5066d427f8b97e39efe3edae7a10b2adb872c2c239b80168c238babab4a002  elasticsearch-7.16.1-amd64.deb
a44e37b7f66c99078cba5d87b86559f7cab4e9775197920f1add52b8b84602429d5066d427f8b97e39efe3edae7a10b2adb872c2c239b80168c238babab4a002  /var/cache/apt/archives/elasticsearch_7.16.1_amd64.deb
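
The same comparison can be scripted so that it yields an exit status instead of two digests to compare by eye. A small sketch, assuming GNU coreutils' sha512sum is available; same_sha512 is a made-up helper name:

```shell
#!/bin/sh
# Succeed (exit status 0) only if both files exist and have the same
# SHA-512 digest. A missing or unreadable file yields an empty digest
# and therefore a non-zero status.
same_sha512() {
    a=$(sha512sum "$1" | awk '{ print $1 }')
    b=$(sha512sum "$2" | awk '{ print $1 }')
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage with the two files compared above:
#   same_sha512 elasticsearch-7.16.1-amd64.deb \
#       /var/cache/apt/archives/elasticsearch_7.16.1_amd64.deb && echo identical
```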

Can you turn on verbose classloading? One way to do that is to start Elasticsearch with something like:
ES_JAVA_OPTS="-verbose" bin/elasticsearch
That way we can find out where the class is being loaded from that is trying to load XContent.

Thanks for the hint, Keith!

I'm afraid I "broke" the reproducer by playing around too much with the data. It doesn't fail anymore right now but I'll make sure to add the -verbose flag once it happens again on another node.

@Keith_Massey I still had a cluster left and activated ES_JAVA_OPTS="-verbose" on the nodes.

Example of a process:

/usr/lib/jvm/adoptopenjdk-java11-jdk-amd64/bin/java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Dlog4j2.formatMsgNoLookups=true -Djava.locale.providers=SPI,COMPAT --add-opens=java.base/java.io=ALL-UNNAMED -Xms4g -Xmx4g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/tmp -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/elasticsearch -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m -XX:-UseConcMarkSweepGC -XX:-UseCMSInitiatingOccupancyOnly -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -verbose -XX:MaxDirectMemorySize=2147483648 -XX:G1HeapRegionSize=4m -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=default -Des.distribution.type=deb -Des.bundled_jdk=true -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet

The cluster was running fine and stable after introducing -verbose.

  1. I truncated the logs and then restarted the master ( es-find-2 at this time).
  2. This resulted in a crash of es-find-1 (and es-find-3?).
  3. Shortly after es-find-1 and es-find-3 crashed, I restarted these ES-nodes.

Edit: link to the verbose logs removed; these were useless anyway.

Hi @ppuschmann -- the output of -verbose doesn't go to the log file. It only goes to stdout. If you started the process from the command line you'll see the verbose logs there in your command line console.
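
Once stdout is captured, the class-loading lines can be narrowed down to the class of interest. A rough sketch, assuming the JDK 11 unified-logging form "[...][class,load] <class> source: <location>"; the helper name class_sources is hypothetical, and the exact decorations in front of the class name may differ between JVMs:

```shell
#!/bin/sh
# Print "<class> <source>" for every class-load line on stdin whose class
# name contains the substring given as $1. The class name is taken as the
# last word before " source: ", so the decorations at the start of the line
# do not matter.
class_sources() {
    awk -v pat="$1" '
        /class,load/ {
            pos = index($0, " source: ")
            if (pos == 0) next
            n = split(substr($0, 1, pos - 1), parts, " ")
            name = parts[n]
            if (index(name, pat) > 0)
                print name " " substr($0, pos + 9)
        }'
}

# Usage, with the class name taken from the stack trace in this thread:
#   ES_JAVA_OPTS="-verbose" bin/elasticsearch | class_sources PolicyStepsRegistry
```

The source printed for PolicyStepsRegistry would show which jar the class that asks for XContent was actually loaded from.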

Ouch. Thx. This of course explained why I didn't spot anything in the logs :wink:

I'll try this out.

Ah, wonderful, I can't reproduce it anymore.

So what did I do?

  1. systemctl edit elasticsearch.service --> add this to the systemd-unit: StandardOutput=file:/var/log/elasticsearch/stdout.log
  2. Restarted all instances at once (it is a test-cluster, so this was fine)
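
For reference, step 1 can also be done non-interactively by writing the drop-in file directly. A minimal sketch: the helper name write_stdout_dropin is made up, the drop-in directory in the usage comment is the conventional location for overrides of this unit and should be treated as an assumption, and the systemd version in use has to support the file: form of StandardOutput.

```shell
#!/bin/sh
# Write a systemd drop-in that redirects a unit's stdout to a file.
# $1: drop-in directory for the unit, $2: target log file.
write_stdout_dropin() {
    mkdir -p "$1" || return 1
    cat > "$1/override.conf" <<EOF
[Service]
StandardOutput=file:$2
EOF
}

# Usage (as root; the directory is an assumption, see above):
#   write_stdout_dropin /etc/systemd/system/elasticsearch.service.d \
#       /var/log/elasticsearch/stdout.log
#   systemctl daemon-reload
```

The redirect only takes effect the next time the service is started, so a single node can be restarted on its own instead of the whole cluster.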

Unfortunately due to 2.) I can't reproduce the error anymore, I guess.

So, I'm very sorry, my story ends here. :frowning:

Maybe a full cluster restart clears up whatever was going on?

@Keith_Massey I found some logs:

This is another cluster with 3 master(-only) nodes and some data-nodes.

When I restarted the master node, the two other master-eligible nodes crashed one after the other with identical logs.

fatal error in thread [elasticsearch[logindexer-3][clusterApplierService#updateTask][T#1]], exiting
java.lang.NoClassDefFoundError: org/elasticsearch/common/xcontent/XContent
	at org.elasticsearch.xpack.ilm.PolicyStepsRegistry.update(PolicyStepsRegistry.java:133)
	at org.elasticsearch.xpack.ilm.IndexLifecycleService.applyClusterState(IndexLifecycleService.java:323)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:553)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:540)
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:503)
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:428)
	at org.elasticsearch.cluster.service.ClusterApplierService.access$000(ClusterApplierService.java:56)
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:154)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:262)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:225)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: org.elasticsearch.common.xcontent.XContent
	at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
	at java.base/java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:904)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
	... 14 more
ERROR: Elasticsearch did not exit normally - check the logs at /var/log/elasticsearch/elkng-gce-int.log

Is this helpful?