Recovery Failure and JVM crash on Solaris/SPARC

Hello,

I'd like to replace one of our cluster nodes because the machine can't handle
the load properly. Currently it's ES 0.19.11 on two Intel machines. We have
a T5220 SPARC that does nothing useful, so I configured an ES master node
on it. The indices come from logstash-1.1.6-dev and graylog2 0.10.0rc1
(both with default settings: 5 shards, 1 replica, no templates or such).

I can start the ES node on the SPARC machine fine; it joins the cluster and
sits there doing nothing. But as soon as I shut down the to-be-replaced node
and recovery starts, I get an exception and the JVM crashes. The JVM version
is: Java HotSpot(TM) 64-Bit Server VM (build 23.5-b02, mixed mode), SPARC.

The cluster node's logfile shows:

[2012-12-19 10:49:08,300][INFO ][cluster.service ] [es-pheucd01]
removed {[es-phbuild02][hD05TiWaQISW6EjpjV_vfA][inet[/10.215.9.9:9300]],},
reason: zen-disco-receive(from master [[es-phewu01][bf8yaQ3CS6GQ4a4TDxQ8Uw
[inet[/10.215.9.10:9300]]])
[2012-12-19 10:50:11,610][WARN ][transport.netty ] [es-pheucd01]
Message not fully read (response) for [10716] handler
future(org.elasticsearch.indices.recovery.RecoveryTarget$3@722c174f), error
[true], resetting
[2012-12-19 10:50:11,644][WARN ][indices.cluster ] [es-pheucd01]
[logstash-weblogic-2012.12.13][4] failed to start shard
org.elasticsearch.indices.recovery.RecoveryFailedException:
[logstash-weblogic-2012.12.13][4]: Recovery failed from
[es-phewu01][bf8yaQ3CS6GQ4a4TDxQ8Uw][inet[/10.215.9.10:9300]] into
[es-pheucd01][vrpQaaQ3TEq8zB8YS2Ff0A][inet[/10.215.9.31:9300]]
    at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:293)
    at org.elasticsearch.indices.recovery.RecoveryTarget.access$100(RecoveryTarget.java:64)
    at org.elasticsearch.indices.recovery.RecoveryTarget$2.run(RecoveryTarget.java:183)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
Caused by: org.elasticsearch.transport.RemoteTransportException: Failed to deserialize exception response from stream
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize exception response from stream
    at org.elasticsearch.transport.netty.MessageChannelHandler.handlerResponseError(MessageChannelHandler.java:171)
    at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:125)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:75)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:565)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:793)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:458)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:439)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:75)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:565)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:84)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.processSelectedKeys(AbstractNioWorker.java:471)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:332)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:35)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:102)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.StreamCorruptedException: unexpected end of block data
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1369)

Then the JVM crashes.
From hs_err_pid..:

Stack: [0xfffffff0ac500000,0xfffffff0ac580000], sp=0xfffffff0ac57dda0, free space=503k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
J org.apache.lucene.index.IndexFileNames.segmentFileName(Ljava/lang/String;Ljava/lang/String;)Ljava/lang/String;
j org.apache.lucene.index.TermVectorsTermsWriter.abort()V+62
j org.apache.lucene.index.TermsHash.abort()V+4
j org.apache.lucene.index.TermsHash.abort()V+31
j org.apache.lucene.index.DocInverter.abort()V+4
j org.apache.lucene.index.DocFieldProcessor.abort()V+24
j org.apache.lucene.index.DocumentsWriter.abort()V+173
j org.apache.lucene.index.IndexWriter.rollbackInternal()V+170
j org.apache.lucene.index.IndexWriter.rollback()V+12
j org.elasticsearch.index.engine.robin.RobinEngine.innerClose()V+65
j org.elasticsearch.index.engine.robin.RobinEngine.close()V+15
j org.elasticsearch.index.service.InternalIndexService.deleteShard(IZZZLjava/lang/String;)V+362
j org.elasticsearch.index.service.InternalIndexService.removeShard(ILjava/lang/String;)V+6
j org.elasticsearch.indices.cluster.IndicesClusterStateService.handleRecoveryFailure(Lorg/elasticsearch/index/service/IndexService;Lorg/elasticsearch/cluster/routing/ShardRouting;ZLjava/lang/Throwable;)V+109
j org.elasticsearch.indices.cluster.IndicesClusterStateService.access$300(Lorg/elasticsearch/indices/cluster/IndicesClusterStateService;Lorg/elasticsearch/index/service/IndexService;Lorg/elasticsearch/cluster/routing/ShardRouting;ZLjava/lang/Throwable;)V+6
j org.elasticsearch.indices.cluster.IndicesClusterStateService$PeerRecoveryListener.onRecoveryFailure(Lorg/elasticsearch/indices/recovery/RecoveryFailedException;Z)V+14
j org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(Lorg/elasticsearch/index/shard/service/InternalIndexShard;Lorg/elasticsearch/indices/recovery/StartRecoveryRequest;ZLorg/elasticsearch/indices/recovery/RecoveryTarget$RecoveryListener;)V+947
j org.elasticsearch.indices.recovery.RecoveryTarget.access$100(Lorg/elasticsearch/indices/recovery/RecoveryTarget;Lorg/elasticsearch/index/shard/service/InternalIndexShard;Lorg/elasticsearch/indices/recovery/StartRecoveryRequest;ZLorg/elasticsearch/indices/recovery/RecoveryTarget$RecoveryListener;)V+6
j org.elasticsearch.indices.recovery.RecoveryTarget$2.run()V+20
J java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
j java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
j java.lang.Thread.run()V+11
v ~StubRoutines::call_stub
V [libjvm.so+0x21dcec] void JavaCalls::call_helper(JavaValue*,methodHandle*,JavaCallArguments*,Thread*)+0x37c
V [libjvm.so+0x74fd04] void JavaCalls::call_virtual(JavaValue*,Handle,KlassHandle,Symbol*,Symbol*,Thread*)+0x1ac
V [libjvm.so+0x2d351c] void thread_entry(JavaThread*,Thread*)+0x15c
V [libjvm.so+0xb720c8] void JavaThread::thread_main_inner()+0x88
V [libjvm.so+0x2cedc4] void JavaThread::run()+0x3a4
V [libjvm.so+0xa45ccc] java_start+0x364

Can anybody see the cause of this? I'd be glad to provide more info and the
full logfile/crash report if that helps.

Best regards,
Thomas

--

On 19.12.2012 14:27, Thomas Kuther wrote:

[...]
Caused by: org.elasticsearch.transport.RemoteTransportException: Failed to deserialize exception response from stream

The deserialization exception above made me think for a bit, until it
suddenly hit me: the two nodes on the Intel machines run JDK 1.6.0_25, and I
set up the Solaris node with JDK 1.7.0_09.

I took the whole cluster down and restarted the Intel nodes with JDK
1.7.0_09 as well. Now the Solaris node works just fine and is currently
recovering the shards.

So, mixing CPU architectures and JDK major versions is a bad idea...
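
For anyone who wants to see the failure mode in isolation: judging from the
stack trace, the exception response is read back with
java.io.ObjectInputStream, i.e. plain Java serialization. Below is a minimal
stand-alone sketch of that round trip; the class name and message are made
up, and this is not the actual ES transport code.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Hypothetical sketch of an exception being serialized on one node and
// deserialized on another via plain Java object serialization.
public class ExceptionRoundTrip {
    public static void main(String[] args) throws Exception {
        Throwable original = new IllegalStateException("shard recovery failed");

        // "Sending" node: write the exception to bytes.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(original);
        out.close();

        // "Receiving" node: read it back. Within one JVM this succeeds; if
        // the reader runs a different JDK than the writer, the serialized
        // form may not match and readObject() can fail with
        // InvalidClassException or StreamCorruptedException instead.
        ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
        Throwable copy = (Throwable) in.readObject();
        in.close();

        System.out.println("Deserialized: " + copy);
    }
}

The round trip itself works anywhere; the point is that both ends have to
agree on the serialized form, which is easiest to guarantee by running the
same JDK everywhere.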

Regards,
Tom

--

Hi Thomas,

You encountered two different effects:

  • The deserialization exception "java.io.StreamCorruptedException:
    unexpected end of block data" can be thrown when nodes running different
    JVM versions exchange serialized Java objects (such as exceptions) over
    the network. Unfortunately this transport does not always work well, due
    to subtle serialization incompatibilities. Use a single JVM version on
    all cluster nodes and this issue should go away.

  • The other effect is the JVM crash. A JVM does not crash because of a
    serialization exception. HotSpot 23.5-b02 is Java 1.7.0_09. Can you check
    which JVM parameters are active on that node (one way to list them is
    sketched below)? Note that if you use flags like -XX:+AggressiveOpts, you
    may have enabled experimental code in the HotSpot optimizer, which is
    probably not your intention.
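
If it helps, here is a tiny stand-alone sketch (the class name is made up,
this is not ES code) that prints the JVM name/version and the startup
arguments via the standard RuntimeMXBean; run it with the same java binary
and options as the ES node to see what is actually in effect:

import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;

// Hypothetical helper: prints the JVM name/version and the arguments the
// JVM was started with, so -XX options like -XX:+AggressiveOpts are easy
// to spot on each machine.
public class PrintJvmArgs {
    public static void main(String[] args) {
        RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
        System.out.println("JVM: " + runtime.getVmName() + " " + runtime.getVmVersion());
        for (String arg : runtime.getInputArguments()) {
            System.out.println("  " + arg);
        }
    }
}

Running "java -XX:+PrintFlagsFinal -version" with the same options should
also dump the full set of effective HotSpot flags.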

It is possible you will encounter more issues with mixed Intel and SPARC
JVMs that are not related to Elasticsearch. I haven't had the courage to try
such an exotic cluster yet, but I'm curious whether ES can run stably on
mixed JVMs, because I also have Linux/Intel and Solaris/SPARC machines
available here.

Best regards,

Jörg

--