StackOverflowError in ES

On our six-node Elasticsearch 1.3.0 cluster, we observed a StackOverflowError on one of the nodes. The system had been under moderate bulk indexing load.

Clients were receiving a node disconnected exception for this node, and drilling further into the logs revealed the exception shown below.

We had to restart the Elasticsearch service, but the cluster failed to recover to a green state. At least two indices were stuck in the initializing or relocating state.

[2015-07-09 08:59:54,214][WARN ][org.elasticsearch        ] Exception cause unwrapping ran for 10 levels...
org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-6-QA2906-perf][inet[/172.31.44.76:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-3-QA2906-perf][inet[/172.31.34.173:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-6-QA2906-perf][inet[/172.31.44.76:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-3-QA2906-perf][inet[/172.31.34.173:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-6-QA2906-perf][inet[/172.31.44.76:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-3-QA2906-perf][inet[/172.31.34.173:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-6-QA2906-perf][inet[/172.31.44.76:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-3-QA2906-perf][inet[/172.31.34.173:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-6-QA2906-perf][inet[/172.31.44.76:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-3-QA2906-perf][inet[/172.31.34.173:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-6-QA2906-perf][inet[/172.31.44.76:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-3-QA2906-perf][inet[/172.31.34.173:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-3-QA2906-perf][inet[/172.31.34.173:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-6-QA2906-perf][inet[/172.31.44.76:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-3-QA2906-perf][inet[/172.31.34.173:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-6-QA2906-perf][inet[/172.31.44.76:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-3-QA2906-perf][inet[/172.31.34.173:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-6-QA2906-perf][inet[/172.31.44.76:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-3-QA2906-perf][inet[/172.31.34.173:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-6-QA2906-perf][inet[/172.31.44.76:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-3-QA2906-perf][inet[/172.31.34.173:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-6-QA2906-perf][inet[/172.31.44.76:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-3-QA2906-perf][inet[/172.31.34.173:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-6-QA2906-perf][inet[/172.31.44.76:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-3-QA2906-perf][inet[/172.31.34.173:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-6-QA2906-perf][inet[/172.31.44.76:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-3-QA2906-perf][inet[/172.31.34.173:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-6-QA2906-perf][inet[/172.31.44.76:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-3-QA2906-perf][inet[/172.31.34.173:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: [metrics-datastore-6-QA2906-perf][inet[/172.31.44.76:9300]][bulk/shard]
Caused by: org.elasticsearch.transport.RemoteTransportException: Failed to deserialize exception response from stream
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize exception response from stream
        at org.elasticsearch.transport.netty.MessageChannelHandler.handlerResponseError(MessageChannelHandler.java:173)
        at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:125)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
        at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
        at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
        at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
        at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
        at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.StackOverflowError

The log mentions serialisation problems, which can be caused by mismatched JVM or client versions, so it might be best to check those first.

JVM and client version mismatches can be ruled out; both server and client are on 1.3.0. The systems had been running for quite some time, and the exception showed up only after sustained load.

Could this be an issue with ES? What is it serializing/deserializing?
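For context on how a cause chain like the one above can end in a StackOverflowError: walking a deeply nested Throwable chain with plain recursion blows the stack once the chain is deep enough. The sketch below is hypothetical (it is not Elasticsearch's actual transport or serialization code, and `CauseChainDemo`/`depthOf` are made-up names); it only demonstrates that failure mode.

```java
// Hypothetical sketch, not Elasticsearch code: shows that recursively
// walking a deeply nested exception cause chain overflows the stack,
// the same failure mode a recursive exception serializer would hit.
public class CauseChainDemo {

    // Recursive depth count over getCause(), analogous to recursively
    // serializing each wrapped cause in turn.
    static int depthOf(Throwable t) {
        return (t.getCause() == null) ? 1 : 1 + depthOf(t.getCause());
    }

    public static void main(String[] args) {
        // Build an absurdly deep cause chain, like the alternating
        // RemoteTransportExceptions in the log above.
        Throwable chain = new RuntimeException("level 0");
        for (int i = 1; i < 200_000; i++) {
            chain = new RuntimeException(chain);
        }
        try {
            System.out.println("depth = " + depthOf(chain));
        } catch (StackOverflowError e) {
            // With typical default JVM stack sizes, the recursion
            // overflows long before reaching the end of the chain.
            System.out.println("StackOverflowError while walking the cause chain");
        }
    }
}
```

An iterative loop over `getCause()` (or capping the unwrap depth, as the "ran for 10 levels" warning suggests Elasticsearch does elsewhere) avoids the problem.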

I had a similar exception to yours; see https://github.com/elastic/elasticsearch/issues/4639. It does not happen anymore, and I am not even sure how to reproduce it, but I hope it gives additional information to pin down the source of the problem.

Thanks @Jason_Wee. Will update that ticket to provide more info.