Loss of Connection between Nodes

Every once in a while, one of my master nodes loses connection with the
other (primary and slave) nodes. When I execute 'curl -XGET
localhost:9200/_nodes', the cluster just hangs and I get no response
(cluster health reports that everything is "green" though).
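
A bounded version of those two checks makes the hang observable instead of blocking the shell. This is a sketch; the host and port are the defaults and an assumption:

```shell
ES="${ES:-localhost:9200}"   # assumed host:port; adjust for your cluster

# _cluster/health is answered by the node you hit, so it can still say "green":
curl -s --max-time 5 "http://$ES/_cluster/health?pretty"

# _nodes fans out to every node over the transport layer (port 9300), so one
# broken node-to-node connection can make it hang; bound the wait with a timeout:
curl -s --max-time 10 "http://$ES/_nodes?pretty" || echo "_nodes timed out or failed"
```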

I found this in my log files during one of the errors today:
[2013-03-07 00:44:38,588][WARN ][transport.netty          ] [elasticsearch-server-3] exception caught on transport layer [[id: 0x30a053f6, /10.30.141.74:37560 => /10.151.17.197:9300]], closing connection
java.io.StreamCorruptedException: invalid internal transport message format
	at org.elasticsearch.transport.netty.SizeHeaderFrameDecoder.decode(SizeHeaderFrameDecoder.java:27)
	at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425)
	at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:310)
	at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:787)
	at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:555)
	at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
	at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
	at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
	at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:107)
	at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:313)
	at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:88)
	at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
	at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
	at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:722)

Previous posts on the forums suggest that this error is caused by a version mismatch between nodes, but that's not the case for me. All nodes run 0.20.2 and work just fine 99% of the time.
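
One quick way to double-check the uniform-version claim is to ask each node directly. The addresses below are placeholders (taken from the IPs in the log above); the root endpoint reports the responding node's name and version:

```shell
for host in 10.30.141.74 10.151.17.197; do   # placeholder addresses; substitute your nodes
  # fetch the root endpoint; skip the node if it's unreachable
  json="$(curl -s --max-time 5 "http://$host:9200/")" || { echo "$host: unreachable"; continue; }
  # pull out the node name and version number from the JSON response
  printf '%s' "$json" | python3 -c 'import sys, json; d = json.load(sys.stdin); print(d.get("name"), d["version"]["number"])'
done
```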

Any suggestions/ideas would be much appreciated.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Also found this from a few days ago when the same error had occurred:

[2013-03-03 19:55:13,183][DEBUG][action.admin.cluster.node.stats] [elasticsearch-server-3] failed to execute on node [k9Dq3NsbQCq-RPmh4FS-2w]
org.elasticsearch.transport.RemoteTransportException: Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.node.stats.NodeStats]
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize response of type [org.elasticsearch.action.admin.cluster.node.stats.NodeStats]
	at org.elasticsearch.transport.netty.MessageChannelHandler.handleResponse(MessageChannelHandler.java:150)
	at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:127)
	at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:787)
	at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
	at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
	at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
	at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
	at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:555)
	at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
	at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
	at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
	at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:107)
	at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:313)
	at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:88)
	at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
	at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
	at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.IndexOutOfBoundsException: Readable byte limit exceeded: 300
	at org.elasticsearch.common.netty.buffer.AbstractChannelBuffer.readByte(AbstractChannelBuffer.java:236)
	at org.elasticsearch.transport.netty.ChannelBufferStreamInput.readByte(ChannelBufferStreamInput.java:121)
	at org.elasticsearch.common.io.stream.StreamInput.readInt(StreamInput.java:99)
	at org.elasticsearch.common.io.stream.StreamInput.readLong(StreamInput.java:130)
	at org.elasticsearch.common.io.stream.AdapterStreamInput.readLong(AdapterStreamInput.java:93)
	at org.elasticsearch.monitor.os.OsStats.readFrom(OsStats.java:204)
	at org.elasticsearch.monitor.os.OsStats.readOsStats(OsStats.java:193)
	at org.elasticsearch.action.admin.cluster.node.stats.NodeStats.readFrom(NodeStats.java:263)
	at org.elasticsearch.transport.netty.MessageChannelHandler.handleResponse(MessageChannelHandler.java:148)
	... 23 more
    

On Wednesday, March 6, 2013 6:59:31 PM UTC-8, Govind Chandrasekhar wrote:


Over the weekend, things went up a notch: this particular cluster became
unresponsive nearly every two hours (earlier, it was roughly once a week). I've
added more nodes, stopped all indexing jobs (the cluster is under almost no
load now), and done a full optimize, but the issue persists.

Did a tcpdump of port 9300 on the two faulty servers, and packets seem to be
flowing just fine (I see them coming in and leaving), so it's probably not
a network issue. I can only assume the packets are being corrupted
somehow? One other possible cause is that some of the servers run
openjdk-1.7.0_07 and some run openjdk-1.7.0_09; this is unlikely to be the
issue, though, since these exceptions happen even between nodes of the same
Java version.

I'm out of ideas. Any help would be really appreciated!

On Wednesday, March 6, 2013 7:21:30 PM UTC-8, Govind Chandrasekhar wrote:



Hey, can you provide some more info, like which version, etc.?

simon

On Monday, March 11, 2013 9:48:09 PM UTC+1, govind201 wrote:


Yup, all nodes run 0.20.2, as mentioned above. There are 9 nodes: 7 are
17GB-RAM machines (running openjdk-1.7.0_09) and the other two are 8GB-RAM
machines (openjdk-1.7.0_07). Each data point has 2 replicas, and I've used
shard allocation awareness
(http://www.elasticsearch.org/guide/reference/modules/cluster.html), so
each "rack" has 3 nodes and 1 copy of all the data. I'm running the cluster
on Amazon EC2, so I use ec2-discovery. Transport occurs over port 9300.

I've been running this configuration for several months now; these
problems are very recent.
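
For reference, the rack setup described above comes down to two settings in elasticsearch.yml; the attribute name "rack_id" and the rack value here are illustrative, not the actual config:

```yaml
# per-node tag identifying which rack the node sits in (name/value illustrative)
node.rack_id: rack_one
# tell the allocator to spread copies of each shard across distinct rack_id values
cluster.routing.allocation.awareness.attributes: rack_id
```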

On Monday, March 11, 2013 1:54:37 PM UTC-7, simonw wrote:

hey can you provide some more infos, like which version etc.?

simon

On Monday, March 11, 2013 9:48:09 PM UTC+1, govind201 wrote:

Over the weekend, things went up a notch - this particular cluster became
unresponsive nearly every 2 hours (earlier, it was once a week-ish). I've
added more nodes, stopped all indexing jobs (the cluster is under almost no
load now), done an full optimize, but the issue continues to persist.

Did a tcpdump of port 9300 on the two faulty servers and packets seem to
be flowing just fine (I see them coming in and leaving), so it's probably
not a network issue. I can only assume that the packets are being corrupted
somehow? One other possible cause is that some of the servers run
openjdk-1.7.0_07 and some run openjdk-1.7.0_09; this is unlikely to be the
issue though, since these exceptions happen even between nodes of the same
java version.

I'm lost of ideas. Any help would be really appreciated!

On Wednesday, March 6, 2013 7:21:30 PM UTC-8, Govind Chandrasekhar wrote:

Also found this from a few days ago when the same error had occurred:

[2013-03-03 19:55:13,183][DEBUG][action.admin.cluster.node.stats]
[elasticsearch-server-3] failed to execute on node [k9Dq3NsbQCq-RPmh4FS-2w]
org.elasticsearch.transport.RemoteTransportException: Failed to deserialize
response of type
[org.elasticsearch.action.admin.cluster.node.stats.NodeStats]
Caused by: org.elasticsearch.transport.TransportSerializationException:
Failed to deserialize response of type
[org.elasticsearch.action.admin.cluster.node.stats.NodeStats]
    at org.elasticsearch.transport.netty.MessageChannelHandler.handleResponse(MessageChannelHandler.java:150)
    at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:127)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:787)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:555)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:107)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:313)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:88)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.IndexOutOfBoundsException: Readable byte limit
exceeded: 300
    at org.elasticsearch.common.netty.buffer.AbstractChannelBuffer.readByte(AbstractChannelBuffer.java:236)
    at org.elasticsearch.transport.netty.ChannelBufferStreamInput.readByte(ChannelBufferStreamInput.java:121)
    at org.elasticsearch.common.io.stream.StreamInput.readInt(StreamInput.java:99)
    at org.elasticsearch.common.io.stream.StreamInput.readLong(StreamInput.java:130)
    at org.elasticsearch.common.io.stream.AdapterStreamInput.readLong(AdapterStreamInput.java:93)
    at org.elasticsearch.monitor.os.OsStats.readFrom(OsStats.java:204)
    at org.elasticsearch.monitor.os.OsStats.readOsStats(OsStats.java:193)
    at org.elasticsearch.action.admin.cluster.node.stats.NodeStats.readFrom(NodeStats.java:263)
    at org.elasticsearch.transport.netty.MessageChannelHandler.handleResponse(MessageChannelHandler.java:148)
    ... 23 more
    

On Wednesday, March 6, 2013 6:59:31 PM UTC-8, Govind Chandrasekhar wrote:

Every once in a while, one of my master nodes loses connection with the
other (primary and slave) nodes. When I execute 'curl -XGET
localhost:9200/_nodes', the cluster just hangs and I get no response
(cluster health reports that everything is "green" though).

I found this in my log files during one of the errors today:
[2013-03-07 00:44:38,588][WARN ][transport.netty]
[elasticsearch-server-3] exception caught on transport layer [[id:
0x30a053f6, /10.30.141.74:37560 => /10.151.17.197:9300]], closing
connection
java.io.StreamCorruptedException: invalid internal transport message
format
    at org.elasticsearch.transport.netty.SizeHeaderFrameDecoder.decode(SizeHeaderFrameDecoder.java:27)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:310)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:787)
    at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:560)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:555)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:107)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:313)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:88)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
    

Previous posts on the forums suggest that this error is caused by a
version mismatch between nodes, but that's not the case for me. All
nodes run 0.20.2 and function just great 99% of the time.

Any suggestions/ideas would be much appreciated.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.