Hi
We have run a simple test of shard rebalancing.
Start-state:
One index with 3 shards (1 replica)
Two nodes running (each with Node1, Node2 and Node3 in its unicast list):
Node1 running primary of shard1, primary of shard2 and replica of shard3
Node2 running primary of shard3, replica of shard1 and replica of shard2
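For completeness, here is roughly the kind of setup used above. This is only a sketch: the hostnames (node1/node2/node3) and the index name ("myindex") are placeholders, and exact setting names may differ slightly between ES versions.
------------- setup sketch ----------------------------
# elasticsearch.yml on each node: disable multicast and list all three
# nodes explicitly in the unicast list, as described above
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["node1:9300", "node2:9300", "node3:9300"]

# create the index with 3 shards and 1 replica
curl -XPUT 'http://node1:9200/myindex' -d '{
  "settings" : {
    "number_of_shards" : 3,
    "number_of_replicas" : 1
  }
}'
--------------------------------------------------------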
Action:
We start a new node (Node3, also with Node1, Node2 and Node3 in its unicast list) that joins the cluster.
End-state (after rebalancing has finished):
Three nodes:
Node1 running primary of shard1 and primary of shard2
Node2 running primary of shard3
Node3 running replica of shard1, replica of shard2 and replica of shard3
Basically ALL replicas have been moved to the new node.
Again (as in https://groups.google.com/group/elasticsearch/browse_thread/thread/232fdc4e560d41d)
we think this is a very strange rebalancing decision for ES to make. But this time there were even bigger problems.
We did another action:
Stopped the new node (Node3) again.
Now rebalancing of the replicas back to the remaining nodes (Node1 and Node2) starts. After a while the exception shown below occurs on one of the remaining nodes, and afterwards the index is corrupted. No matter what we do (restart etc.), the cluster will not "accept" the index again. We never get "contact with" the index again and the data can be considered lost - this would be very bad in production.
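In case it helps with diagnosis, this is the kind of call we can use to see what the cluster thinks happened, i.e. where each primary/replica ended up and whether the index has gone red (again just a sketch, node1 is a placeholder hostname):
------------- checking allocation ----------------------------
# overall cluster status (green/yellow/red), with per-index detail
curl -XGET 'http://node1:9200/_cluster/health?level=indices&pretty=true'

# full cluster state, including the routing table that shows which
# node holds each primary and replica shard
curl -XGET 'http://node1:9200/_cluster/state?pretty=true'
--------------------------------------------------------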
I notice the OutOfMemoryError, but that really shouldn't happen, and even if it does happen, it shouldn't corrupt the index/data for good.
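We can of course give the nodes a bigger heap and capture a dump the next time it happens - standard JVM options along these lines (how they are passed depends on how the nodes are started) - but that only works around the symptom and does not explain why an OOM corrupts the index:
------------- jvm options sketch ----------------------------
# fixed-size, larger heap per node, plus a heap dump on OOM for analysis
-Xms2g -Xmx2g
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/elasticsearch
--------------------------------------------------------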
Any ideas about what to do? Solutions? Comments?
Regards, Per Steffensen
------------- exception ----------------------------
[2011-10-14 09:43:52,113][WARN ][transport.netty ] [Sybil Dorn] Exception caught on netty layer [[id: 0x5dc433a2, /192.168.88.240:60385 => /192.168.88.241:9300]]
java.lang.OutOfMemoryError: Java heap space
[2011-10-14 09:43:52,114][WARN ][transport.netty ] [Sybil Dorn] Exception caught on netty layer [[id: 0x5dc433a2, /192.168.88.240:60385 => /192.168.88.241:9300]]
java.io.StreamCorruptedException: invalid data length: 0
    at org.elasticsearch.transport.netty.SizeHeaderFrameDecoder.decode(SizeHeaderFrameDecoder.java:42)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:282)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:216)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:783)
    at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:65)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:274)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:261)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:349)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:280)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:200)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:44)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)