Hi,
I've encountered multiple issues after configuring shard allocation
awareness and doing a full cluster restart. I've seen some of these issues
before, so I'm not sure whether they're related to the configuration change
or to the restart itself. This is going to be a long post, so please bear
with me.
I started out with 5 servers running 0.18.5 and tried to add one additional
server in a different zone. The servers are not very busy or heavily
loaded, so a single server can hold the data of the other 5. The 5 original
servers are connected over a fast LAN and were using multicast discovery,
while the new server is connected over a WAN and uses unicast. I also
configured forced awareness to make sure the data got replicated the way I
wanted it to; the relevant parts of my configuration are sketched below.
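For reference, this is roughly what the awareness and discovery related
parts of my elasticsearch.yml looked like after I switched everything to
unicast. It's reconstructed from memory rather than pasted verbatim, and
the second zone name and the full host list are placeholders:

# On the five LAN nodes (the WAN node uses its own zone value instead of ny4):
node.data_center: ny4

# Allocation awareness on the data_center attribute, forced across both zones
# ("remote" stands in for the real name of the second zone):
cluster.routing.allocation.awareness.attributes: data_center
cluster.routing.allocation.awareness.force.data_center.values: ny4,remote

# Unicast discovery with multicast disabled:
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.1.101.151", "10.1.101.152"]   # plus the remaining nodes, omitted here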
I then restarted the cluster servers one by one for the configuration
changes to take effect. After the restart a lot of shards remained
unallocated, even after a long wait. I tried various things, but here is a
summary of the issues:
- Shards which are available on disk do not get loaded; the error message
claims they are not there.
- Some of the original servers didn't know about the new server and said it
wasn't part of the cluster (this was mostly resolved after switching to a
full unicast configuration).
- When the master node was taken down, the other servers didn't elect a new
master even after a very long wait; they just kept saying it was
unreachable.
- When the master was brought up again, some servers would not connect to
it anymore.
- When trying to recover (using the local gateway), servers would try to
connect to other servers and fail, usually with connection-interruption
errors.
- Although I have configured expected_nodes and recover_after_nodes, the
cluster would still wait the full recover_after_time before starting
recovery, even when all (or at least most) of the nodes were up (see the
settings sketched right after this list).
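For completeness, the gateway settings from that last point looked roughly
like this. The exact values are from memory; the expected node count would
be the five LAN nodes plus the new remote one:

gateway.recover_after_nodes: 5
gateway.recover_after_time: 5m
gateway.expected_nodes: 6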
I tried for hours to get this to work, but the cluster state remained red
with many unloaded shards. Eventually I gave up, wiped the entire cluster,
and re-indexed the data. The next day I tried adding another remote server,
in the same zone as the one added the day before, and had exactly the same
problems. This time I also tried upgrading to 0.19.3, but that didn't
really improve anything; in fact, it exposed some new exceptions that were
not present before. Again, after trying quite a lot to get it to work, I
gave up, wiped out the entire cluster, and re-indexed the data.
Unfortunately I don't have the logs to back up all of the issues above,
because there seems to be a problem with log rolling: instead of creating a
new daily file, Elasticsearch just overwrites the existing one, so all I
have is what I could scrape from open console sessions.
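As far as I can tell, the file appender in my logging.yml is still the
stock daily rolling one, roughly as below (layout section omitted), so I
would have expected it to roll to a new file each day instead of
overwriting:

file:
  type: dailyRollingFile
  file: ${path.logs}/${cluster.name}.log
  datePattern: "'.'yyyy-MM-dd"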
The following exceptions are not in any particular order; they are just
random things I picked up from my consoles, but all of them happened many,
many times:
Index out of range exception:
[2012-05-11 05:07:58,092][WARN ][cluster.action.shard ] [es1-04] sending failed shard for [6136561_xs][0], node[JYRwNCxDSESuFFQ5jvvSEg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[6136561_xs][0] failed to recover shard]; nested: StringIndexOutOfBoundsException[String index out of range: 0]; ]]
[2012-05-11 05:07:58,602][WARN ][indices.cluster ] [es1-04] [6136561_xs][0] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [6136561_xs][0] failed to recover shard
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:204)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:177)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 0
    at java.lang.String.charAt(Unknown Source)
    at org.elasticsearch.index.mapper.MapperService.add(MapperService.java:180)
    at org.elasticsearch.index.mapper.MapperService.add(MapperService.java:172)
    at org.elasticsearch.index.mapper.MapperService.documentMapperWithAutoCreate(MapperService.java:298)
    at org.elasticsearch.index.shard.service.InternalIndexShard.prepareIndex(InternalIndexShard.java:310)
    at org.elasticsearch.index.shard.service.InternalIndexShard.performRecoveryOperation(InternalIndexShard.java:624)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:199)
    ... 4 more
[2012-05-11 05:07:59,065][WARN ][cluster.action.shard ] [es1-04] sending failed shard for [6136561_xs][0], node[JYRwNCxDSESuFFQ5jvvSEg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[6136561_xs][0] failed to recover shard]; nested: StringIndexOutOfBoundsException[String index out of range: 0]; ]]
Master was shut down but no other master was re-elected:
[2012-05-11 05:07:59,143][INFO ][discovery.zen ] [es1-04] master_left [[es1-01][LTA8l-oURImdrLeGo4HNDQ][inet[/10.1.101.151:9300]]{data_center=ny4}], reason [failed to ping, tried [5] times, each with maximum [5s] timeout]
[2012-05-11 05:08:01,513][INFO ][discovery.zen ] [es1-04] master_left [[es1-01][LTA8l-oURImdrLeGo4HNDQ][inet[/10.1.101.151:9300]]{data_center=ny4}], reason [failed to ping, tried [5] times, each with maximum [5s] timeout]
Claiming the shard doesn't exist locally but I've verified that it does (at
least such a directory exists and it has what seems like legitimate
content):
[2012-05-11 05:07:31,359][WARN ][indices.cluster ] [es1-02] [6090102_m][0] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [6090102_m][0] shard allocated for local recovery (post api), should exists, but doesn't
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:108)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:177)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Can't get node stats from one of the nodes (it is reported as not connected):
[2012-05-11 05:08:14,426][DEBUG][action.admin.cluster.node.stats] [es1-05] failed to execute on node [1zK5Dj6wT6OqNtwDJpJP8Q]
org.elasticsearch.transport.SendRequestTransportException: [es1-02][inet[/10.1.101.152:9300]][cluster/nodes/stats/n]
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:200)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.start(TransportNodesOperationAction.java:172)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$300(TransportNodesOperationAction.java:102)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction.doExecute(TransportNodesOperationAction.java:73)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction.doExecute(TransportNodesOperationAction.java:43)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:61)
    at org.elasticsearch.client.node.NodeClusterAdminClient.execute(NodeClusterAdminClient.java:70)
    at org.elasticsearch.client.support.AbstractClusterAdminClient.nodesStats(AbstractClusterAdminClient.java:150)
    at org.elasticsearch.rest.action.admin.cluster.node.stats.RestNodesStatsAction.executeNodeStats(RestNodesStatsAction.java:130)
    at org.elasticsearch.rest.action.admin.cluster.node.stats.RestNodesStatsAction.handleRequest(RestNodesStatsAction.java:125)
    at org.elasticsearch.rest.RestController.executeHandler(RestController.java:159)
    at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:142)
    at org.elasticsearch.http.HttpServer.internalDispatchRequest(HttpServer.java:120)
    at org.elasticsearch.http.HttpServer$Dispatcher.dispatchRequest(HttpServer.java:82)
    at org.elasticsearch.http.netty.NettyHttpServerTransport.dispatchRequest(NettyHttpServerTransport.java:259)
    at org.elasticsearch.http.netty.HttpRequestHandler.messageReceived(HttpRequestHandler.java:43)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:75)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:792)
    at org.elasticsearch.common.netty.handler.codec.http.HttpChunkAggregator.messageReceived(HttpChunkAggregator.java:111)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:75)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:792)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
    at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.unfoldAndFireMessageReceived(ReplayingDecoder.java:567)
    at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:551)
    at org.elasticsearch.common.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:445)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:75)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:792)
    at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:94)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.processSelectedKeys(AbstractNioWorker.java:364)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:238)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:38)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [es1-02][inet[/10.1.101.152:9300]] Node not connected
    at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:637)
    at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:445)
    at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:185)
    ... 42 more
Another flavor of the "sending failed shard" problem (here without the
exception details):
[2012-05-11 05:07:31,471][WARN ][cluster.action.shard ] [es1-02] sending failed shard for [6090092_m][3], node[1zK5Dj6wT6OqNtwDJpJP8Q], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[6090092_m][3] failed to recover shard]; nested: StringIndexOutOfBoundsException[String index out of range: 0]; ]]
[2012-05-11 05:07:31,472][WARN ][cluster.action.shard ] [es1-02] sending failed shard for [6134711_xs][0], node[1zK5Dj6wT6OqNtwDJpJP8Q], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[6134711_xs][0] failed to recover shard]; nested: StringIndexOutOfBoundsException[String index out of range: 0]; ]]
-eran