Initial recovery failing with "Unknown alias name was passed to alias Filter"

Hello,

I'm playing with the ES clustering features right now, as we intend to deploy the search engine sometime soon. I'm using unicast discovery to configure my cluster nodes with some help from Chef (currently I have 2 nodes). In the gateway settings, I'm also setting gateway.expected_nodes = 2. That's pretty much the only configuration I'm doing. I'm also locking the memory with bootstrap.mlockall.
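
Roughly, the relevant bits of the configuration look like this (the cluster name and host names below are just placeholders, not the real values):

  cluster.name: my-cluster
  bootstrap.mlockall: true

  # unicast discovery, node list generated by Chef
  discovery.zen.ping.multicast.enabled: false
  discovery.zen.ping.unicast.hosts: ["node1.example.com", "node2.example.com"]

  # wait for both nodes before starting local gateway recovery
  gateway.expected_nodes: 2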

I've managed to start the whole cluster from an empty database and reindex my data into it correctly without any issue. Works as advertised. I then tried to shut down one of the nodes, delete its data and start it again.

After some time, it fails to start with a never-ending error [1]. I've not yet set up a shared gateway, as I thought it was only there to speed up recovery (and to ensure indices are stored safely in S3, for example), but I wonder if the issue I see here is related?

Any hint?
Cheers,

Pierre Baillet
Server Shepherd @ http://www.fotopedia.com

1: 2011-08-03_07:37:45.37053 [2011-08-03 07:37:45,368][WARN ][indices.cluster ] [Sluggo] [picor][0] failed to start shard
2011-08-03_07:37:45.37057 org.elasticsearch.index.shard.recovery.RecoveryFailedException: Index Shard [picor][0]: Recovery failed from [Reilly, Ben][1OvwxQYOSiCherwlLvReOQ][inet[/------:9300]] into [Sluggo][NrNDzEgkQQaB1HKvMqa16g][inet[/------:9300]]
2011-08-03_07:37:45.37058 at org.elasticsearch.index.shard.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:243)
2011-08-03_07:37:45.37061 at org.elasticsearch.index.shard.recovery.RecoveryTarget.access$100(RecoveryTarget.java:61)
2011-08-03_07:37:45.37064 at org.elasticsearch.index.shard.recovery.RecoveryTarget$2.run(RecoveryTarget.java:146)
2011-08-03_07:37:45.37065 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
2011-08-03_07:37:45.37065 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
2011-08-03_07:37:45.37066 at java.lang.Thread.run(Thread.java:636)
2011-08-03_07:37:45.37066 Caused by: org.elasticsearch.transport.RemoteTransportException: [Reilly, Ben][inet[/10.203.38.176:9300]][index/shard/recovery/startRecovery]
2011-08-03_07:37:45.37067 Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [picor][0] Phase[2] Execution failed
2011-08-03_07:37:45.37069 at org.elasticsearch.index.engine.robin.RobinEngine.recover(RobinEngine.java:1030)
2011-08-03_07:37:45.37070 at org.elasticsearch.index.shard.service.InternalIndexShard.recover(InternalIndexShard.java:420)
2011-08-03_07:37:45.37070 at org.elasticsearch.index.shard.recovery.RecoverySource.recover(RecoverySource.java:110)
2011-08-03_07:37:45.37071 at org.elasticsearch.index.shard.recovery.RecoverySource.access$1600(RecoverySource.java:60)
2011-08-03_07:37:45.37071 at org.elasticsearch.index.shard.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:296)
2011-08-03_07:37:45.37074 at org.elasticsearch.index.shard.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:285)
2011-08-03_07:37:45.37075 at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:238)
2011-08-03_07:37:45.37075 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
2011-08-03_07:37:45.37076 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
2011-08-03_07:37:45.37077 at java.lang.Thread.run(Thread.java:636)
2011-08-03_07:37:45.37077 Caused by: org.elasticsearch.transport.RemoteTransportException: [Sluggo][inet[/10.83.58.138:9300]][index/shard/recovery/translogOps]
2011-08-03_07:37:45.37079 Caused by: org.elasticsearch.indices.InvalidAliasNameException: [picor] Invalid alias name [users], Unknown alias name was passed to alias Filter
2011-08-03_07:37:45.37080 at org.elasticsearch.index.aliases.IndexAliasesService.aliasFilter(IndexAliasesService.java:87)
2011-08-03_07:37:45.37080 at org.elasticsearch.index.shard.service.InternalIndexShard.innerDeleteByQuery(InternalIndexShard.java:337)
2011-08-03_07:37:45.37081 at org.elasticsearch.index.shard.service.InternalIndexShard.performRecoveryOperation(InternalIndexShard.java:516)
2011-08-03_07:37:45.37082 at org.elasticsearch.index.shard.recovery.RecoveryTarget$TranslogOperationsRequestHandler.messageReceived(RecoveryTarget.java:340)
2011-08-03_07:37:45.37082 at org.elasticsearch.index.shard.recovery.RecoveryTarget$TranslogOperationsRequestHandler.messageReceived(RecoveryTarget.java:326)
2011-08-03_07:37:45.37083 at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:238)
2011-08-03_07:37:45.37084 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
2011-08-03_07:37:45.37085 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
2011-08-03_07:37:45.37086 at java.lang.Thread.run(Thread.java:636)

I can see where the problem is coming from; I opened a bug: "Delete By Query wrongly persisted to translog" (elastic/elasticsearch issue #1198 on GitHub).

Great, thanks for the faster-than-anything fix :)

--
Pierre Baillet
Server Shepherd @ http://www.fotopedia.com

AH! That's exactly what we've been seeing but haven't been able to track down. Thanks for the fix. Now, do you have any idea how we can get from a running 0.17.2-SNAPSHOT instance to, I guess, a 0.17.3-SNAPSHOT instance without losing the index? We've definitely done lots of Delete By Queries on the currently running 0.17.2-SNAPSHOT cluster. -- Dan

You will need to delete the transaction logs for each shard, which means you will lose the relevant data. I will release 0.17.3 either today or tomorrow to address this and the other bugs that have been fixed, and make a note that you should flush before upgrading to it so you won't hit this.
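
For example, flushing all indices through the flush API before the upgrade would look roughly like this (assuming a node listening on the default HTTP port 9200):

  curl -XPOST 'http://localhost:9200/_flush'

Flushing commits the in-memory operations to the Lucene index and clears the translog, so there are no delete-by-query entries left to replay on the next recovery.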

Beautiful!! It worked great and we're running on a 0.17.3-SNAPSHOT build without issue. Thanks a ton. And if you're ever in Washington, DC I owe you a couple drinks. -- Dan