Problems upgrading to 1.5.0

We just encountered some mysterious problems when upgrading from 1.1.1 to
1.5.0.

The cluster consists of three machines: two data nodes and one master-only
node. It hosts 86 indices, each of which has one replica.

I stopped writes, took a snapshot, and stopped the entire cluster before
upgrading the nodes and restarting them.
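
For reference, that prep maps to roughly these calls (the repository and
snapshot names below are just placeholders for whatever is registered):

  # Commit in-memory segments and the translog to disk before snapshotting
  curl -XPOST 'http://localhost:9200/_flush'

  # Snapshot all indices and wait until the snapshot finishes
  curl -XPUT 'http://localhost:9200/_snapshot/my_backup/pre_1_5_0_upgrade?wait_for_completion=true'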

The system came up and quickly turned yellow, but it refused to become
green: it failed to recover a number of shards.
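
The stuck state is easy to see with the cluster health and cat APIs, e.g.:

  # Overall health; stays "yellow" while replicas are unassigned
  curl 'http://localhost:9200/_cluster/health?pretty'

  # List the shards that never reached STARTED
  curl 'http://localhost:9200/_cat/shards' | grep -v STARTED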

The errors I got in the logs looked like this (there were a lot of them):
[2015-03-31 07:33:39,704][WARN ][indices.cluster ] [NODE1] [signal_bin][0] sending failed shard after recovery failure
org.elasticsearch.indices.recovery.RecoveryFailedException: [signal_bin][0]: Recovery failed from [NODE2][rpXLVgS8Qw2jgimXNYKn_A][NODE2][inet[/IP2:9300]]{aws_availability_zone=us-east-1d, max_local_storage_nodes=1} into [NODE1][tdXdf0MeS62DIO0KFZX-Rg][NODE1][inet[/IP1:9300]]{aws_availability_zone=us-east-1b, max_local_storage_nodes=1}
    at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:274)
    at org.elasticsearch.indices.recovery.RecoveryTarget.access$700(RecoveryTarget.java:69)
    at org.elasticsearch.indices.recovery.RecoveryTarget$RecoveryRunner.doRun(RecoveryTarget.java:550)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: org.elasticsearch.transport.RemoteTransportException: [NODE2][inet[/IP2:9300]][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [signal_bin][0] Phase[1] Execution failed
    at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:839)
    at org.elasticsearch.index.shard.IndexShard.recover(IndexShard.java:684)
    at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:125)
    at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:49)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:146)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:132)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: [signal_bin][0] Failed to transfer [11] files with total size of [1.4mb]
    at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:413)
    at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:834)
    ... 10 more
Caused by: org.elasticsearch.transport.RemoteTransportException: [NODE1][inet[/IP1:9300]][internal:index/shard/recovery/clean_files]
Caused by: org.elasticsearch.indices.recovery.RecoveryFailedException: [signal_bin][0]: Recovery failed from [NODE2][rpXLVgS8Qw2jgimXNYKn_A][NODE2][inet[/IP2:9300]]{aws_availability_zone=us-east-1d, max_local_storage_nodes=1} into [NODE1][tdXdf0MeS62DIO0KFZX-Rg][NODE1][inet[/IP1:9300]]{aws_availability_zone=us-east-1b, max_local_storage_nodes=1} (failed to clean after recovery)
    at org.elasticsearch.indices.recovery.RecoveryTarget$CleanFilesRequestHandler.messageReceived(RecoveryTarget.java:443)
    at org.elasticsearch.indices.recovery.RecoveryTarget$CleanFilesRequestHandler.messageReceived(RecoveryTarget.java:389)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: org.elasticsearch.ElasticsearchIllegalStateException: local version: name [_yor.si], length [363], checksum [1jnqbzx], writtenBy [null] is different from remote version after recovery: name [_yor.si], length [363], checksum [null], writtenBy [null]
    at org.elasticsearch.index.store.Store.verifyAfterCleanup(Store.java:645)
    at org.elasticsearch.index.store.Store.cleanupAndVerify(Store.java:613)
    at org.elasticsearch.indices.recovery.RecoveryTarget$CleanFilesRequestHandler.messageReceived(RecoveryTarget.java:428)
    ... 6 more

The index/shard mentioned varied. We finally got past this by reconfiguring
the troubled indices to have 0 replicas and then back to 1.
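
In case anyone else hits this, the workaround boils down to something like
the following for each affected index (using signal_bin from the log above):

  # Drop the replica so the failing recovery is abandoned
  curl -XPUT 'http://localhost:9200/signal_bin/_settings' -d '{"index": {"number_of_replicas": 0}}'

  # Once the cluster has settled, add it back so it is rebuilt from the primary
  curl -XPUT 'http://localhost:9200/signal_bin/_settings' -d '{"index": {"number_of_replicas": 1}}'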

Has anybody seen something similar? Did we hit a bug or did we do something
wrong?

/MaF
