Problem on shard allocation when upgrading from 1.2.2 to 1.3.0


(Antonio Augusto Santos) #1

Dear,

I upgraded from 1.2.2 to 1.3.0 today and ran into one issue.
I have a 3-node cluster (one node holds no data) running on CentOS 6.5. I did a
rolling upgrade: first I upgraded the no-data node (server_0), then shut down
server_1, upgraded it (with the RPM), and restarted it. All shards came back
to life without problems, and my cluster was green.
Then I shut down node 3 (server_2), upgraded it, and restarted ES. After this,
almost everything was back to normal except for one shard of one index, and I'm
getting the following error on server_2:

[2014-07-24 10:47:59,575][WARN ][cluster.action.shard ] [server_2] [MY_INDEX][0] received shard failed for [MY_INDEX][0], node[Y0EJ2oh2QI-cdh2Jxi9z4A], [R], s[INITIALIZING], indexUUID [OR_0aHy6TIiHZVK_9PaPBQ], reason [Failed to start shard, message [RecoveryFailedException[[MY_INDEX][0]: Recovery failed from [server_1][pkRqLLmtS8iUxF80uQJzFw][server_1][inet[/XXX.XXX.XXX.001:9300]]{master=true} into [server_2][Y0EJ2oh2QI-cdh2Jxi9z4A][server_2][inet[/XXX.XXX.XXX.002:9300]]{master=true}];
  nested: RemoteTransportException[[server_1][inet[/XXX.XXX.XXX.001:9300]][index/shard/recovery/startRecovery]];
  nested: RecoveryEngineException[[MY_INDEX][0] Phase[2] Execution failed];
  nested: RemoteTransportException[[server_2][inet[/XXX.XXX.XXX.002:9300]][index/shard/recovery/prepareTranslog]];
  nested: EngineCreationFailureException[[MY_INDEX][0] failed to open reader on writer];
  nested: FileNotFoundException[No such file [_drr_Lucene45_0.dvm]]; ]]
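
The chain in this message is easier to follow when read from the outermost exception to the innermost one. A minimal sketch, assuming the message is available as one string; the helper name `exception_chain` is made up for illustration and is not part of Elasticsearch, and the sample string is a shortened copy of the message above (node ids and addresses dropped):

```python
import re


def exception_chain(message):
    """Return the exception class names in a log message, outermost first."""
    chain = []
    # Each link in the chain is separated by "; nested: " in this log format.
    for part in message.split("; nested: "):
        # Take the first word ending in "Exception" that is followed by "[".
        match = re.search(r"(\w+Exception)\[", part)
        if match:
            chain.append(match.group(1))
    return chain


# Shortened copy of the message from the thread.
log = (
    "reason [Failed to start shard, message [RecoveryFailedException[[MY_INDEX][0]: "
    "Recovery failed from [server_1] into [server_2]]; nested: "
    "RemoteTransportException[[server_1][index/shard/recovery/startRecovery]]; nested: "
    "RecoveryEngineException[[MY_INDEX][0] Phase[2] Execution failed]; nested: "
    "RemoteTransportException[[server_2][index/shard/recovery/prepareTranslog]]; nested: "
    "EngineCreationFailureException[[MY_INDEX][0] failed to open reader on writer]; nested: "
    "FileNotFoundException[No such file [_drr_Lucene45_0.dvm]]; ]]"
)

# Print one exception per line, indented by nesting depth.
for depth, name in enumerate(exception_chain(log)):
    print("  " * depth + name)
```

Read this way, the innermost `FileNotFoundException` sits below the `RemoteTransportException` that names server_2 and `prepareTranslog`, which may mean the missing file was reported on the receiving side; that reading is a guess from the log text only.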

From what I understand of the message, server_1 is trying to send
_drr_Lucene45_0.dvm to server_2 but can't find it. I looked in
/var/lib/elasticsearch/MY_CLUSTER/nodes/0/indices/MY_INDEX/0/index
on server_1, and there is no such file, but there is a
_drr_Lucene49_0.dvm.
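
The same check can be done over a whole directory listing instead of by eye. A minimal sketch, assuming these files follow a `<segment>_<format>_<suffix>.<extension>` naming pattern (inferred from the two names above, not taken from the Lucene documentation); `group_by_format` is a hypothetical helper, and every name in the sample listing other than `_drr_Lucene49_0.dvm` is invented:

```python
import re
from collections import defaultdict

# Assumed pattern: <segment>_<format>_<suffix>.<ext>, e.g. _drr_Lucene45_0.dvm
PER_FIELD_FILE = re.compile(r"^(_[0-9a-z]+)_([A-Za-z]+\d*)_(\d+)\.(\w+)$")


def group_by_format(filenames):
    """Map each segment name to the set of format names found in its files."""
    formats = defaultdict(set)
    for name in filenames:
        match = PER_FIELD_FILE.match(name)
        if match:
            segment, fmt, _suffix, _ext = match.groups()
            formats[segment].add(fmt)
    return dict(formats)


# Hypothetical listing; only _drr_Lucene49_0.dvm comes from the thread.
listing = ["_drr_Lucene49_0.dvm", "_drr.si", "segments_5"]
print(group_by_format(listing))
```

To run it against a real shard directory, pass `os.listdir(path)` as the list of names; a segment that shows only `Lucene49` would match what I saw on server_1.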

I've checked and both servers are running on 1.3.0:

ps -ef | grep elastic

498 121131 1 55 10:46 ? 00:11:41 /usr/bin/java -Xms5g
-Xmx5g -Xss256k -Djava.awt.headless=true -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError
-XX:+DisableExplicitGC -Djna.tmpdir=/usr/share/elasticsearch/tmp
-Djava.io.tmpdir=/usr/share/elasticsearch/tmp -Delasticsearch
-Des.pidfile=/var/run/elasticsearch/elasticsearch.pid
-Des.path.home=/usr/share/elasticsearch -cp
:/usr/share/elasticsearch/lib/elasticsearch-1.3.0.jar:/usr/share/elasticsearch/lib/:/usr/share/elasticsearch/lib/sigar/
-Des.default.path.home=/usr/share/elasticsearch
-Des.default.path.logs=/var/log/elasticsearch
-Des.default.path.data=/var/lib/elasticsearch
-Des.default.path.work=/usr/share/elasticsearch/tmp
-Des.default.path.conf=/etc/elasticsearch
org.elasticsearch.bootstrap.Elasticsearch

ps -ef | grep elastic

498 25573 1 59 10:41 ? 00:15:15 /usr/bin/java -Xms5g
-Xmx5g -Xss256k -Djava.awt.headless=true -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError
-XX:+DisableExplicitGC -Djna.tmpdir=/usr/share/elasticsearch/tmp
-Djava.io.tmpdir=/usr/share/elasticsearch/tmp -Delasticsearch
-Des.pidfile=/var/run/elasticsearch/elasticsearch.pid
-Des.path.home=/usr/share/elasticsearch -cp
:/usr/share/elasticsearch/lib/elasticsearch-1.3.0.jar:/usr/share/elasticsearch/lib/:/usr/share/elasticsearch/lib/sigar/
-Des.default.path.home=/usr/share/elasticsearch
-Des.default.path.logs=/var/log/elasticsearch
-Des.default.path.data=/var/lib/elasticsearch
-Des.default.path.work=/usr/share/elasticsearch/tmp
-Des.default.path.conf=/etc/elasticsearch
org.elasticsearch.bootstrap.Elasticsearch
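
The running version can also be read from the cluster itself rather than from the process list. A minimal sketch, assuming HTTP is enabled on port 9200 of the node being queried (the address and the helper names are illustrative assumptions); it reads the `version` field that the nodes info API (`GET /_nodes`) reports for each node:

```python
import json
from urllib.request import urlopen


def node_versions(nodes_info):
    """Return {node name: version} from a parsed nodes info response."""
    return {node["name"]: node["version"] for node in nodes_info["nodes"].values()}


def fetch_nodes_info(base_url="http://localhost:9200"):
    """Fetch GET /_nodes; the default address is an assumption."""
    with urlopen(base_url + "/_nodes") as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical response, trimmed to the fields used here.
# The node ids are the ones that appear in the log message.
sample = {
    "nodes": {
        "pkRqLLmtS8iUxF80uQJzFw": {"name": "server_1", "version": "1.3.0"},
        "Y0EJ2oh2QI-cdh2Jxi9z4A": {"name": "server_2", "version": "1.3.0"},
    }
}

versions = node_versions(sample)
print(versions)
# A rolling upgrade is complete when only one distinct version remains.
print(len(set(versions.values())) == 1)
```

Against a live cluster, `node_versions(fetch_nodes_info())` would replace the sample.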

I've already restarted ES on server_2, to no avail. I haven't restarted
server_1 because I'm afraid of losing the data that is there (my searches
appear to be working OK and returning the expected results).

Maybe something went wrong during the update? Any suggestions on how to fix
this problem?

Cheers

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/651bc861-36f6-43b9-808c-2bfb8541de56%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Antonio Augusto Santos) #2

Well, I don't know what happened, but suddenly the shard replicated.
I was trying to copy the index to a new one with stream2es, and when the
copy started the cluster state turned green...

The problem, for me, is solved.
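
For anyone who wants to confirm the same thing on their own cluster, the state can be checked through the cluster health API (`GET /_cluster/health`). A minimal sketch, assuming HTTP on port 9200; the address and the helper names are assumptions, and the sample responses are invented and trimmed to the fields used:

```python
import json
from urllib.request import urlopen


def recovery_pending(health):
    """True while any shard is still unassigned or initializing."""
    return health["unassigned_shards"] > 0 or health["initializing_shards"] > 0


def fetch_health(base_url="http://localhost:9200"):
    """Fetch GET /_cluster/health; the default address is an assumption."""
    with urlopen(base_url + "/_cluster/health") as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical responses: one replica still initializing, then fully recovered.
during = {"status": "yellow", "unassigned_shards": 0, "initializing_shards": 1}
after = {"status": "green", "unassigned_shards": 0, "initializing_shards": 0}

for health in (during, after):
    print(health["status"], recovery_pending(health))
```

Against a live cluster, `recovery_pending(fetch_health())` would replace the samples.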

On Thursday, July 24, 2014 11:13:41 AM UTC-3, Antonio Augusto Santos wrote:



