After adding a third node to the cluster, some shards won't relocate

Earlier today I added a third node to our cluster. It shares the same
version of elasticsearch (0.90.10) and jvm (1.7.0_13) as the two existing
nodes.

Now, some hours after I added the node, two shards are still "relocating".
The status of the cluster is green though. I'm getting some errors in the
log of the node I added:

[2014-02-13 19:13:43,572][WARN ][cluster.action.shard ]
[elasticsearch03] [vgd][3] sending failed shard for [vgd][3],
node[LgR5cuiCQmSfOTfTl6t1qA], relocating [VuACiBeiToyz7xEZ5RJsxQ], [P],
s[INITIALIZING], indexUUID [-5I0LkSET8GXIaOLCpnQUQ], reason [Failed to
start shard, message [RecoveryFailedException[[vgd][3]: Recovery failed
from [elasticsearch01][VuACiBeiToyz7xEZ5RJsxQ][inet[/10.84.200.129:9300]]
into [elasticsearch03][LgR5cuiCQmSfOTfTl6t1qA][inet[/10.84.100.219:9300]]];
nested:
RemoteTransportException[[elasticsearch01][inet[/10.84.200.129:9300]][index/shard/recovery/startRecovery]];
nested: RecoveryEngineException[[vgd][3] Phase[2] Execution failed];
nested:
ReceiveTimeoutTransportException[[elasticsearch03][inet[/10.84.100.219:9300]][index/shard/recovery/prepareTranslog]
request_id [712931] timed out after [900000ms]]; ]]

It says it "timed out", but there are no connection issues between the nodes
as far as I can tell. The new node has ~2M docs, whereas nodes 1 and 2 have
~45M (which is the total number of indexed docs). The new node has also been
using quite a lot of CPU ever since it joined the cluster earlier today.

Any tips on how to debug this further so I can get a three-node cluster up
and running?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/3624b153-3ad5-4a24-8e3d-f189e714c9fd%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

There are quite a lot of errors like the one I pasted, by the way.

On Thursday, February 13, 2014 7:25:22 PM UTC+1, Christer wrote:

[quoted original post snipped]

And on elasticsearch01 (the node referred to in the error message) I'm
seeing a whole lot of these:

[2014-02-13 19:13:42,391][WARN ][transport ]
[elasticsearch01] Received response for a request that has timed out, sent
[1138869ms] ago, timed out [238869ms] ago, action
[index/shard/recovery/prepareTranslog], node
[[elasticsearch03][LgR5cuiCQmSfOTfTl6t1qA][inet[/10.84.100.219:9300]]], id
[702105]
[2014-02-13 19:13:43,573][WARN ][cluster.action.shard ]
[elasticsearch01] [vgd][3] received shard failed for [vgd][3],
node[LgR5cuiCQmSfOTfTl6t1qA], relocating [VuACiBeiToyz7xEZ5RJsxQ], [P],
s[INITIALIZING], indexUUID [-5I0LkSET8GXIaOLCpnQUQ], reason [Failed to
start shard, message [RecoveryFailedException[[vgd][3]: Recovery failed
from [elasticsearch01][VuACiBeiToyz7xEZ5RJsxQ][inet[/10.84.200.129:9300]]
into [elasticsearch03][LgR5cuiCQmSfOTfTl6t1qA][inet[/10.84.100.219:9300]]];
nested:
RemoteTransportException[[elasticsearch01][inet[/10.84.200.129:9300]][index/shard/recovery/startRecovery]];
nested: RecoveryEngineException[[vgd][3] Phase[2] Execution failed];
nested:
ReceiveTimeoutTransportException[[elasticsearch03][inet[/10.84.100.219:9300]][index/shard/recovery/prepareTranslog]
request_id [712931] timed out after [900000ms]]; ]]
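
The two durations in the transport warning can be subtracted to see how long the sender actually waited before giving up. Below is a minimal Python sketch of that arithmetic, assuming the log line has the shape pasted above; the function name and the regex group names are my own, not anything from Elasticsearch.

```python
import re

# Matches the "sent [...] ago, timed out [...] ago" part of the transport
# warning. \s+ tolerates the line wraps introduced by the mail client.
LATE_RESPONSE = re.compile(
    r"sent\s+\[(?P<sent_ms>\d+)ms\]\s+ago,\s+"
    r"timed out\s+\[(?P<late_ms>\d+)ms\]\s+ago"
)


def late_response_timing(log_text):
    """Return (timeout_ms, total_ms, overrun_ms), or None if nothing matches.

    timeout_ms: how long the sender waited before giving up
    total_ms:   how long the response really took to arrive
    overrun_ms: how far past the timeout the response arrived
    """
    match = LATE_RESPONSE.search(log_text)
    if match is None:
        return None
    sent_ms = int(match.group("sent_ms"))
    late_ms = int(match.group("late_ms"))
    return sent_ms - late_ms, sent_ms, late_ms


# The warning from elasticsearch01, copied from the log above.
sample = (
    "[elasticsearch01] Received response for a request that has timed out, sent\n"
    "[1138869ms] ago, timed out [238869ms] ago, action\n"
    "[index/shard/recovery/prepareTranslog]"
)

timeout_ms, total_ms, overrun_ms = late_response_timing(sample)
print("waited %.1f min, response took %.1f min" % (
    timeout_ms / 60000.0, total_ms / 60000.0))
```

For this line the difference is 1138869 - 238869 = 900000 ms, the same 15-minute limit reported in the ReceiveTimeoutTransportException. Read that way, elasticsearch03 did answer the prepareTranslog request, just about four minutes too late, which would suggest a slow shard start on the new node rather than a network fault.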

After I disabled the index.shard.check_on_startup: true option on the new
node, it managed to join the cluster properly, and all shards have since been
successfully relocated.
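
Since the cluster reported green even while shards were stuck relocating, the status alone is not enough to tell whether relocation has finished. Here is a rough Python sketch of a check against the cluster health API (GET /_cluster/health), which reports relocating_shards, initializing_shards and unassigned_shards next to the status. The base URL and the helper names are assumptions of mine, and the example dict is hand-written rather than real output from this cluster.

```python
import json
from urllib.request import urlopen

PENDING_KEYS = ("relocating_shards", "initializing_shards", "unassigned_shards")


def relocation_summary(health):
    """Summarise a parsed _cluster/health response (a dict).

    Returns (settled, pending): settled is True only when the status is
    green and no shards are relocating, initializing or unassigned.
    """
    pending = {key: health.get(key, 0) for key in PENDING_KEYS}
    settled = health.get("status") == "green" and not any(pending.values())
    return settled, pending


def fetch_health(base_url="http://localhost:9200"):
    """Fetch cluster health over HTTP. base_url is an assumed address."""
    with urlopen(base_url + "/_cluster/health", timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Hand-written example of the situation from the original post:
# status green, but two shards still relocating.
example = {
    "status": "green",
    "relocating_shards": 2,
    "initializing_shards": 0,
    "unassigned_shards": 0,
}
settled, pending = relocation_summary(example)
print(settled, pending)
```

Against a live node one would call relocation_summary(fetch_health()) and repeat it every so often until settled comes back True.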
