Shard copying performance

Michael_Salmon · April 29, 2014, 1:50pm

I am having trouble replicating a shard and I cannot see any possible
reason for it. After 15 minutes I get a timeout in phase 2.

The shard isn't that large (about 60,000K, 5GB and 22 segments), and the
translog directories are empty.
The computers in question are lightly loaded, as is the network between them.
Copying all the files in the shard from all 4 disks between the two
computers with rsync takes about 40 seconds.
I can't run checkIndex on the source machine, as it can't handle shards that
are spread over multiple disks. It runs quite happily on the files I copied
with rsync, although the check took a bit over 12 minutes.
I have ES 1.1.0 installed.
I changed some settings, but none of them seems to make much difference:

"transient": {
  "logger": {
    "level": "TRACE"
  },
  "indices": {
    "store": {
      "throttle": {
        "type": "none"
      }
    },
    "recovery": {
      "translog_size": "256MB",
      "concurrent_streams": "16",
      "translog_ops": "10000",
      "max_bytes_per_sec": "250MB"
    }
  }
}
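
As a rough aid for checking these values against what the nodes log, the sketch below flattens the nested fragment into dotted setting names and builds a JSON body. It assumes the fragment is meant as the body of a cluster update settings request (PUT /_cluster/settings); the post does not say how the settings were applied. The helper name flatten_settings is hypothetical, and nothing is sent to a cluster.

```python
import json


def flatten_settings(nested, prefix=""):
    """Flatten nested settings into dotted keys such as
    indices.recovery.max_bytes_per_sec (hypothetical helper)."""
    flat = {}
    for key, value in nested.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_settings(value, dotted))
        else:
            flat[dotted] = value
    return flat


# Values copied verbatim from the post above.
transient = {
    "logger": {"level": "TRACE"},
    "indices": {
        "store": {"throttle": {"type": "none"}},
        "recovery": {
            "translog_size": "256MB",
            "concurrent_streams": "16",
            "translog_ops": "10000",
            "max_bytes_per_sec": "250MB",
        },
    },
}

# Assumed request body for PUT /_cluster/settings; it is only built here.
body = json.dumps({"transient": transient}, indent=2)

flat = flatten_settings(transient)
for name in sorted(flat):
    print(f"{name} = {flat[name]}")
```

The dotted names are the form in which individual settings are usually referred to, which makes it easier to search the logs for each one.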

Does anyone have any tips on how I should proceed?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a85c76cb-72d5-45c4-82cf-d8c8867a2151%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

spinscale · May 5, 2014, 10:25am

Hey,

You could change your default log level (to either DEBUG or TRACE) to find
out whether those settings are actually applied. Depending on the
Elasticsearch version you are using, you might want to try a lower-cased
value for max_bytes_per_sec and set it to "250mb". Also, can you show the
exception which contains the "timeout in phase 2"?

--Alex


Michael_Salmon · May 5, 2014, 11:55am

This is the exception that I posted earlier:

[2014-04-28 13:40:15,039][WARN ][cluster.action.shard] [eis05] [ds_clearcase-vob-heat-analyzer][2] sending failed shard for [ds_clearcase-vob-heat-analyzer][2], node[QyeTlW2YQbG27zrsdjBBGA], [R], s[INITIALIZING], indexUUID [ms7jQeuMQduNIHCmjxsKjQ], reason [Failed to start shard, message [RecoveryFailedException[[ds_clearcase-vob-heat-analyzer][2]: Recovery failed from [eis09][p8-_fzHeTR22pSlsBsYm8A][eis09.rnditlab.ericsson.se][inet[/137.58.184.239:9300]]{datacenter=PoCC} into [eis05][QyeTlW2YQbG27zrsdjBBGA][eis05.rnditlab.ericsson.se][inet[eis05.rnditlab.ericsson.se/137.58.184.235:9300]]{datacenter=PoCC}]; nested: RemoteTransportException[[eis09][inet[/137.58.184.239:9300]][index/shard/recovery/startRecovery]]; nested: RecoveryEngineException[[ds_clearcase-vob-heat-analyzer][2] Phase[2] Execution failed]; nested: ReceiveTimeoutTransportException[[eis05][inet[/137.58.184.235:9300]][index/shard/recovery/prepareTranslog] request_id [6809886] timed out after [900000ms]]; ]]
[2014-04-28 14:00:11,614][WARN ][indices.cluster] [eis05] [ds_clearcase-vob-heat-analyzer][0] failed to start shard
org.elasticsearch.indices.recovery.RecoveryFailedException: [ds_clearcase-vob-heat-analyzer][0]: Recovery failed from [eis07][Q8ZWgDIXRGiUej1oMoH8Jg][eis07.rnditlab.ericsson.se][inet[/137.58.184.237:9300]]{datacenter=PoCC} into [eis05][QyeTlW2YQbG27zrsdjBBGA][eis05.rnditlab.ericsson.se][inet[eis05.rnditlab.ericsson.se/137.58.184.235:9300]]{datacenter=PoCC}
    at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:307)
    at org.elasticsearch.indices.recovery.RecoveryTarget.access$300(RecoveryTarget.java:65)
    at org.elasticsearch.indices.recovery.RecoveryTarget$3.run(RecoveryTarget.java:184)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.transport.RemoteTransportException: [eis07][inet[/137.58.184.237:9300]][index/shard/recovery/startRecovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [ds_clearcase-vob-heat-analyzer][0] Phase[2] Execution failed
    at org.elasticsearch.index.engine.internal.InternalEngine.recover(InternalEngine.java:1098)
    at org.elasticsearch.index.shard.service.InternalIndexShard.recover(InternalIndexShard.java:627)
    at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:117)
    at org.elasticsearch.indices.recovery.RecoverySource.access$1600(RecoverySource.java:61)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:337)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:323)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [eis05][inet[/137.58.184.235:9300]][index/shard/recovery/prepareTranslog] request_id [154592652] timed out after [900000ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
    ... 3 more
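
To make the timeout in the trace easier to spot, here is a minimal sketch that pulls the timed-out recovery action and its duration out of log text like the above. The regular expression and the function name find_recovery_timeouts are my own assumptions based only on the message format shown in this post; other Elasticsearch versions may word the message differently.

```python
import re

# Matches e.g. "[index/shard/recovery/prepareTranslog] request_id [6809886]
# timed out after [900000ms]". \s+ tolerates a line wrap before request_id.
TIMEOUT_RE = re.compile(
    r"\[(index/shard/recovery/\w+)\]\s+request_id \[(\d+)\] "
    r"timed out after \[(\d+)ms\]"
)


def find_recovery_timeouts(log_text):
    """Return (action, minutes) for every recovery timeout in log_text."""
    results = []
    for action, _request_id, millis in TIMEOUT_RE.findall(log_text):
        results.append((action, int(millis) / 60000))
    return results


# Fragment copied from the first log entry above.
sample = (
    "nested: ReceiveTimeoutTransportException[[eis05]"
    "[inet[/137.58.184.235:9300]][index/shard/recovery/prepareTranslog] "
    "request_id [6809886] timed out after [900000ms]]; ]]"
)

print(find_recovery_timeouts(sample))
```

For the sample this reports the prepareTranslog action with a duration of 15.0 minutes, which matches the "timeout in phase 2" after 15 minutes described in the first post.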

I checked in the log that max_bytes_per_sec changed; it accepts both MB
and mb.

I am also changing my log level to TRACE, but restarting the servers takes a
long while.


Michael_Salmon · March 17, 2015, 12:50pm

We recently removed index.shard.check_on_startup:fix from our settings and
haven't had this problem since. The guide describes the setting as "Should
shard consistency be checked upon opening", but it appears to affect
replication as well. I won't say that this is wrong, although it isn't what
I want, but I think the guide should be more explicit about when the
checking is done.
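
For anyone auditing their own settings for the same cause, the sketch below flags indices where index.shard.check_on_startup is enabled. It is only an illustration: the function name startup_check_warnings and the input shape (index name mapped to a flat settings dictionary) are hypothetical, and it assumes that a missing setting means the check is off. The earlier observation that checkIndex took a bit over 12 minutes on this shard is at least consistent with a check on the replica using up most of the 15-minute timeout, though nothing in this thread confirms that.

```python
def startup_check_warnings(index_settings):
    """Return one warning per index whose check_on_startup is not 'false'.

    index_settings maps an index name to a flat dict of dotted settings
    (a hypothetical shape chosen for this sketch).
    """
    warnings = []
    for index_name, settings in sorted(index_settings.items()):
        # Assumption: an absent setting means the check is disabled.
        raw = settings.get("index.shard.check_on_startup", "false")
        value = str(raw).lower()
        if value != "false":
            warnings.append(
                f"{index_name}: index.shard.check_on_startup={value}"
            )
    return warnings


# Illustrative input; only the first index name comes from the logs above.
example = {
    "ds_clearcase-vob-heat-analyzer": {"index.shard.check_on_startup": "fix"},
    "some-other-index": {},
}

for line in startup_check_warnings(example):
    print(line)
```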

