Failing Replica Shards

Hello,

Over the past couple of days I've been getting a lot of error messages about
corrupted replica shards. The primary shards come up fast after an ES process
restart, but the replicas take a long time to come back. Sometimes it takes a
few node restarts to 'kick' the nodes into starting the replica shards.

ES version is 1.3.1, running on CentOS 6.5 hosted at SoftLayer. It's a
3-node cluster with 4 Logstash feeders hanging off it.

Here are the errors:

[2014-08-26 15:01:18,682][WARN ][cluster.action.shard ] [log03 / Salvador Dali] [downloader-2014.08][4] received shard failed for [downloader-2014.08][4], node[l9-BQTHSSF-ElhgpPBZ24w], [R], s[INITIALIZING], indexUUID [2vRrb5YlQP6MTVr1chOezg], reason [engine failure, message [corrupted preexisting index][CorruptIndexException[[downloader-2014.08][4] Corrupted index [corrupted_SkU0-ZHZRxivSnGczABb_g] caused by: CorruptIndexException[codec footer mismatch: actual footer=-1676705023 vs expected footer=-1071082520 (resource: NIOFSIndexInput(path="/acc/ES/NBS/nodes/0/indices/downloader-2014.08/4/index/_k9a_es090_0.doc"))]]]]
[2014-08-26 15:01:18,682][WARN ][cluster.action.shard ] [log03 / Salvador Dali] [eventlog-2014.06][0] received shard failed for [eventlog-2014.06][0], node[l9-BQTHSSF-ElhgpPBZ24w], [R], s[INITIALIZING], indexUUID [jbvChdRrRB6HTutxPvxMmQ], reason [engine failure, message [corrupted preexisting index][CorruptIndexException[[eventlog-2014.06][0] Corrupted index [corrupted__712QIBQQqafzpBoQwZtcg] caused by: CorruptIndexException[codec footer mismatch: actual footer=0 vs expected footer=-1071082520 (resource: NIOFSIndexInput(path="/acc/ES/NBS/nodes/0/indices/eventlog-2014.06/0/index/_1k4x.nvd"))]]]]
[2014-08-26 15:01:18,684][WARN ][cluster.action.shard ] [log03 / Salvador Dali] [eventlog-2014.07][0] received shard failed for [eventlog-2014.07][0], node[l9-BQTHSSF-ElhgpPBZ24w], [R], s[INITIALIZING], indexUUID [T4tTXkPjTaCdSVNTjHfOcg], reason [engine failure, message [corrupted preexisting index][CorruptIndexException[[eventlog-2014.07][0] Corrupted index [corrupted_OzfNRRGyTIq8a1PRhLYG2w] caused by: CorruptIndexException[codec footer mismatch: actual footer=0 vs expected footer=-1071082520 (resource: NIOFSIndexInput(path="/acc/ES/NBS/nodes/0/indices/eventlog-2014.07/0/index/_rqf.nvd"))]]]]


Thanks,

David
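
For anyone triaging a log like the one above, here is a minimal sketch (not part of the original post) that pulls the affected index, shard number and file out of each "received shard failed" entry. The function name parse_failures and the regular expression are illustrative assumptions, and the pattern only covers entries shaped like the three shown here. The expected footer value in the log is Lucene's footer magic, the bitwise complement of the codec header magic 0x3FD76C17; an actual value of 0 means the four bytes where that magic should sit are all zero, which may point to a truncated or zero-filled file.

```python
import re

# Lucene ends each index file with a footer whose first field is FOOTER_MAGIC,
# the bitwise complement of the codec header magic 0x3FD76C17.
CODEC_MAGIC = 0x3FD76C17
FOOTER_MAGIC = ~CODEC_MAGIC  # -1071082520, the "expected footer" in the log

# One "received shard failed" entry that reports a footer mismatch
# (matched after the hard-wrapped lines have been joined).
ENTRY = re.compile(
    r"received shard failed for \[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\]"
    r".*?actual footer=(?P<actual>-?\d+) vs expected footer=(?P<expected>-?\d+)"
    r'.*?path="(?P<path>[^"]+)"'
)


def parse_failures(log_text):
    """Return one dict per shard-failure entry that reports a footer mismatch."""
    # Mail clients hard-wrap long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return [
        {
            "index": m.group("index"),
            "shard": int(m.group("shard")),
            "actual_footer": int(m.group("actual")),
            "expected_footer": int(m.group("expected")),
            "path": m.group("path"),
        }
        for m in ENTRY.finditer(flat)
    ]
```

Feeding the log text above to parse_failures should give one record per failed replica, which makes it easier to see which indices and files are involved before deciding what to repair.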

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/c0af53fb-6fdd-4624-bf6c-9b9d50081689%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Hey David, I have the same problem now. Have you found a solution?

On Tuesday, 26 August 2014 23:08:55 UTC+3, David Kleiner wrote:


Hello Mehmet,

For two indices with problematic shards (symptoms: a shard starts recovering,
the recovery stops, and recovery is then attempted on a different node), I
manually changed the replica count to 1, then to 2. With a big index (over
90G, I think), I was never able to recover the dual replica set; thankfully,
it was OK to drop it. Upgrading to a more recent ES version helped too.

HTH,

David
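
The replica-count change described above can be made through the update index settings API (PUT /<index>/_settings with index.number_of_replicas). What follows is a minimal sketch rather than the exact commands David ran: the function names are hypothetical, and the base URL http://localhost:9200 is an assumption about where a node is listening.

```python
import json
import urllib.request


def replica_settings_request(base_url, index, replicas):
    """Build (without sending) a PUT /<index>/_settings request."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def set_replicas(base_url, index, replicas):
    """Send the request to a running cluster and return the parsed JSON reply."""
    with urllib.request.urlopen(replica_settings_request(base_url, index, replicas)) as resp:
        return json.load(resp)
```

Following the message above, one would call set_replicas("http://localhost:9200", "eventlog-2014.06", 1), wait for the cluster to settle, and then call it again with 2. Building the request separately from sending it makes it possible to inspect the URL and body before touching a live cluster.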

On Saturday, November 29, 2014 2:48:45 AM UTC-8, Mehmet Cem Güntürkün wrote:


I've had similar problems. Two things that helped:

  1. If the index had more than one shard, optimizing it to one shard
    usually worked.
  2. Otherwise, manually copying the shard files from the node holding the
    primary shard to one of the nodes that kept failing.
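
For the manual copy in item 2, a rough sketch of how the paths and an rsync command line could be assembled. The directory layout is read off the paths in the error log earlier in this thread; everything else is an assumption: that both nodes use the same data root, that rsync over SSH is available, and the host name "log01" used as an example. It is probably safest to stop Elasticsearch on the receiving node first, and a copy taken from a primary that is still being written to may not be consistent. The thread does not spell out any of these details.

```python
import posixpath


def shard_index_dir(data_root, index, shard, node_ordinal=0):
    """Lucene directory of one shard: <data_root>/nodes/<n>/indices/<index>/<shard>/index."""
    return posixpath.join(
        data_root, "nodes", str(node_ordinal), "indices", index, str(shard), "index"
    )


def rsync_command(source_host, data_root, index, shard):
    """Command (to run on the failing node) that pulls the shard files from source_host."""
    path = shard_index_dir(data_root, index, shard)
    # -a preserves ownership, permissions and timestamps; --delete removes local
    # files that no longer exist on the source copy. Trailing slashes copy the
    # directory contents rather than the directory itself.
    return ["rsync", "-a", "--delete", f"{source_host}:{path}/", f"{path}/"]
```

The returned list can be passed to subprocess.run once the paths have been double-checked by hand; the sketch deliberately builds the command without executing it.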

On Sunday, 30 November 2014 00:57:02 UTC+1, David Kleiner wrote:


Small mistake. Item 1 should be:

  1. If the shard had more than one segment, optimizing it to one segment
    usually worked.
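
On ES 1.x, merging down to one segment is done with the optimize API (POST /<index>/_optimize?max_num_segments=1); later releases renamed the endpoint to _forcemerge. A minimal sketch follows, with hypothetical function names and an assumed base URL of http://localhost:9200. Whether this clears a given corruption is not guaranteed; the message above only says it usually worked.

```python
import json
import urllib.parse
import urllib.request


def optimize_request(base_url, index, max_num_segments=1):
    """Build (without sending) a POST /<index>/_optimize request."""
    query = urllib.parse.urlencode({"max_num_segments": max_num_segments})
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_optimize?{query}",
        method="POST",
    )


def optimize(base_url, index, max_num_segments=1):
    """Send the request to a running cluster and return the parsed JSON reply."""
    with urllib.request.urlopen(optimize_request(base_url, index, max_num_segments)) as resp:
        return json.load(resp)
```

Note that the request names an index, so every shard of that index is merged; on a large index this can take a while and generate a lot of I/O.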

On Sunday, 30 November 2014 12:00:37 UTC+1, Jakub Podeszwik wrote:
