Corrupted shard after optimize

I ran an optimize on this index, and it looks like it caused a shard to
become corrupted. Or maybe the optimize just brought existing shard
corruption to light?

On the node that reported the corrupted shard I tried shutting it down,
moving the shard out and then restarting. Unfortunately the next node that
got that shard then started with the same corruption issues. The errors:

Mar 24 01:40:17 localhost elasticsearch: [bma.0][WARN ][indices.cluster ] [Meteorite II] [1-2013][0] failed to start shard
Mar 24 01:40:17 localhost org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [1-2013][0] failed to fetch index version after copying it over
Mar 24 01:40:17 localhost elasticsearch: [bma.0][WARN ][cluster.action.shard ] [Meteorite II] [1-2013][0] sending failed shard for [1-2013][0], node[ZzXsIZCsTyWD2emFuU0idg], [P], s[INITIALIZING], indexUUID [na], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[1-2013][0] failed to fetch index version after copying it over]; nested: CorruptIndexException[[1-2013][0] Corrupted index [corrupted_OahNymObSTyBzCCPu1FuJA] caused by: CorruptIndexException[docs out of order (1493829 <= 1493874 ) (docOut: org.apache.lucene.store.RateLimitedIndexOutput@2901a3e1)]]; ]]

I tried using CheckIndex, but had this issue:

java.lang.IllegalArgumentException: A SPI class of type org.apache.lucene.codecs.PostingsFormat with name 'es090' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath.The current classpath supports the following names: [Pulsing41, SimpleText, Memory, BloomFilter, Direct, FSTPulsing41, FSTOrdPulsing41, FST41, FSTOrd41, Lucene40, Lucene41]

When running with:

java -cp /usr/share/elasticsearch/lib/lucene-codecs-4.9.1.jar:/usr/share/elasticsearch/lib/lucene-core-4.9.1.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex

I'm not a Java programmer, so after trying a few other classpath
combinations I was out of ideas.

Any tips? Looking at _cat/shards the replica is currently marked
"unassigned" while the primary is "initializing". Thanks!
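
For anyone wanting to pull the same information out of _cat/shards programmatically, here is a minimal Python sketch. It assumes the usual column order (index, shard, prirep, state, ...) and the sample text is made up for illustration; only the index name and the two states come from the post above.

```python
def unstarted_shards(cat_output):
    """Return (index, shard, prirep, state) for every shard that is not STARTED."""
    rows = []
    for line in cat_output.splitlines():
        parts = line.split()
        # Skip blank lines and the header row printed when ?v is used.
        if len(parts) < 4 or not parts[1].isdigit():
            continue
        index, shard, prirep, state = parts[:4]
        if state.upper() != "STARTED":
            rows.append((index, int(shard), prirep, state.upper()))
    return rows


# Hypothetical _cat/shards?v output: the IP, sizes and the second index are invented.
sample = """\
index  shard prirep state        docs store ip       node
1-2013 0     p      INITIALIZING            10.0.0.5 Meteorite II
1-2013 0     r      UNASSIGNED
2-2013 0     p      STARTED      1200 5mb   10.0.0.5 Meteorite II
"""

for row in unstarted_shards(sample):
    print(row)
```

Only the first four columns are read, since the docs and store columns can be empty for shards that have not started.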

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/31fa3d97-02fa-4d1c-b507-d413051f2ea3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Hmm, not good.

Which version of ES? Do you have a full stack trace for the exception?

To run CheckIndex you need to add all ES jars to the classpath. It's
easiest to just use a wildcard for this, e.g.:

java -cp "/path/to/es-install/lib/*" org.apache.lucene.index.CheckIndex ...

Make sure you have the double quotes so the shell does not expand that
wildcard!
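
If the command is launched from a script instead of an interactive shell, the same idea can be sketched in Python. This is only an illustration: both paths are placeholders, and the index directory argument is an assumption about what goes in place of the "..." above. Passing the arguments as a list means no shell is involved, so the "*" reaches the JVM untouched and java expands it to every jar in that directory.

```python
import subprocess


def check_index_command(es_lib_dir, shard_index_dir):
    """Build the argv list for running Lucene's CheckIndex on one shard directory."""
    # The trailing "/*" is a JVM classpath wildcard, expanded by java itself.
    classpath = es_lib_dir.rstrip("/") + "/*"
    return [
        "java",
        "-cp",
        classpath,
        "org.apache.lucene.index.CheckIndex",
        shard_index_dir,
    ]


# Placeholder paths, not real locations.
cmd = check_index_command("/path/to/es-install/lib", "/path/to/shard/index")
print(cmd)

# To actually run it (read-only check, no -fix):
# subprocess.run(cmd, check=False)
```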

Mike McCandless


Thanks for the CheckIndex info, that worked! It looks like only one of the
segments in that shard has issues:

1 of 20: name=_1om docCount=216683
  codec=Lucene3x
  compound=false
  numFiles=10
  size (MB)=5,111.421
  diagnostics = {os=Linux, os.version=3.5.7, mergeFactor=7, source=merge, lucene.version=3.6.0 1310449 - rmuir - 2012-04-06 11:31:16, os.arch=amd64, mergeMaxNumSegments=-1, java.version=1.6.0_26, java.vendor=Sun Microsystems Inc.}
  no deletions
  test: open reader.........OK
  test: check integrity.....OK
  test: check live docs.....OK
  test: fields..............OK [31 fields]
  test: field norms.........OK [20 fields]
  test: terms, freq, prox...ERROR: java.lang.AssertionError: index=216690, numBits=216683
java.lang.AssertionError: index=216690, numBits=216683
    at org.apache.lucene.util.FixedBitSet.set(FixedBitSet.java:252)
    at org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:932)
    at org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:1325)
    at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:631)
    at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:2051)
  test: stored fields.......OK [3033562 total field count; avg 14 fields per doc]
  test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
  test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET]
FAILED
  WARNING: fixIndex() would remove reference to this segment; full exception:
java.lang.RuntimeException: Term Index test failed
    at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:646)
    at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:2051)
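
With 20 segments the report gets long, so a small script can help pick out the bad ones. The Python sketch below is a rough illustration that assumes the layout shown above: a header line of the form "N of M: name=... docCount=..." per segment, and a bare FAILED line when a segment does not pass. The second segment in the sample is invented to show a passing case.

```python
import re

# Matches the per-segment header line, e.g. "1 of 20: name=_1om docCount=216683".
SEGMENT_HEADER = re.compile(r"^\s*(\d+) of (\d+): name=(\S+) docCount=(\d+)")


def failed_segments(report):
    """Return the names of segments whose section contains a FAILED line."""
    failed = []
    current = None
    for line in report.splitlines():
        match = SEGMENT_HEADER.match(line)
        if match:
            current = match.group(3)
        elif current and line.strip() == "FAILED" and current not in failed:
            failed.append(current)
    return failed


# Abbreviated report: the first segment is from the output above,
# the second (_1on) is hypothetical.
sample_report = """\
1 of 20: name=_1om docCount=216683
  test: open reader.........OK
  test: terms, freq, prox...ERROR: java.lang.AssertionError: index=216690, numBits=216683
FAILED
2 of 20: name=_1on docCount=10
  test: open reader.........OK
"""

print(failed_segments(sample_report))
```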

This is on ES 1.3.4, but the index I was running optimize on was likely
created back in 0.9 or 1.0.


Quick follow-up question: is it safe to run CheckIndex with -fix while ES is
also running on the node? I understand that some documents will be lost.
