Index shard got corrupted

Hi

In all our Elasticsearch clusters we use the elasticsearch-cloud-aws plugin
to create snapshots on S3 on a regular basis.

Sometimes our Elasticsearch log shows that a shard of an index has become
corrupted.
When we then try to restore the index from backup, the same exception shows
up in the logs during the restore, as follows:

[2015-02-25 08:18:10,824][WARN ][indices.cluster ] [test-es-cluster-1e-data-2] [lst_p113_v_4_20140615_0000][0] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [lst_p113_v_4_20140615_0000][0] failed recovery
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:185)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [lst_p113_v_4_20140615_0000][0] restore failed
    at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:130)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:127)
    ... 3 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [lst_p113_v_4_20140615_0000][0] failed to restore snapshot [listening-prod6-20150224]
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:165)
    at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:124)
    ... 4 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [lst_p113_v_4_20140615_0000][0] Failed to recover index
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:787)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:162)
    ... 5 more
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1lvsjli actual=3awj8p resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@7266a49d)
    at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
    at org.elasticsearch.index.store.Store.verify(Store.java:365)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restoreFile(BlobStoreIndexShardRepository.java:843)
    at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:784)
    ... 6 more
[2015-02-25 08:18:10,826][WARN ][cluster.action.shard ] [test-es-cluster-1e-data-2] [lst_p113_v_4_20140615_0000][0] sending failed shard for [lst_p113_v_4_20140615_0000][0], node[shNgLjr8RlW7Zrk3P4UdPg], [P], restoring[aws-prod-elasticsearch-backup:listening-prod6-20150224], s[INITIALIZING], indexUUID [ZQKQ-6naQqeLP1Gk8IFsig], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[lst_p113_v_4_20140615_0000][0] failed recovery]; nested: IndexShardRestoreFailedException[[lst_p113_v_4_20140615_0000][0] restore failed]; nested: IndexShardRestoreFailedException[[lst_p113_v_4_20140615_0000][0] failed to restore snapshot [listening-prod6-20150224]]; nested: IndexShardRestoreFailedException[[lst_p113_v_4_20140615_0000][0] Failed to recover index]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=1lvsjli actual=3awj8p resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@7266a49d)]; ]]
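
The trace above fails in Adler32VerifyingIndexOutput, so the expected= and actual= values are presumably Adler-32 checksums. A minimal Python sketch for re-checking a reassembled segment file by hand might look like the following. It assumes that Elasticsearch prints these checksums in base 36 and that the checksum covers the whole file; both are assumptions to verify against your version, and the function names are made up for illustration.

```python
import zlib

_DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def to_base36(value):
    """Render a non-negative integer in base 36 (digits 0-9, then a-z)."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value == 0:
        return "0"
    out = []
    while value:
        value, rem = divmod(value, 36)
        out.append(_DIGITS[rem])
    return "".join(reversed(out))


def adler32_of_file(path, chunk_size=1 << 20):
    """Compute the Adler-32 checksum of a file, reading it in chunks."""
    checksum = 1  # Adler-32 starts from 1
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            checksum = zlib.adler32(chunk, checksum)
    return checksum & 0xFFFFFFFF


def matches_expected(path, expected_base36):
    """Compare a file's Adler-32 with a logged value (assumed to be base 36)."""
    return adler32_of_file(path) == int(expected_base36, 36)
```

With a segment file downloaded from S3 and reassembled locally, matches_expected(path, "1lvsjli") would show whether the stored copy already disagrees with the recorded checksum, which points at the snapshot data rather than at the node doing the restore.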

Even when we go back to an older snapshot, we get the same exception.

So we downloaded all the segment files from S3, merged them, and ran
org.apache.lucene.index.CheckIndex with -fix, which showed that some
segments were corrupted.
We fixed the index, but we lost 5 GB of data.
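
For anyone repeating this, here is a rough Python sketch of how the CheckIndex call could be assembled. The jar path and index directory are placeholders, not values from this thread. Running the tool without -fix first only reports problems; with -fix, Lucene 4.x drops the unreadable segments along with every document in them, which would explain the kind of data loss described above.

```python
import subprocess


def check_index_command(lucene_core_jar, index_dir, fix=False):
    """Build the argv for Lucene's CheckIndex tool.

    lucene_core_jar and index_dir are caller-supplied placeholders.
    """
    cmd = [
        "java",
        "-cp", lucene_core_jar,
        # The trailing "..." is Java syntax: enable assertions for the
        # package and all of its subpackages. It is not an elision.
        "-ea:org.apache.lucene...",
        "org.apache.lucene.index.CheckIndex",
        index_dir,
    ]
    if fix:
        # Destructive: removes segments that cannot be read.
        cmd.append("-fix")
    return cmd


def run_check_index(lucene_core_jar, index_dir, fix=False):
    """Run CheckIndex and return its exit code (requires java on PATH)."""
    return subprocess.call(check_index_command(lucene_core_jar, index_dir, fix))
```

It is worth copying the shard directory somewhere safe and stopping the node before passing fix=True.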

We reported this problem to the elasticsearch-cloud-aws team, but they have
not replied so far.

Could you please look into this issue and suggest something?

Thanks

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/9a7b1069-22b2-401b-a40c-096eb12db937%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Hi, this happened to me as well, but in my case I didn't see anything
related to "checksum failed (hardware problem?)". The service started with
5 shards and we started sending documents using the Python driver.

The steps we took:

  1. Created 3 indexes
  2. Sent various JSON objects using curl
  3. Queried the data
  4. Deleted all the indexes, one by one
  5. Created a new index (with the same name as a previous index)
  6. Sent documents using the Python driver

I installed Elasticsearch from the [elasticsearch-1.4] yum repository, with
java-1.8.0-openjdk-1.8.0.31.

Below is the backtrace:
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [uptrack][1] failed to recover shard
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:287)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
    at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:70)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:257)
    ... 4 more
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: No version type match [54]
    at org.elasticsearch.index.VersionType.fromValue(VersionType.java:307)
    at org.elasticsearch.index.translog.Translog$Create.readFrom(Translog.java:374)
    at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
    ... 5 more
[2015-03-01 13:00:21,450][WARN ][cluster.action.shard ] [Wild Thing] [uptrack][1] sending failed shard for [uptrack][1], node[XWoDZtiyTh69cKSKtVZsSg], [P], s[INITIALIZING], indexUUID [3KHKormcQWOSBvC_M5LFXA], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[uptrack][1] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: ElasticsearchIllegalArgumentException[No version type match [54]]; ]]

(In reply to Sukanta Saha's message of Monday, 2 March 2015 08:26:28 UTC-3.)

Hi,

I just had this problem too after upgrading from 1.2.1 to 1.4.1.
Basically, if your ES was recently upgraded to 1.4.x from a version before
1.4.0, this is a very common situation:
https://github.com/elasticsearch/elasticsearch/issues/9922
In our case, right after upgrading we got corrupt shards, loaded snapshots,
reindexed the delta, got more broken shards on node restarts,
and so on... until at some point it just stopped happening.
We also reached the point where every snapshot failed (although it had
worked literally the day before).
Being insane
(http://www.quotehd.com/imagequotes/TopAuthors/albert-einstein-physicist-insanity-doing-the-same-thing-over-and-over-again-and.jpg)
helped: after repeated attempts, at some point the snapshot recovery worked.

I just had the issue again, with one shard stuck in "initializing" status
when restarting the nodes after an upgrade.
Now that I've browsed hundreds of Elasticsearch-related documents, I know
better.
First, identify the broken shard:
  curl -XGET 'http://localhost:9200/_cluster/state?pretty=true' > foo.json
The broken shard(s) will have "INITIALIZING" status along with the ID of the
node they are assigned to, which tells you where the blocked shard is.
If you're using one replica and the other copy of the shard has actually
started, you can just rm the directory containing the broken shard (the one
you ran the CheckIndex tool on) and it should rebuild itself from the
other, uncorrupted copy (on another node).
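
To avoid reading foo.json by eye, something like the Python sketch below could list the stuck shards. It assumes the 1.x cluster state layout (routing_table -> indices -> <index> -> shards -> <shard number> -> list of copies, plus a top-level nodes map); check this against your own output. The function name is made up.

```python
import json


def initializing_shards(cluster_state):
    """Return (index, shard, node_id, node_name, is_primary) tuples for
    every shard copy whose state is INITIALIZING."""
    nodes = cluster_state.get("nodes", {})
    indices = cluster_state.get("routing_table", {}).get("indices", {})
    found = []
    for index_name, index_entry in indices.items():
        for shard_id, copies in index_entry.get("shards", {}).items():
            for copy in copies:
                if copy.get("state") != "INITIALIZING":
                    continue
                node_id = copy.get("node")
                node_name = nodes.get(node_id, {}).get("name")
                found.append((index_name, int(shard_id), node_id,
                              node_name, bool(copy.get("primary"))))
    # Sort by index name and shard number only; node IDs may be None.
    return sorted(found, key=lambda entry: (entry[0], entry[1]))


def load_cluster_state(path):
    """Load a cluster state dump such as the foo.json written by curl."""
    with open(path) as fh:
        return json.load(fh)
```

Calling initializing_shards(load_cluster_state("foo.json")) then gives the index, shard number and node for each blocked copy, which is what you need to find the directory on disk.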

Now, if you have a lot of corruption problems, take a look at
https://github.com/elasticsearch/elasticsearch/pull/7580 and at the output
of your java -version.
Upgrading to Java 7 update 55 (1.7.0_55) or later is a requirement anyway,
according to the Elasticsearch website (I can't remember exactly where I saw
it).
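
If you have many nodes to check, a small Python helper along these lines could compare the java -version output against that minimum. It is only a sketch: it assumes the old "1.x.y_zz" version string format, and the default threshold is the one mentioned above, not something I have re-verified.

```python
import re

# Matches e.g.: java version "1.7.0_55"  or  openjdk version "1.8.0_31"
_VERSION_RE = re.compile(r'version "(\d+)\.(\d+)\.(\d+)(?:_(\d+))?')


def parse_java_version(version_output):
    """Extract (major, minor, patch, update) from `java -version` output."""
    match = _VERSION_RE.search(version_output)
    if match is None:
        raise ValueError("no Java version string found")
    major, minor, patch, update = match.groups()
    return (int(major), int(minor), int(patch), int(update or 0))


def meets_minimum(version_output, minimum=(1, 7, 0, 55)):
    """True if the reported Java version is at least `minimum`."""
    return parse_java_version(version_output) >= minimum
```

Note that java -version prints to stderr, so capture that stream (not stdout) when feeding the output to meets_minimum.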

Also, the CheckIndex tool does not repair anything; in my experience it just
destroys the index and creates a new one.

It feels like your snapshots have nothing to do with it, so don't expect
anything from the AWS team.

Also, 1.4.0 and 1.4.1 have dreadful snapshot/restore and perm generation
bugs, so I'd avoid these two.

What are your versions?

Hope this helped.

(In reply to Sukanta Saha's message of Monday, 2 March 2015 12:26:28 UTC+1.)