Elasticsearch shard corrupted

Selvam_ayyanar · March 18, 2017, 11:52am

Hi All,

We are getting the error below on our clusters, and shard number 2 goes unassigned.

[2017-03-18 04:17:37,072][WARN ][indices.cluster ] [node-1] [index1][2] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [index1][2] failed to fetch index version after copying it over
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:152)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.index.CorruptIndexException: [index1][2] Corrupted index [corrupted_edCfNV3vTemEXmU0w_bSDQ] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=1oapn8l actual=16cchb4 (resource=name [_dvzo_Lucene49_0.dvd], length [51695646], checksum [1oapn8l], writtenBy [LUCENE_4_9])]
at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:434)
at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:419)
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:119)
... 4 more

Please help us debug this further.

We are using Elasticsearch 1.3.7.
Please don't tell us to upgrade the cluster; the upgrade is already in progress.
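
When triaging an error like this, it can help to pull the affected index, shard, file and checksums out of the log line. The following is only a minimal sketch: the regular expression is written against the exact message format quoted above, and the names `CORRUPTION_RE`, `parse_corruption` and `SAMPLE` are made up for illustration. Other Elasticsearch or Lucene versions may word the message differently.

```python
import re

# Pattern written against the "Caused by" line quoted in this post.
# Assumption: other versions may format this message differently.
CORRUPTION_RE = re.compile(
    r"\[(?P<index>[^\]]+)\]\[(?P<shard>\d+)\] "
    r"Corrupted index \[(?P<marker>corrupted_[^\]]+)\]"
    r".*?expected=(?P<expected>\w+) actual=(?P<actual>\w+)"
    r" \(resource=name \[(?P<file>[^\]]+)\], length \[(?P<length>\d+)\]"
)


def parse_corruption(line):
    """Return a dict describing the corrupted file, or None if no match."""
    match = CORRUPTION_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["shard"] = int(info["shard"])
    info["length"] = int(info["length"])
    return info


# The log line from this post, split across string literals for readability.
SAMPLE = (
    "Caused by: org.apache.lucene.index.CorruptIndexException: [index1][2] "
    "Corrupted index [corrupted_edCfNV3vTemEXmU0w_bSDQ] caused by: "
    "CorruptIndexException[checksum failed (hardware problem?) : "
    "expected=1oapn8l actual=16cchb4 (resource=name [_dvzo_Lucene49_0.dvd], "
    "length [51695646], checksum [1oapn8l], writtenBy [LUCENE_4_9])]"
)

if __name__ == "__main__":
    print(parse_corruption(SAMPLE))
```

Running the same function over a whole log file would show whether one file or several are affected, which may hint at how widespread the damage is.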

dadoonet · March 18, 2017, 12:14pm

Then I'm speechless.

Can you reindex?

Old versions did not have checksums, so this could happen when moving shards or merging.

Upgrading, and staying at least on the most recent release of the major version you are using, would definitely have helped to avoid this, or at least to discover this kind of problem sooner.

Maybe you could try starting a 5.2 cluster and see whether the reindex-from-remote API could help you.
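
For reference, a reindex-from-remote request on the new cluster might look roughly like the sketch below. The host names, the function names `build_reindex_body` and `start_reindex`, and the destination index name are placeholders, not something taken from this thread. It also assumes the old cluster's address has been added to `reindex.remote.whitelist` in the new cluster's `elasticsearch.yml`. Documents that only exist in the unassigned shard would presumably not be copied, since the old cluster cannot serve them.

```python
import json
import urllib.request


def build_reindex_body(remote_host, source_index, dest_index):
    """Build the JSON body for POST /_reindex with a remote source.

    remote_host must include scheme, host and port,
    e.g. "http://old-cluster:9200" (a placeholder name).
    """
    return {
        "source": {
            "remote": {"host": remote_host},
            "index": source_index,
        },
        "dest": {"index": dest_index},
    }


def start_reindex(new_cluster_url, body):
    """Send the reindex request to the new cluster and return its reply."""
    request = urllib.request.Request(
        new_cluster_url.rstrip("/") + "/_reindex",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder addresses: adjust to your own old and new clusters.
    body = build_reindex_body("http://old-cluster:9200", "index1", "index1")
    print(json.dumps(body, indent=2))
    # Uncomment once the new cluster is up and the whitelist is configured:
    # print(start_reindex("http://new-cluster:9200", body))
```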

Good luck.

Selvam_ayyanar · March 18, 2017, 12:19pm

Hi @dadoonet

I found the issue: it is caused by the SATA cable or the disk.

We are getting the output below in dmesg and /var/log/messages:

ata1.00: failed command: READ FPDMA QUEUED
ata1.00: cmd 60/00:b8:e0:10:21/01:00:66:00:00/40 tag 23 ncq 131072 in
res 40/00:b0:50:25:25/00:00:36:00:00/40 Emask 0x1 (device error)
ata1.00: status: { DRDY }
ata1.00: both IDENTIFYs aborted, assuming NODEV
ata1.00: revalidation failed (errno=-2)
ata1: hard resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:0:0: [sda] Sense Key : Aborted Command [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
36 25 25 50
sd 0:0:0:0: [sda] Add. Sense: No additional sense information
sd 0:0:0:0: [sda] CDB: Read(10): 28 00 66 77 b0 08 00 00 f8 00
end_request: I/O error, dev sda, sector 1719119880
ata1: EH complete
EXT4-fs error (device sda3): __ext4_get_inode_loc: unable to read inode block - inode=2494016, block=9961731
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata1.00: irq_stat 0x40000001
ata1.00: failed command: READ DMA EXT
ata1.00: cmd 25/00:f0:10:b0:77/00:00:66:00:00/e0 tag 19 dma 122880 in
res 51/40:00:c8:b0:77/00:00:66:00:00/00 Emask 0x9 (media error)
ata1.00: status: { DRDY ERR }
ata1.00: error: { UNC }
ata1.00: configured for UDMA/133
sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor]
Descriptor sense data with sense descriptors (in hex):
72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
66 77 b0 c8
sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed
sd 0:0:0:0: [sda] CDB: Read(10): 28 00 66 77 b0 10 00 00 f0 00
end_request: I/O error, dev sda, sector 1719120072
ata1: EH complete

=======================

Mar 12 03:21:13 SERVER_NAME kernel: end_request: I/O error, dev sda, sector 1719119872
Mar 12 03:21:13 SERVER_NAME kernel: ata1: EH complete
Mar 12 03:21:13 SERVER_NAME kernel: ata1.00: exception Emask 0x0 SAct 0x300 SErr 0x0 action 0x0
Mar 12 03:21:13 SERVER_NAME kernel: ata1.00: irq_stat 0x40000008
Mar 12 03:21:13 SERVER_NAME kernel: ata1.00: failed command: READ FPDMA QUEUED
Mar 12 03:21:13 SERVER_NAME kernel: ata1.00: cmd 60/f8:40:08:b0:77/00:00:66:00:00/40 tag 8 ncq 126976 in
Mar 12 03:21:13 SERVER_NAME kernel: res 41/40:f8:c8:b0:77/00:00:66:00:00/00 Emask 0x409 (media error)
Mar 12 03:21:13 SERVER_NAME kernel: ata1.00: status: { DRDY ERR }
Mar 12 03:21:13 SERVER_NAME kernel: ata1.00: error: { UNC }
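
To see how many distinct sectors are failing, the I/O error lines can be extracted from the kernel log. This is a small sketch that only matches the `end_request: I/O error, dev ..., sector ...` format shown above; newer kernels may word the message differently. The names `IO_ERROR_RE`, `failing_sectors` and `SAMPLE` are illustrative.

```python
import re
from collections import defaultdict

# Matches the kernel message format seen in the output above.
# Assumption: other kernel versions may use a different wording.
IO_ERROR_RE = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def failing_sectors(log_text):
    """Map each device name to a sorted list of distinct failing sectors."""
    sectors = defaultdict(set)
    for match in IO_ERROR_RE.finditer(log_text):
        sectors[match.group(1)].add(int(match.group(2)))
    return {device: sorted(found) for device, found in sectors.items()}


# The three I/O error lines from the output above.
SAMPLE = """\
end_request: I/O error, dev sda, sector 1719119880
end_request: I/O error, dev sda, sector 1719120072
Mar 12 03:21:13 SERVER_NAME kernel: end_request: I/O error, dev sda, sector 1719119872
"""

if __name__ == "__main__":
    for device, found in failing_sectors(SAMPLE).items():
        print(device, len(found), "distinct failing sectors:", found)
```

A handful of sectors clustered close together, as here, fits the "media error" and "UNC" messages in the log.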

Balamurali · March 20, 2017, 6:27am

Hi all,

Is this the reason for the shard corruption?

Balamurali · March 29, 2017, 12:25pm

Hi all,

Any update on this?

dadoonet · March 29, 2017, 12:42pm

Most likely.

A hard drive that is dying can probably corrupt files. And he said:

I found the issue: it is caused by the SATA cable or the disk.

system · April 26, 2017, 12:42pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
