Continuous Disk Failure in ElasticSearch Nodes in heavy bulk indexing environments

moni15moni · September 18, 2015, 12:25pm

Hi All,

We run a production environment of about 17 TB of data and 40 billion documents on 10 boxes (each with 24 CPUs and 64 GB RAM).

For the last two months we have been running into a problem, with kernel errors like these:

kernel: sd 0:2:5:0: [sdf] Unhandled error code
kernel: sd 0:2:5:0: [sdf] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
kernel: sd 0:2:5:0: [sdf] CDB: Read(10): 28 00 45 9a 7b ff 00 01 00 00
kernel: sd 0:2:5:0: [sdf] Unhandled error code
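For anyone reading the log above, the CDB line can be decoded to see which block range the failed read was targeting. The sketch below assumes the standard 10-byte SCSI READ(10) layout (opcode 0x28, a 4-byte big-endian logical block address, a 2-byte transfer length). The helper name `decode_read10` is made up for illustration, and the block size is not in the log, so the result is left in blocks rather than bytes.

```python
import re

def decode_read10(cdb_hex: str) -> dict:
    """Decode a SCSI READ(10) command descriptor block given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    return {
        "opcode": cdb[0],
        # Bytes 2..5: logical block address, big-endian
        "lba": int.from_bytes(cdb[2:6], "big"),
        # Bytes 7..8: number of blocks to transfer, big-endian
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }

# The CDB line from the kernel log in this post
line = "kernel: sd 0:2:5:0: [sdf] CDB: Read(10): 28 00 45 9a 7b ff 00 01 00 00"
match = re.search(r"\[(\w+)\] CDB: Read\(10\): ((?:[0-9a-f]{2} ?){10})", line)
device, cdb_hex = match.group(1), match.group(2)
info = decode_read10(cdb_hex)
print(device, info["lba"], info["blocks"])
```

Collecting the device name and LBA from every such line may help show whether the errors cluster on one disk or one region, or are spread across the controller.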

Elasticsearch stops automatically and the shards go to UNASSIGNED because of these disk failures.

We hit this issue continuously. Each time it happens, Elasticsearch stops automatically and we lose the data of all shards on that particular box.

We write about 4 billion documents, approximately 2 TB, per day.

My questions are:

1) Is heavy Elasticsearch indexing the reason for the disk failures?

2) Will a single bad sector on one disk (12 hard disks per box) lead to data loss on all disks (the entire box)?
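As a rough sense of scale, the daily figures in this post can be turned into per-second rates. This is only back-of-the-envelope arithmetic. It assumes decimal terabytes and an even spread of writes over the 10 boxes and 12 disks per box, and it counts only the incoming document volume. Actual disk writes will be higher, since the translog and segment merges rewrite data.

```python
DOCS_PER_DAY = 4_000_000_000   # from the post
BYTES_PER_DAY = 2 * 10**12     # 2 TB, assuming decimal terabytes
NODES = 10                     # from the post
DISKS_PER_NODE = 12            # from the post
SECONDS_PER_DAY = 86_400

avg_doc_bytes = BYTES_PER_DAY / DOCS_PER_DAY
docs_per_sec = DOCS_PER_DAY / SECONDS_PER_DAY
mb_per_sec_cluster = BYTES_PER_DAY / SECONDS_PER_DAY / 10**6
# Assumption: writes are spread evenly over nodes and disks
mb_per_sec_node = mb_per_sec_cluster / NODES
mb_per_sec_disk = mb_per_sec_node / DISKS_PER_NODE

print(f"average document size: {avg_doc_bytes:.0f} bytes")
print(f"documents per second:  {docs_per_sec:,.0f}")
print(f"cluster ingest:        {mb_per_sec_cluster:.2f} MB/s")
print(f"per node:              {mb_per_sec_node:.2f} MB/s")
print(f"per disk:              {mb_per_sec_disk:.3f} MB/s")
```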

Thanks
Moni

warkolm · September 19, 2015, 1:33am

I can't see how ES could directly cause disk failures. Maybe if the disks are not very high quality, heavy indexing could make this more apparent.

As for your second question, that depends on how the OS handles things.

moni15moni · September 21, 2015, 6:39am

Thanks Mark for your update.

I have a follow-up on the second question. We have not enabled replicas in this setup. Each box stores its data across 12 hard disks. If any one of those disks fails, it leads to the failure of all shards, and of the entire box. Is there any way to avoid this?

Can you please explain how the OS handles these things? Thanks in advance.

Moni

warkolm · September 21, 2015, 10:27pm

If you only have one node then you can't do anything. Adding another node and a replica will help prevent data loss due to this.

I can't explain how the OS handles these sorts of things; there are better resources on the internet for that.
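For reference, adding a replica as suggested above is done through the update index settings API, by setting `number_of_replicas` on the index. The sketch below only builds the HTTP request and does not send it. The host `http://localhost:9200`, the index name `my-index`, and the helper `replica_settings_request` are placeholders for illustration, not values from this thread.

```python
import json
import urllib.request

def replica_settings_request(base_url: str, index: str, replicas: int) -> urllib.request.Request:
    """Build a PUT /<index>/_settings request that sets number_of_replicas."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

# Placeholder host and index name; adjust for the actual cluster
req = replica_settings_request("http://localhost:9200", "my-index", 1)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To apply it against a live cluster: urllib.request.urlopen(req)
```

A replica is only ever allocated on a different node from its primary, so this helps only when the cluster has at least two data nodes, and it roughly doubles the disk space the index needs.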

moni15moni · September 22, 2015, 4:57am

Thanks a lot Mark.
