Continuous disk failures on Elasticsearch nodes in heavy bulk-indexing environments

(mohankumar) #1

Hi All,

We maintain a production environment with about 17 TB of data and 40 billion documents on 10 boxes (each with 24 CPUs and 64 GB RAM).

For the last two months we have been seeing the following kernel errors:

kernel: sd 0:2:5:0: [sdf] Unhandled error code
kernel: sd 0:2:5:0: [sdf] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
kernel: sd 0:2:5:0: [sdf] CDB: Read(10): 28 00 45 9a 7b ff 00 01 00 00
kernel: sd 0:2:5:0: [sdf] Unhandled error code
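For readers trying to interpret log lines like these, here is a minimal sketch that counts "Unhandled error code" entries per device and decodes the Read(10) CDB. The function names are hypothetical, and the regular expression assumes the exact line format shown above. The CDB decoding follows the standard SCSI READ(10) layout (opcode 0x28, bytes 2-5 the logical block address, bytes 7-8 the transfer length in blocks).

```python
import re
from collections import Counter

# Sample lines copied from the post above.
LOG = """\
kernel: sd 0:2:5:0: [sdf] Unhandled error code
kernel: sd 0:2:5:0: [sdf] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
kernel: sd 0:2:5:0: [sdf] CDB: Read(10): 28 00 45 9a 7b ff 00 01 00 00
kernel: sd 0:2:5:0: [sdf] Unhandled error code
"""

# Assumed line shape: "kernel: sd <h:c:t:l>: [<device>] <message>"
LINE_RE = re.compile(r"kernel: sd (?P<addr>[\d:]+): \[(?P<dev>\w+)\] (?P<msg>.*)")


def count_errors_by_device(text):
    """Count 'Unhandled error code' lines per block device."""
    counts = Counter()
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match and match.group("msg").startswith("Unhandled error code"):
            counts[match.group("dev")] += 1
    return counts


def decode_read10(cdb_hex):
    """Decode a 10-byte SCSI READ(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # starting logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # number of blocks requested
    return lba, blocks


errors = count_errors_by_device(LOG)
lba, blocks = decode_read10("28 00 45 9a 7b ff 00 01 00 00")
print(dict(errors), hex(lba), blocks)
```

If the same device and nearby block addresses keep showing up, that points at one physical disk (here sdf) rather than at the whole box.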

Elasticsearch stopped automatically and the shards went to UNASSIGNED because of these disk failures.

We keep hitting this issue. Each time it happens, Elasticsearch stops automatically and we lose all the shard data on that particular box.

We write about 4 billion documents (approx. 2 TB) per day.

My questions are:

1) Is heavy Elasticsearch indexing the reason for the disk failures?

2) Will a single bad sector on one disk (12 hard disks per box) lead to data loss on all disks (the entire box)?
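To put the stated volume in perspective, here is a rough back-of-the-envelope sketch. It assumes that "2 TB" means 2 * 10^12 bytes of data, that the load is spread evenly over 24 hours, and that it lands evenly on all 10 boxes with 12 disks each. It ignores segment merges, translog writes and other write amplification, so real disk traffic will be higher. The function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ingest_stats(docs_per_day, bytes_per_day, nodes, disks_per_node):
    """Average ingest figures, assuming a perfectly even load."""
    cluster_mb_per_sec = bytes_per_day / SECONDS_PER_DAY / 1e6
    return {
        "avg_doc_bytes": bytes_per_day / docs_per_day,
        "docs_per_sec": docs_per_day / SECONDS_PER_DAY,
        "cluster_mb_per_sec": cluster_mb_per_sec,
        "per_disk_mb_per_sec": cluster_mb_per_sec / (nodes * disks_per_node),
    }


stats = ingest_stats(
    docs_per_day=4_000_000_000,   # from the post
    bytes_per_day=2 * 10**12,     # "2 TB", assumed decimal terabytes
    nodes=10,                     # from the post
    disks_per_node=12,            # from the post
)
for name, value in stats.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions the average document is about 500 bytes, and the raw ingest rate per disk is small. The figure only covers the incoming data, not the extra I/O that indexing and merging generate.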


(Mark Walkom) #2

I can't see how ES could directly cause disk failures. If the disks are not very high quality, heavy use could make the problem more apparent.

And for two, that depends on how the OS handles things.

(mohankumar) #3

Thanks Mark for your update.

I have another question regarding the second point. We have not enabled replicas in this scenario. Each box stores its data across 12 hard disks. If any one of those disks fails, all the shards on the box fail, and so does the entire box. Is there any way to avoid this?

Can you please explain how the OS handles these things? Thanks in advance.


(Mark Walkom) #4

If you only have one node then you can't do anything. Adding another node and a replica will help prevent data loss due to this.
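As a minimal sketch of the replica suggestion: the number of replicas is a dynamic index setting that can be changed through the index settings API (`PUT /<index>/_settings`). The host `http://localhost:9200`, the index name `my-index` and the helper name are assumptions for illustration only. The request is built but not sent.

```python
import json
import urllib.request


def build_replica_request(base_url, index, replicas=1):
    """Build (but do not send) a PUT request that sets number_of_replicas."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical host and index name; replace with your own.
req = build_replica_request("http://localhost:9200", "my-index", replicas=1)
print(req.get_method(), req.full_url, req.data.decode("utf-8"))

# To actually apply the setting, send the request:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```

Elasticsearch does not place a replica on the same node as its primary, so with one replica a box can be lost without losing data. Each replica is a full copy of the index, so one replica roughly doubles the disk space needed.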

I can't explain how the OS handles these sorts of things; there are better resources on the internet for that.

(mohankumar) #5

Thanks a lot Mark.
