Elasticsearch version (bin/elasticsearch --version):
JVM version (java -version):
Description of the problem including expected versus actual behavior:
In our production environment, we have encountered hardware failures several times. Each failure left one or more nodes half-dead, and then the whole cluster hung.
3 nodes: 24 Cores, 128GB memory, 31GB heap
Steps to reproduce:
We use the tc command to simulate the hardware failure and reproduce the problem:
- start the cluster
- run a heavy indexing load (about 50% CPU)
- use the tc command to randomly drop packets:
tc qdisc add dev eth0 root netem loss 50%
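For anyone repeating the steps above, here is a minimal sketch of a helper that builds the packet-loss command together with the matching commands to inspect and remove the rule. The function names and the DEV, LOSS, and APPLY variables are hypothetical and not part of our setup; only the `tc qdisc add` line comes from the steps above. By default it only prints the commands.

```shell
#!/bin/sh
# Hypothetical helper for the reproduction steps; names are illustrative only.
# DEV and LOSS are assumptions: adjust them to your interface and loss rate.
DEV="${DEV:-eth0}"
LOSS="${LOSS:-50%}"

# Print the command that starts dropping packets (same as the step above).
netem_add_cmd() {
    echo "tc qdisc add dev $DEV root netem loss $LOSS"
}

# Print the command that shows the qdisc currently attached to the interface.
netem_show_cmd() {
    echo "tc qdisc show dev $DEV"
}

# Print the command that deletes the root qdisc, restoring normal traffic.
netem_del_cmd() {
    echo "tc qdisc del dev $DEV root"
}

# Dry run by default: set APPLY=1 and run as root to execute the commands.
for cmd in "$(netem_add_cmd)" "$(netem_show_cmd)"; do
    if [ "${APPLY:-0}" = "1" ]; then
        $cmd
    else
        echo "$cmd"
    fi
done
```

After the test, the command printed by `netem_del_cmd` removes the rule so the node returns to normal networking.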
Has anyone encountered a similar problem? Any ideas on how to make the cluster tolerate this kind of hardware failure?
Thanks : )