I currently have 5 dedicated Kubernetes worker nodes for an Elasticsearch (ES) cluster; each node has 8 CPUs and 32 GB of memory.
The 5 ES pods are deployed through the ECK operator on Kubernetes, and every pod has all roles (no separate master, data, etc.). ECK is excellent at managing ES!
Each ES pod has 7 CPUs, a 16 GB heap, and a 300 GB provisioned IOPS EBS volume attached (4500 IOPS per volume).
Zipkin sends application traces to ES, but many traces are dropped because the ES write thread pool is rejecting writes. CPU and heap look fine on the ES side, but the disk is saturated, so ES can't keep up.
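For reference, here is a minimal sketch of how the write rejections can be checked per node. It assumes the `_cat/thread_pool/write` API is reachable; `ES_URL` is a placeholder rather than my real endpoint, and the helper names are made up for illustration. An ECK-managed cluster normally needs TLS and credentials as well, which are left out here.

```python
import json
from urllib.request import urlopen

# Placeholder endpoint (assumption); an ECK cluster usually needs HTTPS and auth.
ES_URL = "http://localhost:9200"


def fetch_write_pool(es_url=ES_URL):
    """Return the rows of GET _cat/thread_pool/write as a list of dicts."""
    url = (
        f"{es_url}/_cat/thread_pool/write"
        "?format=json&h=node_name,name,active,queue,rejected"
    )
    with urlopen(url) as resp:
        return json.load(resp)


def summarize_rejections(rows):
    """Map node name -> cumulative rejected count (the _cat API returns strings)."""
    return {row["node_name"]: int(row["rejected"]) for row in rows}


def nodes_with_rejections(rows):
    """Sorted names of nodes whose write pool has rejected at least one task."""
    counts = summarize_rejections(rows)
    return sorted(name for name, count in counts.items() if count > 0)
```

Calling `nodes_with_rejections(fetch_write_pool())` at intervals shows whether the `rejected` counter keeps growing on specific nodes; the counter is cumulative since node start, so only the change between two samples matters.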
I need to increase the IOPS from 4500 to something higher, e.g. 9000, but that is expensive, and 9000 IOPS will also hit its limit at some point, so I am looking for better alternatives.
According to https://www.elastic.co/guide/en/elasticsearch/plugins/master/cloud-aws-best-practices.html#_storage, instance store is recommended, but since it is an ephemeral volume, I am worried about data loss if 2 out of 5 ES pods go down because of a problem with the underlying worker nodes.
On the other hand, if the same thing happens to an ES cluster backed by EBS volumes, there will be no (or less) data loss, and the cluster can be recovered when the worker nodes come back online.
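To make the data loss concern concrete, here is a rough sketch of the reasoning as I understand it, assuming a failed instance store node loses its disk for good and that the failures happen before ES can re-replicate. A shard is lost only if every copy of it (primary plus replicas) sits on a failed node. The allocation maps and function names below are hypothetical, not taken from my cluster.

```python
from itertools import combinations


def lost_shards(allocation, failed_nodes):
    """Shards whose every copy (primary + replicas) lives on a failed node."""
    failed = set(failed_nodes)
    return sorted(shard for shard, holders in allocation.items() if holders <= failed)


def survives_any_failure(allocation, nodes, k):
    """True if no shard is lost for any choice of k simultaneously failed nodes."""
    return all(not lost_shards(allocation, combo) for combo in combinations(nodes, k))


nodes = ["n1", "n2", "n3", "n4", "n5"]

# Hypothetical layout with number_of_replicas = 1 (2 copies per shard).
one_replica = {"s0": {"n1", "n2"}, "s1": {"n3", "n4"}}

# Hypothetical layout with number_of_replicas = 2 (3 copies per shard).
two_replicas = {"s0": {"n1", "n2", "n3"}, "s1": {"n3", "n4", "n5"}}

print(survives_any_failure(one_replica, nodes, 2))   # False
print(survives_any_failure(two_replicas, nodes, 2))  # True
```

Under these assumptions, losing 2 of 5 nodes at the same time can only destroy a shard that has 2 copies or fewer, so the number of replicas seems to be the main knob for ephemeral storage, at the cost of extra disk space and write load.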
Can someone offer guidance and share their experience with using ephemeral instance store for ES?