I'm using Elastic 7.17.3 in a 3-node cluster on Windows. We have Redis as an ingest and cache database and run Logstash to transfer data from Redis to the cluster. So far so good.
Now I'm encountering issues like destroyed SSD drives every now and then. We'd had the cluster running for almost 2 years when one SSD got corrupted and was done. We replaced it, and another one went down 2 months later. Then the climax: two nodes went down within a day. So far so not so good ...
We've analyzed the hard disks and got the following data (read with CrystalDiskInfo):
This disk is readable, although it cannot be used any longer since we cannot format it or delete data on it.
This behavior is pretty much the same on all the other SSDs that got corrupted over time.
So my question is: what are we doing wrong with this cluster? Can Elastic somehow be set up to not make that huge amount of writes, or is this normal for a cluster of this size? Has anybody else encountered this or a similar issue?
Our data (6 indices) is approx. 800-900 GB all in all, so not really that much ...
We are constantly importing data with Logstash. It's not too much, but it is more or less constant, 365 days a year. Is this a problem, and how should we overcome it? We've had a single-node cluster where we duplicate the data, and there it is no problem at all. Maybe the rebalancing and replicating of the data makes things different and also leads to this huge amount of data written? I don't know, to be honest ...
We are only importing data: we have a Redis cache from which Logstash imports all keys into Elasticsearch for further processing. For now, no data is deleted. Could it be that we have to reconfigure Logstash? We are not doing any conversions or the like, just a transfer from Redis to Elastic.
health status index                           uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   .geoip_databases                AwH3KKrtRxScQSH4yMPHcA 1   0   40         40           37.7mb     37.7mb
green  open   .kibana_task_manager_7.17.3_001 dN_O0elCSYSJxllgWZSvgA 1   0   17         452473       128.9mb    128.9mb
green  open   .apm-custom-link                akvx0XUDQdWFL0HpYeJXZw 1   0   0          0            226b       226b
yellow open   parallel                        _SrV8OE3RlynylAIGLymWw 1   1   15978223   0            2gb        2gb
green  open   .apm-agent-configuration        XktJLBlrSNGOge8Su9_H7A 1   0   0          0            226b       226b
yellow open   ethercatdata                    ZdNsBIFuSnmyJ4vCjojbcA 1   1   3228353    0            854.8mb    854.8mb
green  open   .async-search                   wwlChbkZTQecHyphNDeFGw 1   0   0          0            252b       252b
yellow open   batchdata                       TljGnOnmQd-5ZSPyKzJ7LA 1   1   15978239   0            1.6gb      1.6gb
yellow open   modbusdata                      7tWIQap8RXyWtI0jl9Xdgw 1   1   3226520    0            915.3mb    915.3mb
yellow open   criodata                        VneVTPKLSh64KJ2zJ6cNIQ 1   1   3228355    0            1.5gb      1.5gb
green  open   .kibana_7.17.3_001              3GoRaHAWSTCtVX8MUAaiRg 1   0   21         6            2.4mb      2.4mb
There is only one node (in a single-node cluster) available; the others are down because of the problems mentioned above. One more node would be operable, but since it's the last node of the 3-node cluster I didn't manage to get it up and running again. I haven't tried the node tools yet!
It does not look like you are updating or deleting any data, and the indices are quite small, so I am not sure what would result in so many writes. Let's see if anyone else has any ideas or suggestions.
I think a good next step is to see what is actually using the disk here, and to confirm that Elasticsearch is really the issue. You can use the diskio metricset of the Metricbeat System module to track disk usage. I don't think it directly shows which processes are using the disk, but it should at least give you an idea. You might need to look at some other tools to see disk usage at the process level.
Note: If you can, I'd recommend sending this monitoring data to a secondary Elasticsearch cluster on different nodes to:
Not affect the performance of your current cluster
Not adversely affect the diskio metrics by collecting data then writing data to the same disks, thus further increasing the disk usage.
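For the process-level view mentioned above, one option is a small script that samples per-process write counters twice and ranks processes by write rate. This is only a rough sketch, not something from this thread: it assumes Python and the third-party psutil package are installed on the node, and the function names (`snapshot_write_bytes`, `top_writers`, `report`) are made up for illustration. On Windows the per-process counters may also include I/O that never reaches the SSD, so treat the numbers as a hint rather than a measurement.

```python
import time


def snapshot_write_bytes():
    """Return {(pid, name): cumulative bytes written} for every readable process."""
    import psutil  # third-party package; assumed installed (pip install psutil)

    snapshot = {}
    for proc in psutil.process_iter(attrs=["pid", "name", "io_counters"]):
        io = proc.info["io_counters"]
        if io is None:  # access denied for this process, so no counters
            continue
        snapshot[(proc.info["pid"], proc.info["name"])] = io.write_bytes
    return snapshot


def top_writers(before, after, interval_s, limit=5):
    """Rank processes by bytes written per second between two snapshots."""
    rates = []
    for key, end_bytes in after.items():
        start_bytes = before.get(key)
        if start_bytes is None:  # process started during the interval; skip it
            continue
        rates.append((key, (end_bytes - start_bytes) / interval_s))
    rates.sort(key=lambda item: item[1], reverse=True)
    return rates[:limit]


def report(interval_s=60):
    """Take two snapshots interval_s seconds apart and print the heaviest writers."""
    before = snapshot_write_bytes()
    time.sleep(interval_s)
    after = snapshot_write_bytes()
    for (pid, name), rate in top_writers(before, after, interval_s):
        print(f"{name} (pid {pid}): {rate / 1e6:.2f} MB/s written")
```

Calling `report()` from an elevated prompt should show whether the Elasticsearch JVM, Redis, Logstash, or something else entirely is doing most of the writing.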
Edit: To further contextualize this question, I have a cluster which processes ~1TB/day across 6 hot/content nodes. Each node has its own backing SSD and has been running ~1.25 years, and each SSD only has ~1.4PB written to it. So Elasticsearch writing 340PB at your scale is kind of insane. (Note: I run on Linux/Kubernetes, so the infrastructure isn't the same, but unless there is some bug on Windows, I doubt Elasticsearch would be the issue here.)
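To put the two lifetime-write figures side by side, here is a back-of-the-envelope calculation of the average write rate each one implies. It is a sketch under assumptions: it takes the 340PB figure at face value, assumes the failed SSD was in service for the roughly 2 years mentioned in the original post, and uses decimal units (1 PB = 10^15 bytes). The function name is made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def sustained_write_rate_mb_s(total_written_pb, years):
    """Average write rate in MB/s implied by a lifetime-writes figure (decimal units)."""
    total_bytes = total_written_pb * 1e15
    return total_bytes / (years * SECONDS_PER_YEAR) / 1e6


# 340 PB over an assumed ~2 years of service (the failed SSD)
reported = sustained_write_rate_mb_s(340, 2)
# 1.4 PB over ~1.25 years (one node of the 6-node reference cluster)
reference = sustained_write_rate_mb_s(1.4, 1.25)

print(f"failed SSD: {reported:,.0f} MB/s, reference node: {reference:,.1f} MB/s")
```

Under these assumptions the failed SSD would have had to absorb roughly 5,400 MB/s around the clock, against about 35 MB/s on the reference node. A SATA link tops out at around 600 MB/s, so if these are SATA drives the counter itself may be worth double-checking.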
Thank you Ben, I totally agree. I've been using Elastic for a while now (mostly on smaller clusters) and never had such an issue, but this one drives me insane. It might be, and this is something we have to check further, that this batch of SSDs is somehow problematic, since it's the third SSD we've had with such an issue. Metricbeat is a good idea. I'll set it up as you said, since we are going to split from one cluster with 3 nodes into 3 clusters with one node each ... I think this should serve the purpose, and I wonder why we haven't done it before!