Hello,
I've got a cluster of 140 nodes with 128GB RAM each, currently running a 10GB heap (it was 30GB, but GC pauses were so long that boxes regularly disconnected from the cluster).
3 dedicated masters, each with 128GB RAM and a 20GB ES heap.
Daily indexes.
40 shards per index.
365-day retention is intended, but we're currently only at around 4 months.
Circa 2 to 3 billion records per day.
Records have 30 fields on average, but can have up to 40.
Around 600GB per index.
Plus 1 replica.
2x10Gbit bonded Ethernet LAN on all nodes.
We had it all working and ran a Spark job reading from HDFS and writing to ES at around 150k events per second for about a month.
It reached around 300 billion records, but then a single node fell over (a physical memory failure killed the box), and since then the index rate struggles to go over 20k/second. I have fixed the node and it's back in service, but the poor rate persists. I know it's not the Spark cluster running slowly, as I have other Spark jobs running fine with higher I/O.
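For context, the write path is roughly this shape (the HDFS path, hosts, index name and batch settings below are placeholders rather than the exact job):

```python
# Rough sketch of the ingest: read a day of events from HDFS and
# bulk-write them to the daily index via elasticsearch-hadoop.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-to-es-ingest").getOrCreate()

# Hypothetical path/format for one day of source data
events = spark.read.parquet("hdfs:///data/events/2016-06-01/")

(events.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "es-node-01,es-node-02")   # hypothetical hosts
    .option("es.batch.size.entries", "5000")       # bulk sizing we experiment with
    .option("es.batch.size.bytes", "15mb")
    .option("es.batch.write.refresh", "false")
    .mode("append")
    .save("events-2016.06.01/event"))              # daily index / type
```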
Any suggestions for finding the bottleneck? Or could it be a coincidence: have I simply hit the limits of having too many indexes/shards and need to scale out with more nodes?
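To put numbers on the "too many shards" worry, here's a quick back-of-the-envelope from the figures above (daily indexes, 40 primaries, 1 replica, ~600GB of primary data per day):

```python
# Shard counts now (~4 months) vs full 365-day retention.
primaries_per_day = 40
copies = 2                 # primaries + 1 replica
nodes = 140
gb_per_index = 600

for days in (120, 365):
    shards = days * primaries_per_day * copies
    print(f"{days} days: {shards} shards total, "
          f"~{shards / nodes:.0f} per node, "
          f"~{gb_per_index / primaries_per_day:.0f}GB per primary shard")
# 120 days: 9600 shards,  ~69 per node, ~15GB per primary shard
# 365 days: 29200 shards, ~209 per node, ~15GB per primary shard
```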
Most boxes' heaps are showing 50-60% used, though a few are high at 90%+.
Disk is at 20% used.
I'm not seeing rejections on the thread pools or any nodes dying. I tried closing some old indexes to see if the speed increased, but nothing changed.
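For reference, these figures can be pulled straight from the cat APIs with something like the following (the host is a placeholder; any node in the cluster will do):

```python
import requests

ES = "http://es-node-01:9200"

# Heap usage per node (the 50-60% / 90%+ figures)
print(requests.get(f"{ES}/_cat/nodes?v&h=name,heap.percent,ram.percent").text)

# Disk usage and shard counts per node (the ~20% disk figure)
print(requests.get(f"{ES}/_cat/allocation?v").text)

# Thread pool queues/rejections, where I'd expect bulk back-pressure to show up
print(requests.get(f"{ES}/_cat/thread_pool?v").text)
```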
Query load is light: maybe 20 queries per day across the full range, pulling back around 10k events on average.
I'm not scared of trashing it and starting again if anyone has previous experience at this scale, so don't worry if the fix is destructive; my source data is on HDFS and I can restart the Spark ingests.
Grateful for any help!