Debugging performance decrease after a node fault

Hello,
I've got a cluster of 140 nodes with 128GB of RAM each, currently running a 10GB heap. It was 30GB, but GC pauses were so long that boxes regularly disconnected from the cluster.
3 dedicated masters, each with 128GB RAM and a 20GB ES heap.
Daily indexes.
40 shards per index.
365-day retention is intended, but we're currently only at around 4 months.
Circa 2 to 3 billion records per day.
Records have 30 fields on average, up to 40.
Around 600GB per index.
Plus 1 replica.
2x10Gbit bonded Ethernet LAN on all nodes.

We had it all working and were running a Spark job reading from HDFS and writing to ES at around 150k events per second for about a month.
It reached around 300 billion records, but then a single node fell over (a physical memory failure killed the box), and since then the index rate has struggled to get above 20k/second. I have fixed the node and it's back in service, but the poor rates persist. I know it's not the Spark cluster running slowly, as I have other Spark jobs running fine with higher I/O.
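
One thing I want to rule out when I'm back on site is whether the cluster is still recovering or relocating shards after the node rejoined, since that could throttle indexing. Roughly something like this (the host name is a placeholder; it just uses the standard recovery and health APIs):

```python
import requests

ES = "http://es-host:9200"  # placeholder coordinating node

# Any shard recoveries or relocations still running after the node rejoined?
print(requests.get(f"{ES}/_cat/recovery",
                   params={"active_only": "true", "v": "true"}).text)

# Counts of relocating/initializing/unassigned shards and pending cluster tasks
health = requests.get(f"{ES}/_cluster/health").json()
for key in ("relocating_shards", "initializing_shards",
            "unassigned_shards", "number_of_pending_tasks"):
    print(key, health[key])
```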

Any suggestions for finding the bottleneck? Or could it be a coincidence, and have I simply hit the limits of too many indexes/shards and need to scale out with more nodes?
Most boxes' heaps are showing 50-60% used, though a few are high, at 90%+.
Disk usage is at 20%.

I'm not seeing rejections on the thread pools and not seeing any nodes dying. I tried closing some old indexes to see if the speed increased, but nothing changed.
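
In case it helps, this is roughly the sort of check I'm running on the thread pools, plus a hot-threads sample to see where the busy nodes are actually spending their time (host is a placeholder; just a sketch against the stock APIs):

```python
import requests

ES = "http://es-host:9200"  # placeholder coordinating node

# Per-node thread pool stats: active threads, queue depth and rejections
print(requests.get(f"{ES}/_cat/thread_pool",
                   params={"v": "true",
                           "h": "node_name,name,active,queue,rejected"}).text)

# Sample what the busiest threads on each node are doing right now
print(requests.get(f"{ES}/_nodes/hot_threads",
                   params={"threads": "5"}).text)
```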

Query load is light: maybe 20 queries per day across the full range, pulling back 10k events on average.

I'm not afraid of trashing it and starting again if anyone has previous experience at this scale, so don't worry if it's a destructive fix; my source data is on HDFS, so I can restart the Spark ingests.

Grateful for any help!


Which version of Elasticsearch are you using?

How many daily indices are you creating? Am I reading this correctly that each daily index has 40 primary and 40 replica shards, with an average shard size of around 15GB?

Are all shards currently assigned?

Do you have any monitoring installed? If so, what does heap usage over time look like on the nodes?
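
If the monitoring UI is awkward to get at, a quick way to sanity-check assignment and balance is something along these lines (host is a placeholder; it only uses the cat APIs):

```python
import requests

ES = "http://es-host:9200"  # placeholder; any node will do

# Shards and disk used per node, to confirm data is spread evenly
print(requests.get(f"{ES}/_cat/allocation", params={"v": "true"}).text)

# List any shards that are not in the STARTED state
shards = requests.get(f"{ES}/_cat/shards",
                      params={"h": "index,shard,prirep,state"}).text
print("\n".join(line for line in shards.splitlines()
                if "STARTED" not in line))
```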

6.0.0
I currently have 150 indexes.
And yes, 40 primaries + 40 replicas.

All shards assigned, cluster health shows 100%.
Average size of 15GB sounds right (I can't access the cluster from home, so I can't verify).
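
If my sums are right, that works out to roughly this many shards spread across the cluster:

```python
# Quick back-of-the-envelope from the numbers above (replicas included)
indexes, primaries, copies, nodes = 150, 40, 2, 140
total_shards = indexes * primaries * copies       # 12,000 shards
print(total_shards, round(total_shards / nodes))  # ~86 shards per node
```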

I have Kibana and the basic X-Pack monitoring license.
Heap on the nodes showing high (90%+) usage is pretty flat. On the remaining nodes it was, from memory, going up and down, but not massively. Average heap usage was 60% across the cluster when I looked on Friday.
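
When I'm back on the cluster I'll pull per-node heap and the big on-heap consumers to see what's actually sitting on those 90% nodes; something along these lines (host is a placeholder, and I'm assuming the usual cat nodes columns):

```python
import requests

ES = "http://es-host:9200"  # placeholder

# Per-node heap plus segment memory, fielddata and query cache usage
print(requests.get(
    f"{ES}/_cat/nodes",
    params={"v": "true",
            "h": "name,heap.percent,segments.memory,"
                 "fielddata.memory_size,query_cache.memory_size"},
).text)
```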

Thanks for coming back so quickly.

I should have said: I was on 5.6 before this. I saw a similar slowdown, but took the opportunity to erase everything and move to 6.0.0 rather than migrate.
