For the last year I've been running Elasticsearch on a single 64 GB host as a proof of concept of sorts. It runs great and keeps up with any load I throw at it, but it's out of disk space.
Recently I received 5 shiny new Dell servers to use for Kubernetes. Each has 56 cores and 64 GB of RAM. I got k8s running, looked around for Dockerfiles for Elasticsearch, found the 'official' images, and found some k8s config files that looked good enough.
Got 3 master nodes running
Got 5 data nodes running
Got 5 ingest nodes running
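For reference, this is roughly how the node roles are split in my manifests. This is an illustrative excerpt of a data-node spec, not my exact files; the official image lets you pass Elasticsearch settings like `node.master` as environment variables:

```yaml
# Hypothetical excerpt from the data-node StatefulSet pod spec.
# Each tier gets one role; masters and ingest nodes flip these flags.
env:
  - name: node.master
    value: "false"
  - name: node.data
    value: "true"
  - name: node.ingest
    value: "false"
```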
I opened the taps on my Logstash processes (pulling from Kafka) and problems quickly surfaced. The master and ingest nodes are pretty quiet, but the data nodes are GCing heavily. I tried various settings between 8 and 32 GB for Xmx/Xms, but the end result is the same: this system can't keep up with the load.
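The heap changes were made via `ES_JAVA_OPTS`, with the container memory limit kept at roughly double the heap so the filesystem cache still has room. A sketch of what I mean (values illustrative, not a recommendation):

```yaml
# Illustrative: heap pinned to half the container memory limit,
# Xms == Xmx so the heap never resizes at runtime.
env:
  - name: ES_JAVA_OPTS
    value: "-Xms16g -Xmx16g"
resources:
  requests:
    memory: 32Gi
  limits:
    memory: 32Gi
```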
Connecting to the k8s containers, I see that none are using more than 150% of CPU time. My original system would commonly hit 50-100% per core.
The biggest difference I see between these data nodes and my original single-node system is that I was using G1GC, whereas the 'official' Docker images use -XX:+UseConcMarkSweepGC. In the early days of my single-node setup I ran into OOM kills until I changed the GC settings.
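If it matters, on the old host I had swapped the CMS lines in jvm.options for something like the following (my own settings, not anything shipped by Elastic; you'd mount an override like this into the containers via a ConfigMap):

```
## Replace the CMS flags in the image's jvm.options with G1.
## Remove -XX:+UseConcMarkSweepGC and the CMSInitiatingOccupancy
## lines, then add:
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30
```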
Clearly the elastic.co people know their business, so why am I struggling with this setup? What parameters can I check to see what's misconfigured?
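So far the only diagnostics I've run are spot checks like these against the cluster (standard node-info and hot-threads APIs, shown here port-forwarded to localhost):

```shell
# Confirm which JVM flags and heap each node is actually running with
curl -s 'localhost:9200/_nodes/jvm?pretty'

# See what the data-node threads are busy doing during the GC storms
curl -s 'localhost:9200/_nodes/hot_threads'
```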