I'm having ingestion performance issues that I haven't gotten to the bottom of, and I'm quite new to the elastic stack, so I thought I'd seek advice here.
I have a cluster of 3 VMs (4 CPU/64GB RAM/500GB disk). RHEL 7.
Elasticsearch 7.8.0 is installed on all of them and configured as a cluster (transport encrypted, HTTP not encrypted). 26GB heap size, usually around 50% utilised.
The index has 3 primary shards with 2 replicas (high availability was a priority).
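For reference, the index was created with settings equivalent to the following (the index name is a placeholder):

```
PUT /my-logs-index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}
```

Note that with 2 replicas on a 3-node cluster, every node ends up holding a complete copy of the data, so each event is indexed three times.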
Logstash 7.8.0 is also installed on every box, with its output pointing at Elasticsearch on the same box. 4GB heap size, 8 pipeline workers, batch size of 125.
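Concretely, the Logstash settings amount to this (a sketch of logstash.yml and jvm.options, not the verbatim files):

```
# logstash.yml
pipeline.workers: 8
pipeline.batch.size: 125

# jvm.options
-Xms4g
-Xmx4g
```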
Logs are round-robined through a VIP to each of the boxes.
A csv filter is configured in Logstash to pull out 66 fields.
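The filter is along these lines (the column names here are placeholders; the real config lists all 66):

```
filter {
  csv {
    separator => ","
    # 66 columns in the real config; illustrative names only
    columns => ["timestamp", "src_ip", "dst_ip", "bytes", "..."]
  }
}
```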
Data averages ~400 bytes/event, and the sources send it through at approx. 160Mb/s.
I’m finding that this system cannot keep up: log buffers are building up on the data source devices. When Logstash is on, TCP window-resize requests (shrinking the window to a few hundred bytes) arrive at the data sources, yet I find nothing in the logs about having to throttle incoming data. Logstash is ingesting at approx. 40Mb/s.
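To put numbers on the gap (assuming "Mb/s" means megabits per second):

```python
# Back-of-envelope event rates at ~400 bytes/event.
EVENT_BYTES = 400

def events_per_sec(mbits_per_sec: float) -> float:
    """Convert a megabit/s line rate to events/s at 400 B/event."""
    return mbits_per_sec * 1_000_000 / 8 / EVENT_BYTES

required = events_per_sec(160)  # incoming rate from the sources
achieved = events_per_sec(40)   # what Logstash actually sustains

print(f"required: {required:,.0f} events/s")  # 50,000
print(f"achieved: {achieved:,.0f} events/s")  # 12,500
```

So the cluster needs to sustain roughly 50,000 events/s but is managing about 12,500.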
When I turn Logstash off and listen with ncat, dumping straight to /dev/null, the backlog disappears because throughput skyrockets.
CPU usage hovers around 70%.
I have tried reducing the 66 fields to 8, which reduced ingest time by about a third in my lab.
Does anything here jump out as something that should be done differently? Do I have enough hardware?