Ingestion performance issues - where to start?

Hi all,

I'm having ingestion performance issues that I haven't gotten to the bottom of, and I'm quite new to the elastic stack, so I thought I'd seek advice here.

I have a cluster of 3 VMs (4 CPU/64GB RAM/500GB disk). RHEL 7.

Elasticsearch 7.8.0 is installed on all of them and configured as a cluster (transport encrypted, HTTP not encrypted). Heap size is 26GB, usually around 50% utilised.

The index has 3 primary shards with 2 replicas each (high availability was a priority).
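
For reference, that layout corresponds to index settings along these lines (the index name here is just a placeholder):

# each of the 3 primaries gets 2 replicas, so every event is written to 3 shards in total
curl -X PUT 'localhost:9200/my-index' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}'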

Logstash 7.8.0 is also installed on every box, with its output pointing at Elasticsearch on the same box. Heap size is 4GB, with 8 pipeline workers and a batch size of 125.
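
Those Logstash settings live in jvm.options and logstash.yml; a minimal sketch, assuming defaults everywhere else:

# jvm.options
-Xms4g
-Xmx4g

# logstash.yml
pipeline.workers: 8
pipeline.batch.size: 125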

Logs are round-robined through a VIP to each of the boxes.

The Logstash filter uses csv to pull out 66 fields.
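
A trimmed-down sketch of that filter (the column names below are illustrative; the real config lists all 66 in order):

filter {
  csv {
    separator => ","
    # the real config enumerates all 66 column names here
    columns => ["timestamp", "src_ip", "dst_ip", "action", "bytes"]
  }
}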

Data averages ~400 bytes/event, and the sources send it through at approx. 160Mb/s.

I’m finding that this system cannot keep up, and log buffers are building up on the data source devices. When Logstash is turned on, the data sources see the TCP receive window shrink to a few hundred bytes. However, I can't find anything in the Logstash logs about it having to throttle incoming data. Logstash is ingesting at approx. 40Mb/s.
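
For what it's worth, per-pipeline and per-plugin event counts and timings can be pulled from the Logstash monitoring API (9600 is the default API port):

# shows events in/out and time spent per filter/output plugin
curl -s 'localhost:9600/_node/stats/pipelines?pretty'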

When I turn Logstash off and instead listen with ncat dumping straight to /dev/null, the backlog disappears because throughput skyrockets.
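
(The ncat baseline was just a raw TCP sink along these lines; the port here is only an example:)

# listen, keep accepting connections, discard everything
ncat -lk 5514 > /dev/null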


CPU usage hovers around 70%.

I have tried reducing the 66 fields to 8, which reduced ingest time by about a third in my lab.

Does anything in this setup jump out as something that should be done differently? Do I have enough hardware?

The first thing to do is to try to determine whether Elasticsearch is the bottleneck. I often start by looking at storage, as indexing can be very I/O intensive, and it sounds like CPU and heap usage are OK here. Are you using local SSDs for storage? If not, what does iostat -x show on the nodes during indexing?

Thanks, Christian. I'm not using SSDs.

# iostat -x
Linux 3.10.0-957.el7.x86_64 ()        08/21/2020    _x86_64_ (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          27.14    6.88    2.71    9.78    0.00   53.49

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00   460.67    0.00  311.51     0.00 12809.37    82.24     0.84    2.78    2.96    2.78   1.23  38.34

Have you considered running Logstash on separate VMs from Elasticsearch? Not that you can't co-locate them, but they will be competing for resources. Maybe I'm not understanding your architecture.

And as Christian asked, what kind of storage are you using?

Also, hardware-wise: even though CPU says 70%, 4 CPUs for 64GB RAM plus Logstash feels light to me. You are also writing 3 copies of the data (1 primary + 2 replicas).

Agreed. I bet it's Logstash: that high CPU might mean you are maxing out a thread or two that's doing the work, likely in LS, though possibly in ES (the ES queues would show if it's the bottleneck there). I/O looks very good. My bet is that the CSV processing and the 66 fields are tying up LS too much.
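
A quick way to see the ES side is the thread pool cat API; for example, something like this shows queue depth and rejections for the write pool on each node:

# non-zero "rejected" or a persistently full queue would point at ES
curl -s 'localhost:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected,completed'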

GREAT up-front description, by the way, listing all the nodes, RAM, heaps, data rates & sizes, and so on.


One minor additional comment: the dissect filter is much more performant than the csv filter in Logstash.
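
A hypothetical sketch of that swap for a comma-separated line (field names are placeholders; the real mapping would list every column in order):

filter {
  dissect {
    # one %{field} token per delimited column, in the order they appear
    mapping => { "message" => "%{timestamp},%{src_ip},%{dst_ip},%{action},%{bytes}" }
  }
}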
