What bottleneck am I hitting?!

This has been driving me batty for a couple of days. I can't get my ELK stack to index above 36k events per second, no matter how I try to scale horizontally.

The machines: 3 master nodes (4-core, 14g mem Azure VMs), 2 client nodes (8-core, 56g mem VMs), and anywhere from 4 to 20 data nodes (20-core, 140g mem Azure VMs).
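For clarity, the roles are presumably split in the usual ES 2.x way; a minimal sketch of the relevant elasticsearch.yml lines per node class (values illustrative, not copied from my cluster):

```
# Master-eligible nodes (no data)
node.master: true
node.data: false
discovery.zen.minimum_master_nodes: 2   # quorum of the 3 masters

# Client / coordinating-only nodes
node.master: false
node.data: false

# Data nodes (do the actual indexing)
node.master: false
node.data: true
```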

I've run the data nodes with ext4 and with ZFS, using both multiple individual disks and a single RAID0. The winner seems to be a ZFS RAID0 of Azure SSD drives, with a local SSD cache on the machine and 70g of memory for ZFS to play with. Running zpool iostat (or plain iostat for ext4) shows the drives are rarely under heavy write load - usually only 5-10 MB/s each, with occasional peaks to 40+ MB/s every few minutes. So I don't think disk is the issue; if it were, I'd expect higher iostat utilisation and more time spent in wait state in top. Maybe I'm wrong?
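(For reference, those numbers come from watching something like the following; the pool name is a placeholder:)

```
zpool iostat -v tank 5    # per-vdev read/write throughput on the ZFS pool
iostat -xm 5              # per-device MB/s, %util and await for the ext4 case
```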

No VM runs above 20% CPU usage, and load isn't high either. The data nodes have 31g of mlocked heap, but it seems to be under control.
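(That's set up the standard ES 2.x way, roughly as below; the file locations are the stock packaging defaults, nothing exotic. 31g keeps the heap under the ~32g compressed-oops cutoff.)

```
# /etc/default/elasticsearch (or wherever the service environment is set)
ES_HEAP_SIZE=31g

# elasticsearch.yml
bootstrap.mlockall: true
```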

I'm sending data from logstash using the elasticsearch output. It doesn't seem to matter how many logstash processes I use. With the null output, I can get logstash to process 90k messages per second or higher - but as soon as I swap in a stock elasticsearch output, everything adds up to ~35k. I've tried anywhere from 5 to over 20 logstash instances, with no difference in elasticsearch's indexing performance.
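For context, the comparison is between outputs along these lines (the hosts are placeholders); with the null output the pipeline itself clearly isn't the limit:

```
# Baseline test: throw events away to measure raw pipeline throughput (~90k/s)
output { null { } }

# Real output: indexing tops out at ~35k/s no matter how many instances run this
output {
  elasticsearch {
    hosts => ["client-node-1:9200", "client-node-2:9200"]
  }
}
```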

What am I missing? I've played with the number of shards, and currently run two shards per data-node VM per index. A larger number of shards decreased performance. I suppose I can try tweaking this further.
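(With, say, 4 data nodes that works out to an 8-shard template, roughly like this; the template name and index pattern are made up for the example:)

```
curl -XPUT 'http://localhost:9200/_template/ipfix' -d '{
  "template": "ipfix-*",
  "settings": {
    "number_of_shards": 8,
    "number_of_replicas": 1
  }
}'
```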

I've tried playing with threadpool.bulk.size and queue_size, but that just made things worse. I tried giving more indexing buffer via indices.memory.index_buffer_size, but again, that didn't really help.
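(Those settings live in elasticsearch.yml on the data nodes; ES 2.x names shown below, with example values rather than the ones that made things worse:)

```
# elasticsearch.yml (ES 2.x) - bulk threadpool and indexing buffer tuning
threadpool.bulk.size: 20              # defaults to the number of processors
threadpool.bulk.queue_size: 200       # default is 50 in 2.x
indices.memory.index_buffer_size: 30% # default is 10% of heap
```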

I'm well and truly stumped. Until I hit this ceiling, it was easy to scale horizontally, but it doesn't seem to matter how many machines I throw at this problem.


And yes, I've played with -w and -b with logstash too.
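(That is, the Logstash 2.x pipeline flags, along these lines; the config path is just a placeholder:)

```
# -w = pipeline workers, -b = events per worker batch (values illustrative)
bin/logstash -f /etc/logstash/conf.d/ipfix.conf -w 8 -b 5000
```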

And yes, I've set a longer refresh rate on the indices.
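(Set per index or via a template, something like this; the index pattern is a placeholder and 30s is just an example versus the 1s default:)

```
curl -XPUT 'http://localhost:9200/ipfix-*/_settings' -d '{
  "index": { "refresh_interval": "30s" }
}'
```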

Version numbers?

Have you checked if the network is saturated?
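For example, something like this on the data and client nodes would show it (the interface name is a guess, and sar needs the sysstat package installed):

```
sar -n DEV 5      # per-interface rx/tx throughput
iftop -i eth0     # live per-connection bandwidth
```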

Latest versions of everything - 2.3.3. This is all in Azure, in the same region.

Also, this is time series data. (IPFIX)

What he said. Are you network bound?

I've tried anywhere from 5 to over 20 logstash instances, with no difference in elasticsearch's indexing performance.

If I understand correctly, then, when you run 20 instances, each instance has a quarter of the throughput it has when you run 5. How are the instances deployed at each end of that scale - how many instances per host?

Anywhere from 4 to 20 Data nodes

And with 4 data nodes, you get 36kEPS, and 20 data nodes, 36kEPS.

Have you tried 2 more client nodes?

Yes, with 4 data nodes or 20, I get ~36k. If I add logstash instances, either on separate VMs or as additional processes per VM, then yes, at 20 instances they each have 1/4th the throughput they have at 5. Currently I'm way overbuilt on logstash indexers because I want to scale up, so I have 25 16-core VMs sitting basically idle. If I can get elasticsearch to scale, I expect to need that many logstash indexers. (Yes, I have a lot of data to index.)

I'm going to start doing packet captures on the nodes to see if I can turn up anything on the networking side. This really has me stumped, as I've built something like this before that did scale, but that was ES 1.7.
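(Probably something like this per node, filtered to the Elasticsearch ports; the interface and output path are placeholders:)

```
# 9200 = HTTP/bulk traffic from logstash, 9300 = inter-node transport
tcpdump -i eth0 -s 0 -w /tmp/es-ingest.pcap 'port 9200 or port 9300'
```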

Have you tried 2 more client nodes?

Have you checked the ES logs? Merge throttling is a frequent culprit for me.
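A quick way to check, assuming the default log location (the exact message text varies a bit by version):

```
grep -i "throttling indexing" /var/log/elasticsearch/*.log
curl 'http://localhost:9200/_nodes/stats/indices?pretty'   # look at the merges section
```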

Here's what worked for me. I gave logstash 7g of heap and used -w 16 -b 25000. I then used that 16 * 25000 (400,000) as my flush size in the elasticsearch output, with an idle flush time of 30 seconds. I also increased the bulk threadpool queue size to 200. Those changes took me from 36k/s to over 500k/s. Those are pretty small messages, granted, but it's well over 10x the performance.
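Roughly, that setup looks like the sketch below. The hosts and file names are placeholders, and the options shown are the 2.x-era logstash-output-elasticsearch ones (flush_size / idle_flush_time):

```
# Logstash startup (2.x): 7g heap, 16 pipeline workers, 25000 events per batch
# LS_HEAP_SIZE=7g bin/logstash -f ipfix.conf -w 16 -b 25000

output {
  elasticsearch {
    hosts           => ["client-node-1:9200", "client-node-2:9200"]
    flush_size      => 400000   # 16 workers * 25000 batch size
    idle_flush_time => 30       # seconds
  }
}

# And on the Elasticsearch data nodes, in elasticsearch.yml:
#   threadpool.bulk.queue_size: 200
```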


That's quite a large increase. I take it your ES nodes were under-utilised as well?