What bottleneck am I hitting?!

Janet · July 6, 2016, 6:31pm

This has been driving me batty for a couple of days. I can't get my ELK stack to index above 36k per second, no matter how I try to scale horizontally..

The machines: 3 Master nodes (4 core 14g mem Azure VMs), 2 Client nodes, (8 core, 56g VM), Anywhere from 4 to 20 Data nodes (20 core, 140g mem Azure VMs).

I've run data nodes with ext4 and with ZFS, using both the multiple disks and a single raid0. The winner seems to be ZFS RAID0 with Azure SSD drives, with a local SSD cache on the machine, and 70g of memory for ZFS to play with. Running zpool iostat, or normal iostat for ext4, shows that the drives are rarely under high write rates - usually only 5-10M/s each, but with rare peaks to 40+ each (every few minutes). So I don't think disk is the issue, if it was I'd expect higher iostat usage, and higher wait state in top. Maybe I'm wrong?

No VM runs above 20% CPU usage. Load isn't high either. The data nodes have 31g of mlocked hEAP, but it seems to be under control.

I'm sending data from logstash using the elasticsearch output. It doesn't seem to matter how many logstash processes I use. If I use the null output, I can get logstash to process at 90k messages or higher - as soon as I put a stock elasticsearch output, everything adds up to ~35k. I've tried anywhere fro 5 to over 20 logstash instances, with no difference in elasticsearch's indexing performance.

What am I missing? I've played with number of shards, and currently run with two shards per VM per index. A large number of shards decreased performance. I suppose I can try tweaking this further.

I've tried playing with threadpool.bulk.size and queue_size, but just made things worse. I tried giving more index buffer using indices.memory.index_buffer_size, but again, that didn't really help.

I'm well and truly stumped. Until I hit this ceiling, it was easy to scale horizontally, but it doesn't seem to matter how many machines I throw at this problem.

Janet · July 6, 2016, 6:35pm

And yes, I've played with -w and -b with logstash too.

Janet · July 6, 2016, 7:56pm

And yes, I've set a longer refresh rate on the indices.

Glen_Smith · July 6, 2016, 9:39pm

Version numbers?

jprante · July 6, 2016, 10:07pm

Have you checked if the network is saturated?

Janet · July 6, 2016, 11:02pm

Latest versions of everything - 2.3.3. This is all in Azure, in the same region.

Janet · July 6, 2016, 11:05pm

Also, this is time series data. (IPFIX)

Glen_Smith · July 7, 2016, 1:08am

What he said. Are you network bound?

I've tried anywhere fro 5 to over 20 logstash instances, with no difference in elasticsearch's indexing performance.

If I understand correctly, then, when you run 20 instances, each instance has 1/4 the throughput as when you run 5. How do you have the instances deployed at both ends of the scale, instances per host?

Anywhere from 4 to 20 Data nodes

And with 4 data nodes, you get 36kEPS, and 20 data nodes, 36kEPS.

Have you tried 2 more client nodes?

Janet · July 7, 2016, 2:05am

Yes, with 4 data nodes or 20, I get ~36k. If I add logstash instances, either with different VMs or additional processes per VM, then yes, for 20 instances they'll have 1/4th the throughput of 5. Currently I'm way overbuilt on logstash indexers, because I want to scale up - so I have 25 16-core VMs basically idle. If I can get elasticsearch to scale, I expect to need that many logstash indexers. (Yes, I have a lot of data to index)

I'm going to start doing packet captures on the nodes to see if I can turn up anything on the networking side. This really has me stumped, as I've built something like this before that did scale, but that was es 1.7.

Glen_Smith · July 7, 2016, 2:07am

Have you tried 2 more client nodes?

unknownunknown · July 13, 2016, 7:48pm

Have you checked the ES logs? Merge throttling is a frequent culprit for me.

Janet · July 14, 2016, 6:25am

Here's what worked for me. I gave logstash 7g of heap, and used -w 16 -b 25000. I then used that 16*25000(400000) as my flush size in the elasticsearch output with an idle flush time of 30 seconds. I also increased the bulk threadpool queue size to 200. Those changes took me from 36k/s to over 500k/sec. That's pretty small messages, granted, but well over 10x performance.

warkolm · August 4, 2016, 5:30am

That's quite a large increase, I take it your ES nodes were under utilised as well?

Topic		Replies	Views
Cluster from virtual machines Elasticsearch	5	770	July 5, 2017
Real world elastic sizing Elasticsearch	5	938	July 5, 2017
How do i know where is the bottleneck in ELK Elastic Stack	4	507	November 4, 2022
Improving Elasticsearach ingest capacity Elasticsearch	7	111	June 20, 2024
ElasticSearch performance trouble when indexing data Elasticsearch	11	4345	April 28, 2021

What bottleneck am I hitting?!

Related topics