This has been driving me batty for a couple of days. I can't get my ELK stack to index above roughly 36k events per second, no matter how I try to scale horizontally.
The machines:
- 3 master nodes (4-core, 14 GB RAM Azure VMs)
- 2 client nodes (8-core, 56 GB RAM VMs)
- anywhere from 4 to 20 data nodes (20-core, 140 GB RAM Azure VMs)
I've run the data nodes with ext4 and with ZFS, using both multiple individual disks and a single RAID0. The winner seems to be ZFS RAID0 on Azure SSD drives, with a local SSD cache on the machine and 70 GB of memory for ZFS to play with. Running zpool iostat (or plain iostat for ext4) shows the drives are rarely under heavy write load - usually only 5-10 MB/s each, with occasional peaks to 40+ MB/s every few minutes. So I don't think disk is the issue; if it were, I'd expect higher iostat numbers and more time spent in wait state in top. Maybe I'm wrong?
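For what it's worth, this is roughly how I'm watching the disks (5-second intervals, nothing fancy):

    # ZFS nodes: per-vdev read/write bandwidth every 5 seconds
    zpool iostat -v 5

    # ext4 nodes: extended per-device stats in MB/s (includes %util and await)
    iostat -xm 5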
No VM runs above 20% CPU usage, and load isn't high either. The data nodes have 31 GB of mlocked heap, but it seems to be under control.
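In case it matters, heap and mlock are set the standard way - a sketch of what I have (where the env var lives depends on the install; this is the pre-5.x style mlockall setting):

    # environment (e.g. /etc/default/elasticsearch or /etc/sysconfig/elasticsearch)
    ES_HEAP_SIZE=31g

    # elasticsearch.yml
    bootstrap.mlockall: true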
I'm sending data from Logstash using the elasticsearch output. It doesn't seem to matter how many Logstash processes I use: with the null output I can get Logstash to process 90k messages per second or higher, but as soon as I switch to a stock elasticsearch output, everything adds up to ~35k. I've tried anywhere from 5 to over 20 Logstash instances, with no difference in Elasticsearch's indexing performance.
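The output block is essentially stock, something along these lines (hostnames are placeholders, and the exact option names depend on the plugin version; everything else is left at its defaults):

    output {
      elasticsearch {
        # point at the two client nodes
        hosts => ["client-node-1:9200", "client-node-2:9200"]
        index => "logstash-%{+YYYY.MM.dd}"
        # workers, flush_size, etc. left at the plugin defaults
      }
    }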
What am I missing? I've played with the number of shards and currently run two shards per data node per index; a larger number of shards decreased performance. I suppose I can try tweaking this further.
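The shard count is applied through an index template, roughly like this (the template name and the count of 8 - two primaries per data node on a 4-node cluster - are illustrative; I bump it when I add data nodes):

    curl -XPUT 'http://client-node-1:9200/_template/logstash' -d '{
      "template": "logstash-*",
      "settings": {
        "number_of_shards": 8
      }
    }'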
I've tried playing with threadpool.bulk.size and queue_size, but that just made things worse. I tried giving indexing more memory via indices.memory.index_buffer_size, but again, that didn't really help.
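Concretely, these are the kinds of values I experimented with in elasticsearch.yml (the numbers are examples of what I tried, not recommendations):

    # bulk thread pool (defaults: size = number of cores, queue_size = 50)
    threadpool.bulk.size: 20
    threadpool.bulk.queue_size: 500

    # indexing buffer (default is 10% of heap)
    indices.memory.index_buffer_size: 30%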
I'm well and truly stumped. Until I hit this ceiling, it was easy to scale horizontally, but it doesn't seem to matter how many machines I throw at this problem.