I am trying to do some performance tuning on my environment.
Some initial details:
- 5 Node cluster - VMs
- Each VM has 8 cores
- Each VM has 16GB of RAM
- One Elasticsearch node per VM. Each ES node has 8GB of RAM allocated to the JVM and writes to a 'local' disk (as local as I can get from the internal cloud team where I work).
I am collecting from 50-60 servers; the Logstash shippers on those servers send to one of 3 Redis instances via a VIP, and I get a fairly even distribution of data. The Redis instances are not clustered.
For each Redis instance I have 1 Logstash indexer that pulls from that one Redis, runs the filtering logic on the records, and inserts them into Elasticsearch. The elasticsearch output plugin is configured with all 5 nodes in the 'hosts' parameter.
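For reference, each indexer pipeline is roughly shaped like the sketch below. The host names, Redis key, and tuning values are placeholders rather than my exact config, and the option names assume the Logstash 2.x-era redis input and elasticsearch output plugins:

```
input {
  redis {
    host        => "127.0.0.1"   # the co-located Redis instance
    data_type   => "list"
    key         => "logstash"    # placeholder key name
    threads     => 4             # the 'threads' option mentioned below
    batch_count => 125           # pull events in batches instead of one at a time
  }
}

filter {
  # grok / mutate / date filtering logic lives here
}

output {
  elasticsearch {
    hosts      => ["es1:9200", "es2:9200", "es3:9200", "es4:9200", "es5:9200"]
    workers    => 4              # parallel output workers
    flush_size => 2000           # events per bulk request
  }
}
```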
The 3 Redis instances are on different servers, but they are co-located on the Elasticsearch nodes. The same goes for the Logstash indexers.
--------------  --------------  --------------  --------------  --------------
|            |  |            |  |            |  |            |  |            |
|    ____    |  |    ____    |  |    ____    |  |            |  |            |
|   | Rd |   |  |   | Rd |   |  |   | Rd |   |  |            |  |            |
|   |____|   |  |   |____|   |  |   |____|   |  |            |  |            |
|    ____    |  |    ____    |  |    ____    |  |            |  |            |
|   | LS |   |  |   | LS |   |  |   | LS |   |  |            |  |            |
|   |____|   |  |   |____|   |  |   |____|   |  |            |  |            |
|    ____    |  |    ____    |  |    ____    |  |    ____    |  |    ____    |
|   | ES |   |  |   | ES |   |  |   | ES |   |  |   | ES |   |  |   | ES |   |
|   |____|   |  |   |____|   |  |   |____|   |  |   |____|   |  |   |____|   |
|            |  |            |  |            |  |            |  |            |
--------------  --------------  --------------  --------------  --------------
As you can see from the diagram, I have 3 servers where processes are co-located. That is something that will change soon.
The problem I've been having is that I can't seem to break past an average of about 1200-1500 messages per second on my indexing rate.
I had a large amount of volume come in a few days ago as part of some testing, and I ended up with about 5-6 million log messages sitting across my 3 Redis instances waiting to be indexed. At the time I only had 4 cores on each machine. Since then I have grown each machine from 4 to 8 cores and have set up much better monitoring of my server stats to try to find any bottlenecks.
I tried many different things, but no matter what, nothing seemed to affect the ingestion rate:
- I added the 'threads' option to the redis input block (as in the pipeline sketch above).
- I took another server (not in the diagram above) and stood up more Logstash indexers to pull from the different Redis instances.
- I read through the Elastic guide for tuning ES indexing performance: https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html
- I made sure there was no swapping and increased the index thread pool (see the settings sketch below).
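The ES-side settings I'm referring to look roughly like this. The key names assume the 1.x/2.x-era config (newer versions renamed bootstrap.mlockall to bootstrap.memory_lock and threadpool.* to thread_pool.*), and the values are just illustrative:

```
# elasticsearch.yml
bootstrap.mlockall: true               # lock the heap in RAM so the JVM can't be swapped out

threadpool.index.queue_size: 500       # larger queue for the index thread pool
                                       # (the pool size itself defaults to the core count)

indices.memory.index_buffer_size: 20%  # optional: more heap for the indexing buffer (default 10%)
```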
I understand that this post doesn't provide enough information for someone to really and truly diagnose my problems, but I felt like it would be good to get it all written down, at least for my own benefit. And, maybe someone can point out something I'm just overlooking.
In the meantime I'm continuing to gather metrics on each piece of my stack to find the bottlenecks.
Some questions I have:
Redis - Redis enables RDB snapshotting (the 'save' points) by default, and AOF can be turned on as well. For people who are using Redis as a buffer and getting high indexing rates, is this something you turn off? I have noticed that when Redis does a snapshot save, it stops reads/writes for a short time until the snapshot is complete.
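In other words, is everyone who treats Redis purely as a buffer running it with persistence off entirely, something along these lines in redis.conf (accepting that a crash loses whatever is queued but not yet indexed)?

```
# redis.conf
save ""           # disable RDB snapshotting completely
appendonly no     # AOF off (this is the default)
```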
There are a lot of write-ups about how to maximize disk I/O, but in a corporate environment where you really don't have much insight into how the disks are set up, how do you troubleshoot that? I have monitors set up with Nagios and Check_MK to watch I/O reads/writes per VM, but I don't know how to interpret that information. If the storage is a SAN, then technically my data could be striped across many disks anyway, so should I still expect regular, single-spinning-disk write speeds?
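Right now about the best I can do is sample the block devices from inside the VM and watch per-request latency, e.g. (assuming the sysstat and iotop packages are available):

```
iostat -x 5    # 'await' = avg ms per request; '%util' near 100 = device saturated
iotop -o       # confirm it's actually the ES java process doing the writes
```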
When I looked at the CPU utilization during the heavy load times, I didn't see any CPUs being maxed. Utilization was hitting maybe 70-80% on average, which to me doesn't mean 'maxed'. At the time they were 4-core machines, and the load average on some of them was spiking up to around 5 every so often, but the 10-15 minute load averages were around 3.
My RAM usage doesn't seem unhealthy. I am running 8GB JVM heaps on 16GB boxes. This is one area where I understand I may really get a performance boost once I move Redis and Logstash off of the ES nodes and the filesystem cache is dedicated to what ES is doing, but I can't imagine that giving me 2-3 times the performance.