I was recently reviewing some of the logs being ingested into my stack -- some of them carry a timestamp two weeks older than the time they were finally ingested. To me, this indicates roughly two weeks of latency, most likely in Logstash.
To remedy this, I'm thinking about adding more Logstash instances, but I don't have an infinite supply of machines. I looked at two of my storage nodes, which have 24 CPUs each yet never exceed about 10% utilization.
Would it be unwise to take advantage of that idle CPU by running Logstash on the same boxes? Any other ideas for increasing pipeline throughput? A two-week latency makes this stack useless.
I'm not yet doing any queueing as you've mentioned; my Logstash hosts are load-balanced from the client side -- Filebeat is given the hostnames of all LS instances and loadbalance is set to true.
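For reference, the relevant bit of filebeat.yml looks roughly like this (hostnames and port are placeholders):

output:
  logstash:
    hosts: ["ls-host-1:5044", "ls-host-2:5044", "ls-host-3:5044", "ls-host-4:5044"]
    loadbalance: true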
Is this non-ideal? I can set up a Redis instance or similar if you think that's wise. I already have four LS instances running concurrently for this stack.
As a side note, I'm seeing indexing latency of only 10ms, so ES doesn't appear to be the bottleneck, as you've likely already established.
It might be good enough but I would probably not call it ideal. Does it distribute the load sufficiently evenly amongst the Logstash hosts?
What's the bottleneck here? Are the CPUs on the Logstash hosts saturated? If no, can you increase the number of pipeline workers? What kind of event rate are you getting per host?
Before I build this all out, is Redis the Elastic-recommended queueing service?
I don't know if Elastic recommends any product in this area.
We do not necessarily recommend any particular message queue. Different message queues have different properties, so which one is suitable depends on your requirements and also to some extent what you already are using or have experience with.
We see a lot of Redis used for smaller installations as it is easy to use, but as it holds data in memory you can lose data if it crashes, and it may also fill up RAM if there is an outage further down the pipeline. I have also seen RabbitMQ used, and lately Kafka has become very popular, especially for larger deployments. Logstash has plugins that make it possible to use other message queues than the ones mentioned.
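As an illustration of the Redis variant, the broker just sits between the shipping and indexing tiers; a minimal sketch, with the host and key names as placeholders:

# shipper side
output {
  redis { host => "redis-broker" data_type => "list" key => "filebeat" }
}

# indexer side
input {
  redis { host => "redis-broker" data_type => "list" key => "filebeat" }
}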
The CPUs are definitely not saturated; these nodes have 24 CPUs each and are way underutilized, rarely reaching even 100% (i.e. about one core's worth).
If no, can you increase the number of pipeline workers?
How/where do I increase the pipeline workers? I saw a command I can run to start Logstash with a non-standard number of workers, but this should persist through reboots, service restarts, etc.
What kind of event rate are you getting per host?
It's pretty poor right now -- about 500/s with 4 Logstash hosts, though two of them are also dedicated ES master nodes. The messages can be pretty long and complex, but that's still a pretty shabby number. Under a similar setup I used to receive 1200-1600 on average, so something must have changed to really bring that number down.
I'm somewhat convinced Logstash is the bottleneck, as the indexing latency in ES is only about 7ms. The only other potential bottleneck I can see would be the client nodes being monitored. They're pretty beefy boxes as well, with 24 CPUs and 128 GB of memory. I'm primarily using Filebeat on those machines and am planning to upgrade to Filebeat 5.0 today. I suppose I'll also raise the number of workers to 24 or so on those boxes.
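If those workers end up being the Logstash output workers in filebeat.yml, the setting would look roughly like this (value purely illustrative):

output:
  logstash:
    worker: 4   # publisher workers per configured Logstash host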
I'd prefer not to use Redis or Kafka, since it's another point of failure and another thing to maintain, but if these worker settings don't change much, I'll likely implement a queueing service to reduce latency.
Thanks @Christian_Dahlqvist. I have some coworkers on site here who really like Kafka, so that may be how I proceed if I find the need for a queue, although Redis seems to have more documentation around Logstash integration.
How/where do I increase the pipeline workers? I saw a command I can run to start Logstash with a non-standard number of workers, but this should persist through reboots, service restarts etc..
/etc/sysconfig/logstash (RPM) or /etc/default/logstash (Debian) lets you change the default startup options. This might work differently with systemd.
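Concretely, the (normally commented-out) LS_OPTS line in that file is the place to put extra startup flags, something along these lines (flag names per Logstash 2.x, values purely illustrative):

# -w sets the number of pipeline workers, -b the pipeline batch size
LS_OPTS="-w 8 -b 250"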
It's pretty poor right now -- about 500/s with 4 logstash hosts, however two of them are also dedicated ES master nodes.
Yeah, 500 events/s is not impressive for four Logstash instances.
To find the bottleneck, isolate the different systems. For example, let Logstash send events to /dev/null. How does that affect the event rate?
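One quick way to do that (just one option, not necessarily the best) is to temporarily replace the elasticsearch output with a file output pointed at /dev/null and compare event rates:

output {
  file { path => "/dev/null" }
}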
@magnusbaeck Is there a setting in that file to modify the number of pipeline workers? I see settings for other parameters but not for pipeline workers:
###############################
# Default settings for logstash
###############################
# Override Java location
#JAVACMD=/usr/bin/java
# Set a home directory
#LS_HOME=/var/lib/logstash
# Arguments to pass to logstash agent
#LS_OPTS=""
# Arguments to pass to java
#LS_HEAP_SIZE="1g"
#LS_JAVA_OPTS="-Djava.io.tmpdir=$HOME"
# pidfiles aren't used for upstart; this is for sysv users.
#LS_PIDFILE=/var/run/logstash.pid
# user id to be invoked as; for upstart: edit /etc/init/logstash.conf
#LS_USER=logstash
# logstash logging
#LS_LOG_FILE=/var/log/logstash/logstash.log
#LS_USE_GC_LOGGING="true"
#LS_GC_LOG_FILE=/var/log/logstash/gc.log
# logstash configuration directory
#LS_CONF_DIR=/etc/logstash/conf.d
# Open file limit; cannot be overridden in upstart
#LS_OPEN_FILES=16384
# Nice level
#LS_NICE=19
# If this is set to 1, then when `stop` is called, if the process has
# not exited within a reasonable time, SIGKILL will be sent next.
# The default behavior is to simply log a message "program stop failed; still running"
KILL_ON_STOP_TIMEOUT=0
Is there a specific setting I can add here to set the number of pipeline workers?
@Christian_Dahlqvist what config would be beneficial to provide you with? My LS filters, etc.? I ask because that's quite a bit of info. The primary use of this ELK stack is to monitor Nginx logs, which are pretty heavily parsed so that every part of the log gets indexed, including the verbose URL embedded in each event. I actually remove the message field after successful parsing to save space, but the average event is still somewhere around 10-20 KB.
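To give a rough idea in the meantime, a heavily simplified version of the core parsing step looks something like this (the real config has many more filters, and the pattern below just assumes a default combined-format access log):

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  if "_grokparsefailure" not in [tags] {
    # drop the original line once it has been parsed successfully, to save space
    mutate { remove_field => [ "message" ] }
  }
}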
I am not sure exactly what I am looking for, but it would be useful to see an example event, what you want to parse it into as well as get an idea about which filters you use to parse it. It would be great if you could share this, e.g. through a Gist, so we can get a better idea about the complexity. I have on occasion seen inefficient use of filters, e.g. over-reliance on grok when other filters would be more efficient, cause significant slowdown of the pipeline.
@magnusbaeck thank you! I'm tuning the batch and worker parameters now; it seems to provide somewhat higher throughput, but I still need to make some changes.
The current entry for batch and pipeline workers:
LS_OPTS="-b 8000 -w 48"
This puts the node in a state where ~12 GB of memory is consumed but still only ~100% CPU. On a machine capable of 1600% CPU, my Logstash config clearly still needs some tuning. That said, this already improves performance greatly.
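(For what it's worth, the in-flight event count is roughly batch size times workers, so -b 8000 -w 48 allows up to 8000 x 48 = 384,000 events in flight at once; at 10-20 KB per event that is several GB on its own, which would explain most of that ~12 GB.)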
@Christian_Dahlqvist I'll compile everything into a Gist this evening and provide it to you. I currently have >30 filters on this stack and can likely clean those up before sending to you.