We're running Logstash on a single server to compare its performance to Heka and decide whether we want to use one or the other. We have some high-traffic web servers whose log lines are sent to a Kafka cluster of 3 nodes. With Heka, all log messages come in without a problem, but with Logstash we see enormous gaps in the recorded log messages: gaps of more than 10 minutes.
To minimize the load on the web servers, Logstash does essentially no parsing there. The only things it adds to a log line are the host name and a delimiter character. The Logstash configuration:
input {
  file {
    path => "/var/log/nginx/access.log"
    start_position => "beginning" # if we haven't seen the file before
  }
}

output {
  kafka {
    broker_list => "kafka1:9092,kafka2:9092,kafka3:9092"
    topic_id => "log_stats"
    request_required_acks => 1
    codec => plain {
      format => '%{host}¦%{message}'
    }
    producer_type => async
  }
}
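With that plain codec, each message published to Kafka should look roughly like the line below (the host name and nginx log line are invented for illustration):

webserver-01¦203.0.113.7 - - [18/Aug/2015:11:01:18 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0"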
The only overridden default setting is the heap size:
So far we haven't seen any errors whatsoever in /var/log/logstash/*. Even with the log level lowered to info we don't see any alarming messages.
The number of log lines added to the monitored files is 6M per hour, i.e. 100k per minute. The server in question has 20 cores and 128 GB of RAM. The load is quite low: the active Logstash thread uses about 5-13% of a single CPU core and 0.5% of the RAM.
Is there something in this setup that's completely off, causing these long periods of inactivity? We've manually checked the access logs to confirm that there is activity during the gaps.
Are you seeing the gap in Elasticsearch? How many partitions are on the topic, and how many consumer threads are there? If you have a mismatch of threads to partitions, some partitions may not be serviced immediately: a single consumer thread busy on partitions that are full of data won't get to the others until it does a rebalance. You can see whether you are getting partition starvation by checking the consumer offsets on those partitions for a particular consumer group, using a tool shipped with the Kafka binaries: bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker. I suspect this is similar to https://github.com/logstash-plugins/logstash-input-kafka/issues/32; check out that thread for a more detailed discussion of the issue.
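For example, a check along these lines (the ZooKeeper address and consumer group name are placeholders; substitute your own) prints the offset, log size, lag, and owner per partition:

bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker \
  --zookeeper zk1:2181 \
  --group logstash \
  --topic log_stats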
At the start is the time at which Logstash created the message. After the domain follows the nginx log line, which starts with the Unix timestamp in seconds with millisecond precision (1439895678.299). This timestamp translates to
Tue Aug 18 11:01:18 UTC 2015
That is a half-hour difference between the line logged by nginx and the corresponding message created by Logstash!
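For reference, the conversion can be verified on any machine with GNU date:

date -u -d @1439895678
Tue Aug 18 11:01:18 UTC 2015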
The gaps we see are always at the end of the hour, e.g. between 11:45 and 12:00. Considering we have an hourly log rotation, it's clear that the gaps are caused by log rotation combined with Logstash lagging behind.
I'm puzzled by this behavior, because the Logstash process doesn't seem busy: it doesn't use much CPU (as written above, 5-13% of a single core) and it doesn't use the full heap size it was assigned. Any ideas what may be causing this delay?
Looking at your offsets, the lag is what you are interested in. IMO lags in the 1000s are a little high, but if it just hovers around that level you are okay. Guessing from your Owner row, it appears you have two or three Logstash instances that read from Kafka and submit the logs to Elasticsearch, each with three threads. What type of boxes are those, specifically how many cores? You should make sure the number of threads matches the number of cores for best performance; otherwise the threads will just fight each other over work that a single thread could still handle across all the partitions.
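As a sketch (assuming the older ZooKeeper-based logstash-input-kafka plugin; the zk_connect value is a placeholder), the thread count is set on the consumer side:

input {
  kafka {
    zk_connect => "zk1:2181"
    topic_id => "log_stats"
    consumer_threads => 3 # ideally no more than the core count, matched to the partition count
  }
}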
Other thought: get those nginx logs into a JSON format so there is almost zero processing left for Logstash to do, and you get those timestamps synced up. Check out this post for some guidance: https://blog.pkhamre.com/logging-to-logstash-json-format-in-nginx/
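A minimal sketch of such an nginx config, along the lines of the linked post (the format name and the field selection are illustrative, not prescriptive):

log_format logstash_json '{ "@timestamp": "$time_iso8601", '
                         '"host": "$host", '
                         '"request": "$request", '
                         '"status": $status, '
                         '"bytes": $body_bytes_sent }';

access_log /var/log/nginx/access.log logstash_json;

With a json codec on the Logstash input, each line then arrives pre-parsed and no grok work is needed.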
Yes, it's quite simple. The data isn't processed fast enough and Logstash lags behind. Then, when the log rotation happens, the data is cut off: all log lines that haven't been read yet are lost, and we see a gap in the graphs.