Big gaps in recorded logs

We're running Logstash on a single server to compare its performance to Heka and decide which one we want to use. We have some high-traffic web servers whose log lines are sent to a Kafka cluster of 3 nodes. With Heka, all log messages come in without a problem, but with Logstash we see enormous gaps in the recorded log messages. As in: gaps of more than 10 minutes.

To minimize the load on the web server, Logstash basically does no parsing. The only things it adds to each log line are the host name and a delimiter character. The Logstash configuration:

input {
  file {
    path => "/var/log/nginx/access.log"
    start_position => "beginning" # if we haven't seen the file before
  }
}
output {
  kafka {
    broker_list => "kafka1:9092,kafka2:9092,kafka3:9092"
    topic_id => "log_stats"
    request_required_acks => 1
    codec => plain {
      format => '%{host}¦%{message}'
    }
    producer_type => async
  }
}

The only overridden default setting is the heap size:

LS_HEAP_SIZE="1500m"

Resulting in the following running process:

$ ps aux | grep logstash
logstash 41920  9.6  0.5 4385308 669724 pts/3  SNl  13:01   4:08 /usr/bin/java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/var/lib/logstash -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintClassHistogram -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:./logstash-gc.log -Xmx1500m -Xss2048k -Djffi.boot.library.path=/opt/logstash/vendor/jruby/lib/jni -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -Djava.awt.headless=true -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/var/lib/logstash -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintClassHistogram -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:./logstash-gc.log -Xbootclasspath/a:/opt/logstash/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/opt/logstash/vendor/jruby -Djruby.lib=/opt/logstash/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main --1.9 /opt/logstash/lib/bootstrap/environment.rb logstash/runner.rb agent -f /etc/logstash/conf.d -l /var/log/logstash/logstash.log --verbose

We're running Oracle Java 1.7.0_80.

So far we haven't seen any errors whatsoever in /var/log/logstash/*. Even with the log level lowered to info we don't see any alarming messages.

The number of log lines added to the monitored files is 6M per hour = 100k per minute (roughly 1,700 per second). The server in question has 20 cores and 128 GB of RAM. The load is quite low: the active Logstash thread uses about 5-13% of a single CPU core and 0.5% of the RAM.

Is there something in this setup that's completely off, causing these long periods of inactivity? We've manually checked the access logs and confirmed that there is activity during the gaps.

I'd suggest looking at the heartbeat plugin to see if you can determine whether LS is pausing or if it might be something else.
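A minimal sketch of what that could look like, assuming the logstash-input-heartbeat plugin is installed (the interval, message, and the stdout output are only illustrative):

input {
  heartbeat {
    interval => 10          # emit one marker event every 10 seconds
    message  => "heartbeat"
    type     => "heartbeat"
  }
}
output {
  if [type] == "heartbeat" {
    # timestamped marker events; a gap here suggests Logstash itself stalled,
    # while a steady heartbeat during a file gap points at the file/Kafka side
    stdout { codec => rubydebug }
  }
}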

You can also try running in --verbose and/or --debug modes, but that'll get pretty noisy.
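Based on the paths visible in the ps output above, a short foreground test run could look something like this (a sketch; redirect the output because it gets very noisy):

$ /opt/logstash/bin/logstash agent -f /etc/logstash/conf.d --debug > /tmp/logstash-debug.log 2>&1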

Are you seeing the gap in Elasticsearch? How many partitions are on the topic and how many consumer threads are there? If you have a mismatch of threads to partitions, some partitions may not be serviced immediately: a single consumer thread stays busy with partitions that are full of data and doesn't get to the others until a rebalance happens. You can check whether you are getting partition starvation by looking at the consumer offsets on those partitions for a particular consumer group, using a tool shipped with Kafka: bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker. I suspect this is similar to https://github.com/logstash-plugins/logstash-input-kafka/issues/32 - check out that thread for a more detailed discussion of the issue.

Thanks for the replies. Here's what the ConsumerOffsetChecker gives:

    schopman@kafka02001:/opt/kafka$ bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --zookeeper zk:2181 --topic stats --group stats_group
    Group       Topic                      Pid Offset          logSize         Lag             Owner
    stats_group stats                      0   19762754697     19762762247     7550            stats_group_spark02001.domain.com-1439827006595-f1bc4064-0
    stats_group stats                      1   19870546253     19870554978     8725            stats_group_spark02001.domain.com-1439827006595-f1bc4064-0
    stats_group stats                      2   19960084344     19960093310     8966            stats_group_spark02001.domain.com-1439827006595-f1bc4064-0
    stats_group stats                      3   19943964765     19943972606     7841            stats_group_spark02001.domain.com-1439827006595-f1bc4064-1
    stats_group stats                      4   19864658095     19864667346     9251            stats_group_spark02001.domain.com-1439827006595-f1bc4064-1
    stats_group stats                      5   19960056579     19960065712     9133            stats_group_spark02001.domain.com-1439827006595-f1bc4064-1
    stats_group stats                      6   19759376148     19759384078     7930            stats_group_spark02001.domain.com-1439827006595-f1bc4064-2
    stats_group stats                      7   19873547885     19873557164     9279            stats_group_spark02001.domain.com-1439827006595-f1bc4064-2
    stats_group stats                      8   19963853577     19963862948     9371            stats_group_spark02001.domain.com-1439827006595-f1bc4064-2
    stats_group stats                      9   19939534399     19939539183     4784            stats_group_spark02002.domain.com-1439827008412-68947d80-0
    stats_group stats                      10  19868892614     19868899862     7248            stats_group_spark02002.domain.com-1439827008412-68947d80-0
    stats_group stats                      11  19954080890     19954086385     5495            stats_group_spark02002.domain.com-1439827008412-68947d80-1
    stats_group stats                      12  19766966638     19766971724     5086            stats_group_spark02002.domain.com-1439827008412-68947d80-1
    stats_group stats                      13  19870931440     19870938649     7209            stats_group_spark02002.domain.com-1439827008412-68947d80-2
    stats_group stats                      14  19961013687     19961019275     5588            stats_group_spark02002.domain.com-1439827008412-68947d80-2
    stats_group stats                      15  19941549743     19941554704     4961            stats_group_spark02002.domain.com-1439827008412-f83c964a-0
    stats_group stats                      16  19871918725     19871925927     7202            stats_group_spark02002.domain.com-1439827008412-f83c964a-0
    stats_group stats                      17  23288972312     23288978001     5689            stats_group_spark02002.domain.com-1439827008412-f83c964a-1
    stats_group stats                      18  19764435777     19764440766     4989            stats_group_spark02002.domain.com-1439827008412-f83c964a-1
    stats_group stats                      19  19866098656     19866105890     7234            stats_group_spark02002.domain.com-1439827008412-f83c964a-2
    stats_group stats                      20  19960739425     19960745355     5930            stats_group_spark02002.domain.com-1439827008412-f83c964a-2
    schopman@kafka02001:/opt/kafka$

I'm actively investigating this, but does this output look healthy to you? (I don't see any issues.)

To investigate, I've let Logstash write to stdout for a bit. What I see is quite alarming. Here's an example line:

2015-08-18T11:33:26.108Z server.domain.com 1439895678.299¦200¦http://www.dom.com/sub/¦Mozilla/5.0 agent¦0.000¦...

At the start is the time at which Logstash created the message. The nginx log line follows after the host name; it starts with the Unix timestamp (1439895678.299, with millisecond precision). This timestamp translates to
Tue Aug 18 11:01:18 UTC 2015
That is a half-hour difference between the line logged by nginx and the corresponding message created by Logstash!
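For reference, the conversion can be double-checked with GNU date:

$ date -u -d @1439895678.299
Tue Aug 18 11:01:18 UTC 2015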

The gaps we see are always at the end of the hour, e.g. between 11:45 and 12:00. Considering we have hourly log rotation, it's clear that the gaps are caused by log rotation combined with Logstash lagging behind.

I'm puzzled by this behavior, because the Logstash process doesn't seem busy: it doesn't use much CPU (as written above, 5-13% of a single core) and it doesn't use the full heap size it has been assigned. Any ideas what may be causing this delay?

I wonder if you are getting bitten by nginx not really supporting log rotation. Most articles suggest a kill -USR1 in the rotation script: https://www.loggly.com/docs/log-rotate/ and https://www.digitalocean.com/community/tutorials/how-to-configure-logging-and-log-rotation-in-nginx-on-an-ubuntu-vps
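The usual shape of that is a logrotate stanza with a postrotate hook; a sketch (the pid file path is an assumption, and the hourly schedule only works if logrotate itself is run every hour from cron):

/var/log/nginx/*.log {
    hourly
    rotate 24
    missingok
    notifempty
    sharedscripts
    postrotate
        # tell nginx to reopen its log files after the rename
        [ -f /var/run/nginx.pid ] && kill -USR1 "$(cat /var/run/nginx.pid)"
    endscript
}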

Side comments ...

Looking at your offsets, the lag is what you are interested in. IMO lag in the thousands is a little high, but if it just hovers around that level then you are okay. Judging from your Owner column, it appears that you have two or three Logstash instances that read from Kafka and submit the logs to Elasticsearch, each with three threads. What type of boxes are those, specifically how many cores? You should try to make sure your thread count matches the number of cores for best performance; otherwise the threads will just fight each other for work that a single thread could still handle across all the partitions.
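If those consumers are indeed Logstash Kafka inputs, the thread count is set on the input; a sketch using the option names of the 2015-era logstash-input-kafka plugin (addresses and values here are only illustrative):

input {
  kafka {
    zk_connect       => "zk:2181"
    topic_id         => "stats"
    group_id         => "stats_group"
    consumer_threads => 3   # keep this in line with the cores and the partitions each instance owns
  }
}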

Another thought: get those nginx logs into a JSON format so there is almost zero processing that has to happen in Logstash, and you get those timestamps synced up. Check out this post for some guidance: https://blog.pkhamre.com/logging-to-logstash-json-format-in-nginx/
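A sketch of the kind of log_format that post describes (the field names are examples; note that older nginx versions do not JSON-escape values, so fields like the user agent can still break the JSON):

log_format logstash_json '{ "@timestamp": "$time_iso8601", '
                         '"remote_addr": "$remote_addr", '
                         '"request": "$request", '
                         '"status": "$status", '
                         '"request_time": "$request_time", '
                         '"http_user_agent": "$http_user_agent" }';

access_log /var/log/nginx/access.log logstash_json;

With that in place, the Logstash file input can use a json codec, and a date filter can take the event time from the log line instead of the time Logstash read it.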

@bschopman did you ever figure this out?

Yes, it's quite simple: the data isn't processed fast enough and Logstash lags behind. Then, when log rotation happens, the data is cut off. All log lines that haven't been read yet are lost, and we see a gap in the graphs.
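Since the loss happens because the rotated file stops matching the watched path while Logstash is still behind, one possible workaround (just a sketch, not something tried in this thread) is to also watch the rotated name; the file input tracks its read position per inode in the sincedb, so it can finish reading the renamed file:

input {
  file {
    # assumes hourly rotation renames access.log to access.log.1 and does not compress it immediately
    path => [ "/var/log/nginx/access.log", "/var/log/nginx/access.log.1" ]
    start_position => "beginning"
  }
}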