Must delete files in message queue to start logstash

For some reason if i restart logstash and there is a few gig of messages in the queue logstash complains that there is a config error

[2017-01-17T20:34:41,732][ERROR][logstash.agent           ] fetched an invalid config {:config=>"input {\n  beats 
blah blah blah
:reason=>"first seqNum=0 is different than minSeqNum=2863979"}

There is nothing wrong with the config. I've tested the config many times it is fine.
If i delete all the data in the logstash queue directory, logstash starts up just fine.
anyone else experience any issues like this?

What does your config look like?

filter {
  mutate { 
    add_field => {
      "vcid" => "1"
      "collector" => "logstash1"
    }
  }
  if [beat][name] =~ /^(filebeat|metricbeat)$/ {
    mutate {
      remove_field => [ "host" ]
      rename => { "[beat][hostname]" => "hostname" }
      lowercase => [ "hostname" ]
    }
  }
  if [beat][name] == 'filebeat' {
    if [source] =~ /^\/var\/log\/(maillog|messages|secure)$/ {
      grok {
        match => {
          "message" => [
            "%{SYSLOGBASE} %{GREEDYDATA:logmsg}",
            "%{TIMESTAMP_ISO8601:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} %{SYSLOGPROG}: %{GREEDYDATA:logmsg}"
          ]
        }
      }
      mutate {
        strip => [ "logmsg" ]
      }
    }
    if [program] == 'pdns_recursor' {
      grok {
        match => {
          "logmsg" => [
            "question answered from packet cache from %{IP:src_ip}",
            "^[0-9] \[%{GREEDYDATA}\] question for \'%{DATA:query}\.\|%{WORD:qtype_name}\' from %{IP:src_ip}",
            "^[0-9] \[%{GREEDYDATA}\] answer to question \'%{DATA:query}\.\|%{DATA:qtype_name}\': %{GREEDYDATA}"
          ]
        }
      }
    }
    if [program] == 'sshd' {
      grok {
        match => {
          "logmsg" => [
            "pam_unix\(sshd:session\): session opened for user %{USERNAME:username}",
            "Accepted publickey for %{USERNAME:username} from %{IP:src_ip} port %{POSINT:src_port}",
            "Accepted password for %{USERNAME:username} from %{IP:src_ip} port %{POSINT:src_port}",
            "pam_unix\(sshd:session\): session closed for user %{USERNAME:username}",
            "Received disconnect from %{IP:src_ip}:",
            "Did not receive identification string from %{IP:src_ip}",
            "invalid user %{USERNAME:username}",
            "Invalid user %{USERNAME:username} from %{IP:src_ip}",
            "Connection closed by %{IP:src_ip}",
            "User %{USERNAME:username} from %{IP:src_ip} not allowed"
          ]
        }
      }
    }
    if [program] == 'sudo' {
      grok {
        match => {
          "logmsg" => [
            "^%{USERNAME:username} : %{GREEDYDATA} ; COMMAND=%{GREEDYDATA:command}"
          ]
        }
      }
    }
    if [program] == 'systemd' {
      if [logmsg] =~ /^(Started|Starting) Session/ {
        grok {
          match => {
            "logmsg" => [
              "^(Started|Starting) Session %{POSINT:session_id} of user ((%{USERNAME:username})\.|(%{USERNAME:username}))"
            ]
          }
        }
      }
    }
    if [program] == 'systemd-logind' {
      grok {
        match => {
          "logmsg" => [
             "New session %{POSINT:session_id} of user %{USERNAME:username}",
             "Removed session %{POSINT:session_id}"
          ]
        }
      }
    }
    if [source] =~ /^\/var\/log\/httpd\// {
      grok {
        add_field => { "program" => "apache" } 
        match => {
          "message" => [
            "%{IP:src_ip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] \"(?:%{WORD:method} %{NOTSPACE:uri}(?: HTTP/%{NUMBER:http_version})?|%{DATA:rawrequest})\" %{NUMBER:status_code} (?:%{NUMBER:bytes}|-) %{QS:referrer} %{QS:user_agent}",
            "%{IP:src_ip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] \"(?:%{WORD:method} %{NOTSPACE:uri}(?: HTTP/%{NUMBER:http_version})?|%{DATA:rawrequest})\" %{NUMBER:status_code} (?:%{NUMBER:bytes}|-)",
            "\[%{HTTPDERROR_DATE:timestamp}\] \[:%{LOGLEVEL:loglevel}\] \[pid %{POSINT:pid}\] \[client %{IP:src_ip}:%{POSINT:src_port}\] %{GREEDYDATA:logmsg}, referer: %{NOTSPACE:referrer}"
          ]
        }
      }
      mutate {
        strip => [ "logmsg" ]
      }
    }
    date {
      match => [
        "timestamp",
        "MMM dd YYY HH:mm:ss",
        "MMM  d YYY HH:mm:ss",
        "MMM  d HH:mm:ss",
        "MMM dd HH:mm:ss",
        "dd/MMM/yyyy:HH:mm:ss Z",
        "EEE MMM dd HH:mm:ss.SSSSSS yyyy",
        "ISO8601"
      ]
    }
    if "_grokparsefailure" in [tags] {
      mutate {
        add_field => { "parse_failure" => "true" }
      }
    }
    if "_dateparsefailure" in [tags] {
      mutate {
        add_field => { "parse_failure" => "true" }
        rename => { "timestamp" => "log_timestamp_err" }
      }
    }
    if [parse_failure] != "true" {
      mutate {
        remove_field => [ "message" ]
        remove_field => [ "timestamp"]
      }
    }
  }
  if [beat][name] == 'metricbeat' {
    if [metricset][module] == 'system' {
      mutate {
        rename => { "[system][cpu]" => "[cpu]" }
        rename => { "[system][core]" => "[core]" }
        rename => { "[system][diskio]" => "[diskio]" }
        rename => { "[system][filesystem]" => "[filesystem]" }
        rename => { "[system][fsstat]" => "[fsstat]" }
        rename => { "[system][load]" => "[load]" }
        rename => { "[system][memory]" => "[memory]" }
        rename => { "[system][network]" => "[network]" }
        rename => { "[system][process]" => "[process]" }
        rename => { "[process][username]" => "username" }
      }
    }
  }
}

i stripped the input and output

this is my logstash.yml file.
logstash is running on a dedicated virtual machine with 8gb of ram and 8 cpu's

---
config: 
  debug: false
  reload: 
    automatic: true
    interval: 60
  test_and_exit: false
http: 
  host: 0.0.0.0
  port: 9600
log: 
  format: plain
  level: info
node: 
  name: vesl1.vigilant.dna
path: 
  config: /etc/logstash/conf.d
  data: /var/lib/logstash
  logs: /opt/data/logs/logstash
  plugins: []
  queue: /var/lib/logstash/queue
pipeline: 
  batch: 
    delay: 5
    size: 5000
  output: 
    workers: 4
  unsafe_shutdown: false
  workers: 12
queue: 
  checkpoint: 
    acks: 1024
    interval: 1000
    writes: 1024
  max_bytes: 8192mb
  max_events: 0
  page_capacity: 250mb
  type: persisted

also, like I said, the logstash config test passes if I delete all the messages in the queue

It also takes forever just to test the config..... notice before i delete the messages in the queue it doesn't say if anything is good or bad... then after i delete the messages it says the config is good.

[panaman@logstash1 ~]$ date;sudo /usr/share/logstash/bin/logstash -t -f /etc/logstash/conf.d/logstash.conf --path.settings /etc/logstash --log.level=debug --config.debug;date
Wed Jan 18 15:50:27 UTC 2017
Sending Logstash's logs to /opt/data/logs/logstash which is now configured via log4j2.properties
Wed Jan 18 15:51:10 UTC 2017        
[panaman@logstash1 ~]$ ls /data/queue/
checkpoint.1     checkpoint.12    checkpoint.4     checkpoint.7     checkpoint.head  page.11          page.2           page.5           page.8
checkpoint.10    checkpoint.2     checkpoint.5     checkpoint.8     page.1           page.12          page.3           page.6           page.9
checkpoint.11    checkpoint.3     checkpoint.6     checkpoint.9     page.10          page.13          page.4           page.7           
[panaman@logstash1 ~]$ sudo rm -f /data/queue/*
[panaman@logstash1 ~]$ date;sudo /usr/share/logstash/bin/logstash -t -f /etc/logstash/conf.d/logstash.conf --path.settings /etc/logstash --log.level=debug --config.debug;date
Wed Jan 18 15:51:54 UTC 2017
Sending Logstash's logs to /opt/data/logs/logstash which is now configured via log4j2.properties
Configuration OK
Wed Jan 18 15:53:05 UTC 2017

Same error here.
How to fix?

some more awesomeness

[2017-01-23T13:41:19,787][WARN ][io.netty.channel.DefaultChannelPipeline] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.io.IOException: Too many open files
	at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) ~[?:1.8.0_111]
	at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422) ~[?:1.8.0_111]
	at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250) ~[?:1.8.0_111]
	at io.netty.channel.socket.nio.NioServerSocketChannel.doReadMessages(NioServerSocketChannel.java:135) ~[netty-all-4.1.3.Final.jar:4.1.3.Final]
	at io.netty.channel.nio.AbstractNioMessageChannel$NioMessageUnsafe.read(AbstractNioMessageChannel.java:75) [netty-all-4.1.3.Final.jar:4.1.3.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:571) [netty-all-4.1.3.Final.jar:4.1.3.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:512) [netty-all-4.1.3.Final.jar:4.1.3.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:426) [netty-all-4.1.3.Final.jar:4.1.3.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:398) [netty-all-4.1.3.Final.jar:4.1.3.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:877) [netty-all-4.1.3.Final.jar:4.1.3.Final]
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) [netty-all-4.1.3.Final.jar:4.1.3.Final]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_111]

logstash should have plenty of room with 65535

logstash - nofile 65535
logstash - nproc 32768
logstash - memlock unlimited

This is an error reported by the persistent queue - it is saying that it is expecting a sequence number of 2863979 for a particular page of the queue. the first event on this page has a sequence number of zero - meaning that something went wrong when the first event was written to the page. We will improve the error message to tell us which page this occurred on.

We are looking into this... A) how can we fix the data and B) how can we fix the code (bug).

IMPORTANT: Do not delete the pages and checkpoint files unless you want to throw out your events and can recreate them. If you need to keep the events then move those files to another folder. Logstash will start in this case.

@panaman For some reason the persistent queue exception at startup "first seqNum=0 is different than minSeqNum=2863979" is reported as a config error. It is NOT config related and there is a bug in bubbling up that exception and we will look into that.

The real problem here is related to a queue data page that is incorrectly encoded and when LS startup and try to open that page it stops with that exception (which is incorrectly reported as a config error)

We have received other similar reports for that problem and we are looking into it. As you figured out, one way to solve it is to either delete the queue directory content and restart LS - BUT OBVIOUSLY this will delete the queue content and you will loose any unprocessed data in that queue. You can also simply move/rename the queue directory to another name and restart LS, this way you can keep the troublesome queue content aside until we figure out the problem.

Also, the java.io.IOException: Too many open files is probably NOT related to the same issue here.

@panaman It would be helpful if you could share more details about the context of this exception. In particular: how was LS stopped prior to restarting it and you first got that exception? did it crashed or was it a graceful shutdown? did you notice any error logs before restarting?

It seems like its not rotating the page files properly either. they build up and then eventually when it hits the limit queue.max_bytes: is when the hard crash happens

I notice the problem happen when a new config gets pushed down ( i have the auto reload of the logstash.conf turned on) or if I "systemctl restart logstash"

I am sending my events to kafka now, and just switched it from persistent queue to memory queue. Logstash is performing well in this configuration.

@panaman thanks for the info, a few observations/questions:

  • page purge only happens when a page events are fully sent through the outputs. If the queue grows up to queue.max_bytes this means that the outputs are slower than the input and when queue.max_bytes is reached, it will start to apply back-pressure on the inputs. Unless there is something else going on here, when switching to using Kafka, with the same data context, you should see similar events accumulation in Kafka.

  • you mention a "hard crash" - when exactly does this happen and do you have any logs/exception report for that crash?

  • did you first notice that problem with a config reload or logstash restart OR did it start to occur after that crash you mentioned above?

Thanks a lot for your feedback, this is all very valuable information.

it seems to happen with a config reload or restart, not sure which one is first, i believe that just loading the new config is what causes the issue. The only error I see on my logstash server is that too many open files error when the hard crash happens. This machine is dedicated to accepting filebeat and metricbeat in our environement. (8 cpu cores and 8gb of ram). It performs no other duty. I would think the limits I have set would be just fine.

Also I am not seeing any back pressure on kafka.(i've been running in the new memory queue config for the last 2 days) I have a feeling that the bug you mentioned above is why the page files are not being deleted in the first place.

I hope this helps

@panaman so, once the "first seqNum=0 is different than minSeqNum=xxxx" exception happen on startup, it will always happen, as you figured, unless you wipe/moved the queue data dir. What I am trying to understand is exactly what happened before that first restart exception occurred.

  • Did you first get a too many open files error which crashed logstash and then upon restart the "first seqNum=0 is different than minSeqNum=xxxx" exception occurred?

  • Can you share or describe you config, including input(s) and output(s) prior to using Kafka and then with Kafka?

I was always sending to kafka, i recently changed it from persistent to memory. I figured more than 1 queue is better. My filter is above so I won't include that.

the first seqNum came first, too many open files error came later.
I could probably put it back to persistent queues again and test

INPUT

input {
  beats {
    port => "5045"
    host => "10.10.25.78"
    ssl => true
    ssl_certificate_authorities => ["/etc/ssl/ca-chain.crt"]
    ssl_certificate => "/etc/ssl/ssl_cert.crt"
    ssl_key => "/etc/ssl/ssl.p8.key"
    ssl_verify_mode => "force_peer"
    include_codec_tag => false
    tls_min_version => "1.2"
  }
}

OUTPUT

output {
  kafka {
    codec => json_lines
    topic_id => "logstash"
    security_protocol => "SSL" 
    bootstrap_servers => "kafka1:9093"
    ssl_keystore_location => "/etc/ssl/keystore.jks"
    ssl_keystore_password => "password"
    ssl_truststore_location => "/etc/ssl/ca-chain.jks"
    ssl_truststore_password => "password"
  }
}

@panaman Ok, so there was no crash prior to the first "first seqNum=0 is different than minSeqNum=xxxx" exception, it first occurred following a config reload?

If your memory queuing works, I would suggest you delay re-enabling persistent queue until we nail that bug otherwise you risk having the same problem. Of course, if you are willing to do some more diagnosis to help us figure the bug cause it will definitely be helpful :smiley:

yeah its running good with memory queue.... not using too many resources.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                        
 4606 logstash  39  19 11.997g 1.534g  17328 S  19.0 20.1  11:19.53 java      

I am basically in a testing phase and working on logstash filters so I can put it back to persistent to help you troubleshoot. Is there any other settings you would like me to set? Debug on? anything like that....

I too have experience the config error thats not a config error in regards to using the persisted queue. I have also witnessed the queue fill up when my three logstash nodes were doing well under the events per second out that my elasticsearch cluster has demonstrated it can do in. When this happened, there was no log output and no data ingestion at all. I had hundreds of queue files and all the logs for elasticsearch and logstash were business as usual. However, all the beats clients reported a connection timeout. I restarted logstash and got the dreaded seq error and went back to the in-memory queue at which everything started flowing again.

I can also replicate the issue very easily by either config testing on the same host as the logstash instance running with persisted queue or by having the auto reload enabled and reloading the pipeline. I wasn't expecting however to have the queue fill up and then generate the error even though I had been purposeful in not reloading the config or running a config test while logstash was up.

We finally found & fixed a important autoreload problem related to PQ, plus a few other important PQ issues. These will be part of the 5.3.0 release and we are looking at backporting these fixes into 5.2.2.

3 Likes

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.