LSF Error: Skipping file (older than dead time of 24h0m0s):


(Fatema Bannat Wala) #1

Hi,

I have an ELK cluster, plus some files in an input dir that I need to forward to it via LSF. My ELK stack and LSF run on the same machine. When I restarted ELK and LSF, I got the following messages in the LSF error log:

2015/10/24 15:58:10.816007 Launching harvester on new file: /mnt/ssd/ELK_input/ELK-input/oath.log
2015/10/24 15:58:10.816233 Launching harvester on new file: /mnt/ssd/ELK_input/ELK-input/vpn_20151020.log
2015/10/24 15:58:10.816268 Launching harvester on new file: /mnt/ssd/ELK_input/ELK-input/syslog
2015/10/24 15:58:10.816496 Skipping file (older than dead time of 24h0m0s): /mnt/ssd/ELK_input/ELK-input/communication.12:00:00-13:00:00.log
2015/10/24 15:58:10.816524 Skipping file (older than dead time of 24h0m0s): /mnt/ssd/ELK_input/ELK-input/communication.13:00:00-14:00:01.log
2015/10/24 15:58:10.816541 Skipping file (older than dead time of 24h0m0s): /mnt/ssd/ELK_input/ELK-input/communication.14:00:01-15:00:00.log
2015/10/24 15:58:10.816557 Skipping file (older than dead time of 24h0m0s): /mnt/ssd/ELK_input/ELK-input/communication.15:00:00-16:00:00.log
2015/10/24 15:58:10.816575 Skipping file (older than dead time of 24h0m0s): /mnt/ssd/ELK_input/ELK-input/communication.16:00:00-17:00:00.log
2015/10/24 15:58:10.816591 Skipping file (older than dead time of 24h0m0s): /mnt/ssd/ELK_input/ELK-input/communication.17:00:00-18:00:00.log
2015/10/24 15:58:10.816603 Skipping file (older than dead time of 24h0m0s): /mnt/ssd/ELK_input/ELK-input/communication.18:00:00-19:00:00.log
2015/10/24 15:58:10.816618 Skipping file (older than dead time of 24h0m0s): /mnt/ssd/ELK_input/ELK-input/communication.19:00:00-20:00:00.log
2015/10/24 15:58:10.816631 Skipping file (older than dead time of 24h0m0s): /mnt/ssd/ELK_input/ELK-input/communication.20:00:00-21:00:00.log
2015/10/24 15:58:10.816645 Skipping file (older than dead time of 24h0m0s): /mnt/ssd/ELK_input/ELK-input/communication.21:00:00-22:00:00.log
2015/10/24 15:58:10.816657 Skipping file (older than dead time of 24h0m0s): /mnt/ssd/ELK_input/ELK-input/communication.22:00:00-23:00:00.log
2015/10/24 15:58:10.816669 Skipping file (older than dead time of 24h0m0s): /mnt/ssd/ELK_input/ELK-input/communication.23:00:00-00:00:00.log

I copied all the input files into the /mnt/ssd/ELK_input/ELK-input/ dir at the same time, yet some of them are harvested and forwarded by LSF while others are skipped. All the log files contain events from the same day, so how is LSF choosing, seemingly at random, which files to forward and which to skip?

Following are some config details:
~Machine: Linux elk-test 3.10.0-229.7.2.el7.x86_64 #1 SMP Tue Jun 23 22:06:11 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
~ES version: 1.7
~LS version: 1.5
~Kibana version: 4.1.2
~LSF version: 0.4.0

logstash-forwarder.conf (the glob asterisks below were eaten by the forum's markdown; restored here to match the filenames in the log output above, e.g. vpn_20151020.log and communication.12:00:00-13:00:00.log):

{
  # The network section covers network configuration
  "network": {
    "servers": [ "localhost:5000" ],
    "ssl ca": "/etc/pki/tls/certs/logstash-forwarder.crt",
    "timeout": 15
  },

  # The list of files configurations
  "files": [
    {
      "paths": [ "/mnt/ssd/ELK_input/ELK-input/oath.log" ],
      "fields": { "type": "oathlog" }
    },
    {
      "paths": [ "/mnt/ssd/ELK_input/ELK-input/vpn_*.log" ],
      "fields": { "type": "vpnlog" }
    },
    {
      "paths": [ "/mnt/ssd/ELK_input/ELK-input/syslog" ],
      "fields": { "type": "syslog" }
    },
    {
      "paths": [ "/mnt/ssd/ELK_input/ELK-input/communication.*.log" ],
      "fields": { "type": "communicationlog" }
    }
  ]
}

Any help would be appreciated. Also, when I copy new files into the input dir, they are skipped as well.

Thanks,
Fatema.


(Magnus Bäck) #2

The key here is modification times of the files. Did e.g. /mnt/ssd/ELK_input/ELK-input/communication.22:00:00-23:00:00.log have a modification time earlier than 24 hours back?
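A quick way to check is GNU find's -mtime +0 test, which lists files last modified more than 24 hours ago, i.e. exactly the files the forwarder's default dead time skips. A self-contained sketch (the filenames here are made up for illustration):

```shell
# Create a scratch dir with one fresh file and one file whose mtime is
# backdated 4 days (GNU touch -d), then list files older than 24 hours.
dir=$(mktemp -d)
touch "$dir/fresh.log"
touch -d '4 days ago' "$dir/old.log"
find "$dir" -type f -mtime +0   # prints only the backdated file
```

Run against the real input directory (e.g. `find /mnt/ssd/ELK_input/ELK-input -name 'communication.*' -mtime +0`), this shows which files LSF will consider dead.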


(Fatema Bannat Wala) #3

Hi Magnus,

Thanks for the quick response.
Yes, the communication logs are old (4 days old).
I have tons of data that is months old, and I want to parse it all through LS to store it in ES (as a central log storage database). Does that mean I can't forward those logs through LSF?

Couple of more queries:

  1. How do I parse months-old data (whose modification dates are months old) through LS?

  2. Also, the majority of the log files are gzipped by the application on auto-rotation (i.e. the logs are in .gz format). How do I parse them through LS via LSF? I tried forwarding them through LSF, but they failed to parse in LS and looked ugly (human-unreadable) in Kibana, so I manually uncompressed the gz files before forwarding them to LS. Is there an easier way to do this? I have 100s of gigs of zipped logs that I don't want to unzip by hand :frowning:
    They also consume more than double the disk space when decompressed.

  3. Sometimes LSF works fine, but other times when I restart LSF and LS I get the following error:
    2015/10/24 19:01:03.930173 Connected to 127.0.0.1
    2015/10/24 19:01:08.954594 Read error looking for ack: EOF
    2015/10/24 19:01:08.954679 Setting trusted CA from file: /etc/pki/tls/certs/logstash-forwarder.crt
    2015/10/24 19:01:08.955021 Connecting to [127.0.0.1]:5000 (localhost)
    2015/10/24 19:01:09.008232 Connected to 127.0.0.1

What does the "Read error looking for ack: EOF" error mean, and how can I avoid it?

I know a couple of these queries may be trivial, but I am kind of new to ELK and searching the internet hasn't helped.

Thanks,
Fatema.


(Magnus Bäck) #4

I have tons of data that is months old, and I want to parse it all through LS to store it in ES (as a central log storage database). Does that mean I can't forward those logs through LSF?

You can, but you'll have to adjust the dead time configuration parameter. Something like this should get it to process files up to 100 days old (2400 hours):

{
  "files": [
    {
      "paths": [...],
      "dead time": "2400h",
      ...
    },
    ...
  ]
}

How do I parse months-old data (whose modification dates are months old) through LS?

I don't think there are any gotchas, except that logs whose timestamp lacks a timezone and/or year could be tricky.

Also, the majority of the log files are gzipped by the application on auto-rotation (i.e. the logs are in .gz format). How do I parse them through LS via LSF?

You'll have to gunzip them first. If you don't have room to gunzip them all you'll have to write a script that gunzips one log file and makes sure it's processed, then deletes that log file and continues with the next. Unfortunately, Logstash's file input has no concept of "done" so it can't tell you when you can delete the file. You could use the stdin input which allows Logstash to terminate upon EOF, but if Logstash is interrupted it won't be able to pick up where it left off. Perhaps splitting a single gunzipped file into multiple files would be an option. Or you could read the sincedb file to determine whether a file has been completely processed by the file input.
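The one-file-at-a-time idea can be sketched in shell: stream each archive with zcat so no uncompressed copy ever lands on disk, and delete the archive only after processing succeeds. In this self-contained demo, `wc -l` stands in for the real Logstash stdin-input invocation, and all paths are made up:

```shell
# Demo: process gzipped logs one at a time without ever storing the
# uncompressed data on disk.
dir=$(mktemp -d)
printf 'a\nb\nc\n' > "$dir/app.log"
gzip "$dir/app.log"                 # only app.log.gz remains on disk
for f in "$dir"/*.gz; do
  # In real use, replace `wc -l` with something like:
  #   /opt/logstash/bin/logstash -f stdin-pipeline.conf
  zcat "$f" | wc -l && rm "$f"      # delete only if processing succeeded
done
```

The caveat Magnus raises still applies: with the stdin input, an interrupted Logstash run cannot resume mid-file, so keep the per-file unit of work small.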


(Fatema Bannat Wala) #5

Thanks for your comments. I will try them.

Also, now I am running into another issue. Whenever I restart LS, I get these entries in the logstash.log file:

{:timestamp=>"2015-10-23T12:07:07.446000-0400", :message=>"CircuitBreaker::rescuing exceptions", :name=>"Lumberjack input", :exception=>LogStash::SizedQueueTimeout::TimeoutError, :level=>:warn}
{:timestamp=>"2015-10-23T12:07:07.447000-0400", :message=>"Lumberjack input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::CircuitBreaker::HalfOpenBreaker, :level=>:warn}
{:timestamp=>"2015-10-23T12:07:07.476000-0400", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-10-23T12:07:07.519000-0400", :message=>"CircuitBreaker::Open", :name=>"Lumberjack input", :level=>:warn}
{:timestamp=>"2015-10-23T12:07:07.521000-0400", :message=>"Lumberjack input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::CircuitBreaker::OpenBreaker, :level=>:warn}
{:timestamp=>"2015-10-23T12:07:09.027000-0400", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-10-23T12:07:10.713000-0400", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-10-23T12:07:12.217000-0400", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-10-23T12:07:13.969000-0400", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}

.... and many many more

And because of this, LSF is also unable to connect to LS, which I think is why I am getting the "Read error looking for ack: EOF" error in the LSF log file.
I have read many posts online suggesting a variety of remedies: some say to increase the number of LS instances on the machine, some say it's probably the filters (the multiline ones in particular), etc. I don't know the right fix yet.

I read the ES documentation at https://www.elastic.co/guide/en/logstash/current/advanced-pipeline.html but it doesn't help. One of the documents also says that Logstash sets the size of each queue to 20, meaning a maximum of 20 events can be pending for the next stage. Is this the problem, given that LSF is shipping 1000s of events at the same time and I have six filter files in Logstash's conf.d dir along with one input and one output file?

Can you provide a fix or any suggestions for this?


(Magnus Bäck) #6

{:timestamp=>"2015-10-23T12:07:07.447000-0400", :message=>"Lumberjack input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::CircuitBreaker::HalfOpenBreaker, :level=>:warn}

Is there anything in the log indicating why the pipeline is clogged? What outputs do you have?

One of the documents also says that Logstash sets the size of each queue to 20, meaning a maximum of 20 events can be pending for the next stage. Is this the problem, given that LSF is shipping 1000s of events at the same time and I have six filter files in Logstash's conf.d dir along with one input and one output file?

No, that's not the problem. The problem is that your output is way slower than your input.


(Fatema Bannat Wala) #7

In logstash.log :

{:timestamp=>"2015-10-26T13:05:40.351000-0400", :message=>"CircuitBreaker::rescuing exceptions", :name=>"Lumberjack input", :exception=>LogStash::SizedQueueTimeout::TimeoutError, :level=>:warn}
{:timestamp=>"2015-10-26T13:05:40.357000-0400", :message=>"Lumberjack input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::CircuitBreaker::HalfOpenBreaker, :level=>:warn}
{:timestamp=>"2015-10-26T13:05:45.587000-0400", :message=>"CircuitBreaker::rescuing exceptions", :name=>"Lumberjack input", :exception=>LogStash::SizedQueueTimeout::TimeoutError, :level=>:warn}
{:timestamp=>"2015-10-26T13:05:45.593000-0400", :message=>"Lumberjack input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::CircuitBreaker::HalfOpenBreaker, :level=>:warn}
{:timestamp=>"2015-10-26T13:05:50.837000-0400", :message=>"CircuitBreaker::rescuing exceptions", :name=>"Lumberjack input", :exception=>LogStash::SizedQueueTimeout::TimeoutError, :level=>:warn}
{:timestamp=>"2015-10-26T13:05:50.840000-0400", :message=>"Lumberjack input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::CircuitBreaker::HalfOpenBreaker, :level=>:warn}
{:timestamp=>"2015-10-26T13:05:55.995000-0400", :message=>"CircuitBreaker::rescuing exceptions", :name=>"Lumberjack input", :exception=>LogStash::SizedQueueTimeout::TimeoutError, :level=>:warn}

{:timestamp=>"2015-10-26T13:06:01.752000-0400", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-10-26T13:06:02.253000-0400", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-10-26T13:06:02.754000-0400", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-10-26T13:06:03.256000-0400", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-10-26T13:06:03.757000-0400", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-10-26T13:06:04.257000-0400", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
...& so on

In logstash-forwarder.err:

2015/10/26 13:05:45.592768 Setting trusted CA from file: /etc/pki/tls/certs/logstash-forwarder.crt
2015/10/26 13:05:45.593409 Connecting to [::1]:5000 (localhost)
2015/10/26 13:05:45.673180 Connected to ::1
2015/10/26 13:05:50.840360 Read error looking for ack: EOF
2015/10/26 13:05:50.840492 Setting trusted CA from file: /etc/pki/tls/certs/logstash-forwarder.crt
2015/10/26 13:05:50.841120 Connecting to [127.0.0.1]:5000 (localhost)
2015/10/26 13:05:50.910191 Connected to 127.0.0.1
2015/10/26 13:05:56.000465 Read error looking for ack: EOF
..... & so on..

& in elasticsearch.log I only have these license errors:

[2015-10-26 12:09:04,341][INFO ][http ] [tsg1.node1_test] bound_address {inet[/127.0.0.1:9200]}, publish_address {inet[localhost/127.0.0.1:9200]}
[2015-10-26 12:09:04,345][INFO ][node ] [tsg1.node1_test] started
[2015-10-26 12:09:04,362][INFO ][watcher ] [tsg1.node1_test] starting watch service...
[2015-10-26 12:09:04,364][INFO ][license.plugin.core ] [tsg1.node1_test] license for [watcher] - valid
[2015-10-26 12:09:04,368][ERROR][watcher.license ] [tsg1.node1_test]

[2015-10-26 12:09:04,368][INFO ][shield.license ] [tsg1.node1_test] enabling license for [shield]
[2015-10-26 12:09:04,369][INFO ][license.plugin.core ] [tsg1.node1_test] license for [shield] - valid
[2015-10-26 12:09:04,369][ERROR][shield.license ] [tsg1.node1_test]

I have only one output:

output {
  elasticsearch {
    host => "localhost"
    protocol => "http"
    user => "user"
    password => "xyz"
    sniffing => true
  }
}

Also, I have checked the following variables related to JVM heap memory locking, and all are set per the recommended settings (http://mrzard.github.io/blog/2015/03/25/elasticsearch-enable-mlockall-in-centos-7/):

$ grep MAX_LOCKED_MEMORY /etc/sysconfig/elasticsearch
MAX_LOCKED_MEMORY=unlimited
$ grep mlock /etc/elasticsearch/elasticsearch.yml
bootstrap.mlockall: true
$ grep LimitMEMLOCK /usr/lib/systemd/system/elasticsearch.service
LimitMEMLOCK=infinity

Anything to change or add in the config?


(Fatema Bannat Wala) #8

Also, when I run Kibana against ES, I do get events every minute, but at a really low rate, about 250-350 events per 30 seconds, with no error logs in ES, while Logstash keeps showing the same "Lumberjack input: the pipeline is blocked" messages as above. So if the pipeline is blocked, how am I getting new events every 30 seconds in Kibana? Also, whenever I stop LSF, I stop getting any events in Kibana for that period.

So the question is: even though LS is telling LSF that the pipeline is blocked, and the LSF error log keeps showing "Read error looking for ack: EOF", how do I keep getting events in Kibana at such a low rate, and why does it stop when LSF is stopped?

Event processing is very slow (300 per 30 sec) right now. Is there any way to flush the current pipeline in LS so that it starts accepting events from LSF at a higher rate? Would doing that solve the problem, or should I roll back to version 0.3 of LSF?

Any suggestions?

Thanks.


(Fatema Bannat Wala) #9

I figured out the problem.....
The problem is with my filters: when the filters are in place the input stalls and hence the ES event rate is low, but when I removed all of my filters it worked just fine (100K events/minute). Of course, the output is then unparsed.

Finally, is there any way to get this working with the filters in place? Do I need to increase the number of filter worker threads, and if so, how can I do that on the same machine the LS instance is running on (I don't want to add servers, just run LS instances on this one)? Any suggestions???


(Magnus Bäck) #10

What kind of filters do you have? Are you saturating the CPUs or are you using the dns filter and experiencing slow DNS lookups? With "normal" filters event rates of several hundred per second shouldn't be a problem even with a single filter worker thread.
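On the filter-worker question from the previous post: Logstash 1.x exposes a -w/--filterworkers command-line flag (default 1) to run multiple filter worker threads on one machine; the paths and worker count below are illustrative, not from this thread. Note that some 1.x filters, notably multiline, are not thread-safe, and Logstash will refuse to start with more than one worker while they are configured.

```shell
# Illustrative invocation: run Logstash with 4 filter worker threads.
/opt/logstash/bin/logstash -f /etc/logstash/conf.d -w 4
```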


(Fatema Bannat Wala) #11

Yeah, I started putting the filters in place one by one and watched whether they clogged the pipeline. With four of the six filters in place the ELK stack works just fine without any errors, but when I add the other two filters it starts blocking the pipeline, and I start getting the "Read error looking for ack" error in LSF and the "pipeline is blocked" error in LS.

So I am assuming that something goes wrong while parsing events through those two filters, slowing down LS's event parsing rate and clogging its input pipeline. I will investigate further why LS behaves differently for different filters and then post an update here.

Also, I have plenty of memory and CPU on the machine: 16 cores and 64G of memory.
Logstash is using around 104% CPU (i.e. a little more than one core) and Elasticsearch is using ~22G of memory (which matches the settings), so resources should not be the problem here.

Thanks.


(system) #12