Filebeat 5.0.0-alpha5 multiple harvesters for same file sending the same messages over and over

I am reopening this subject because it is still an issue in alpha5.

Using Filebeat with Kafka. Very happy with the performance.

The Kafka disk filled up and Kafka stopped acknowledging the lines it received.

Filebeat continued to open multiple prospectors until the limit of 1024 was reached. Then errors.

When Kafka was rebuilt with more disk capacity, it began to receive many duplicate lines, as many as 2.4 million duplicates of one particular line. After many hours it was still receiving millions of duplicates, with 40 million lines in Kafka.

I had to add the tail_files option to the Filebeat config and delete the Filebeat registry to recover.
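Roughly, the recovery amounted to something like the following in filebeat.yml, plus stopping Filebeat and deleting its registry file before restarting. The path and layout here are illustrative, not my exact config:

filebeat.prospectors:
  - input_type: log
    paths:
      - /var/log/rails/*.log   # illustrative path, not our full prospector list
    # tail_files makes Filebeat start reading new files from the end instead of
    # the beginning, so the old backlog is skipped once the registry is gone
    tail_files: true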

We have about 1000 AWS instances feeding about 6 million lines per day into Kafka.

This is a new environment, and yes, it is production.

Great work on Beats! We are liking what we see!

Let me know if I can help more.

I should have mentioned: Filebeat 5.0 alpha 5, compiled from source about 10 days ago.

You wrote that the limit of 1024 prospectors was reached? Did you mean harvesters? How many files do you have? Can you share your config file?

Just a few days ago we added the config option harvester_limit, which allows you to limit the number of open files: https://github.com/elastic/beats/pull/2417

It is not in alpha5 and is only available in the nightly builds: https://beats-nightlies.s3.amazonaws.com/index.html?prefix=filebeat/
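Roughly, it is set per prospector, something like this (the path and the number are just an example):

filebeat.prospectors:
  - input_type: log
    paths:
      - /var/log/rails/*.log
    # limits how many harvesters (and therefore open files) this prospector
    # starts in parallel; the default of 0 means no limit
    harvester_limit: 500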

What kind of errors did you see on the filebeat side?

Thanks for the quick reply..

To answer your questions briefly (it was the middle of the night when I
opened this ticket):

The output of lsof (which I no longer have) showed that the harvester had
the same file(s) open multiple times. Mostly Rails log files, which number
in the tens and of which only about 10 should be active at any one time via
10 Passenger procs.

The config is very standard (I will add it later), monitoring /var/log/messages,
/var/log/cron, /var/log/httpd, and /var/log/rails on an AWS-brand Linux
(CentOS) system. I saw the post about harvester_limit (my Beats build is too
old), but I do not think it would solve my issue.
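For what it is worth, the shape of the config is roughly the following (written from memory; the broker address and topic name are placeholders, not our real values):

filebeat.prospectors:
  - input_type: log
    paths:
      - /var/log/messages
      - /var/log/cron
      - /var/log/httpd/*
      - /var/log/rails/*

output.kafka:
  hosts: ["kafka.internal.example:9092"]   # placeholder broker address
  topic: "filebeat-logs"                   # placeholder topic name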

I believe I may have figured out what my specific recovery problem was, and
I will open a new ticket on that subject.

To put a sharper point on this issue, it can be described like this:

  1. Beats is running well, delivering about 4 million log lines a day to
    Kafka.
  2. Kafka disk fills up and Kafka stops accepting log lines (we also saw
    this once earlier when the Kafka endpoint was wrong, so it may simply be
    triggered by any non-functional endpoint).
  3. Beats opens too many files and keeps them open, often opening files that
    it already has open until it reaches the ulimit (1024 in my case).

The recovery is possibly a different issue. I will explain separately.

Obviously 3) should not happen. The expected behaviour is that filebeat just stops sending data and keeps the files open (which can lead to full disks). If files are opened multiple times, something is clearly going wrong. Are you sure it is not opening new files which are generated during the time the output is blocked? It could be that filebeat keeps retrying to send the same line and it never gets acked by Kafka; that would explain the exact same line. Can you verify that filebeat really opens the same file multiple times, and not new files with the same name? Could you also check the most recent snapshot to see if you still hit this issue? https://beats-nightlies.s3.amazonaws.com/index.html?prefix=filebeat/

Thanks again for your reply and attention...

The event (and logs) are long gone. I can just speak from memory.

I do know that we have the prospectors set thus:

/var/log/cron --> not a problem
/var/log/messages --> not a problem
/var/log/httpd/* --> Not sure on this one
/var/log/rails/* --> This one is a problem for certain

Rails is run under Passenger. Passenger spawns multiple processes and our logging outputs to a log file identified by the Passenger PID (e.g. 56032.log). This is to prevent the interleaving of multi-line logs by Rails.

As the Passenger procs come and go, a multitude of log files are created. There may be a few hundred created over the course of a day, and once the Passenger proc dies, its file is never updated again.
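As a side note on these stale files: how quickly Filebeat releases them is governed by per-prospector options such as close_inactive (close_older in older configs) and ignore_older. A rough sketch of that kind of setting, with illustrative values rather than what we actually run:

filebeat.prospectors:
  - input_type: log
    paths:
      - /var/log/rails/*.log
    # release the file handle once a file has produced no new lines for a while
    close_inactive: 5m
    # skip files whose last modification is older than this during scans
    ignore_older: 24h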

All is good when the Kafka endpoint remains available. The problem arose when Kafka stopped accepting input.

Earlier, we had a similar incident. We configured filebeat to output to a Kafka endpoint that did not even exist yet. The same problem arose, with filebeat logging errors that too many files were open.

Reviewing open files with lsof showed that there were multiple filebeat PIDs with the same log file open. Something went awry.

I am taking steps to catch Filebeat in the act of opening duplicate files. If I catch it, I will add more detail.

For the most part, I can keep our Kafka healthy, so I do not anticipate this happening again. I am writing this to help others who may run into this situation and perhaps help you narrow down an obscure edge case.

Thanks again for your help!

I can confirm seeing multiple harvesters open for the same file as well with 5.0.0 alpha 5. We don't have Kafka in the mix in our setup, but we definitely saw this when globs were used to define the sources. I will try to reproduce and report back.

@viveklak @Tim_Burt Do you have any symlinks in your log files?

No symlinks. I tried a bit to reproduce. As such, the problem doesn't necessarily seem to be the globbing itself, but rather the potential raciness of the prospector scan in the presence of a large number of files.

For instance, if you have ~250,000 files in the directory, the glob itself can take 5-6 seconds to return, followed by the latency of processing any new files/updates. As a result, it looks like you could have concurrent invocations of https://github.com/elastic/beats/blob/master/filebeat/prospector/prospector_log.go#L56 (Run()), which may end up starting multiple harvesters. Once this problem starts happening, things quickly degenerate and can result in FD exhaustion.

I made a quick/potentially naive change to protect against this and tested it by applying it to 5.0.0 alpha 5, and I didn't see the same issue emerge. The change is here: https://github.com/elastic/beats/pull/2539
Let me know if you'd like me to open a ticket for this.

Negative: no symlinks, and nowhere near that number of files (more like fewer than 30).

However, the statement:

"As a result, it looks like you could have concurrent invocations of https://github.com/elastic/beats/blob/master/filebeat/prospector/prospector_log.go#L56 (Run()), which may end up starting multiple harvesters. Once this problem starts happening, things quickly degenerate and can result in FD exhaustion."

Looks very interesting.

My issue seems to revolve around a Kafka endpoint that is reachable but does not ack due to conditions on the Kafka instance. Could that perhaps trigger the multiple invocations?

@viveklak Interesting observation. I left some comments on the PR. I think concurrent invocation of Run should not be possible, but there could be another potential problem (see the PR comment).

@viveklak @Tim_Burt I have a suspicion about how this also relates to the blocking Kafka output. See https://github.com/elastic/beats/pull/2539#issuecomment-246805786

@viveklak @Tim_Burt Will try to provide a test and a potential fix tomorrow.

This is probably water under the bridge, but in the interest of providing as much info as I can, here are some additional details.

Last night, in my Kibana, I saw approximately 125,000 identical log entries:

Note: there were no Kafka anomalies accompanying this event, although the Filebeat log file claims that there is no ACK. The "no ACK" message is common to all our Filebeat logs, however, and is not specific to this case.

Again, at nearly the same time, another line duplication:

Same web server, different log line.

There is only a single entry in the http log file that corresponds to this output.

I have a Kafka consumer that simply consumes the topic and dumps to S3. The duplicate lines show up in S3 as well.

Typical lines from the Filebeat log file for the time period in question (note the TZ difference). Lines like:
2016-09-13T18:17:49Z INFO Harvester started for file: /var/log/rails/3167.log
repeat many, many times over this period, always for the same log file name.

2016-09-13T18:17:05Z INFO Non-zero metrics in the last 30s: filebeat.harvester.running=6 libbeat.kafka.call_count.PublishEvents=654 libbeat.kafka.published_but_not_acked_events=105294 filebeat.harvester.started=6 filebeat.harvester.open_files=6
2016-09-13T18:17:09Z WARN DEPRECATED: close_older is deprecated. Use close_inactive
2016-09-13T18:17:09Z INFO Harvester started for file: /var/log/rails/3167.log
2016-09-13T18:17:09Z WARN DEPRECATED: close_older is deprecated. Use close_inactive
2016-09-13T18:17:09Z INFO Harvester started for file: /var/log/rails/16040.log
2016-09-13T18:17:19Z WARN DEPRECATED: close_older is deprecated. Use close_inactive
2016-09-13T18:17:19Z INFO Harvester started for file: /var/log/rails/3167.log
2016-09-13T18:17:19Z WARN DEPRECATED: close_older is deprecated. Use close_inactive
2016-09-13T18:17:19Z INFO Harvester started for file: /var/log/rails/16040.log
2016-09-13T18:17:29Z WARN DEPRECATED: close_older is deprecated. Use close_inactive
2016-09-13T18:17:29Z INFO Harvester started for file: /var/log/rails/16040.log
2016-09-13T18:17:29Z WARN DEPRECATED: close_older is deprecated. Use close_inactive
2016-09-13T18:17:29Z INFO Harvester started for file: /var/log/rails/3167.log
2016-09-13T18:17:35Z INFO Non-zero metrics in the last 30s: filebeat.harvester.open_files=6 filebeat.harvester.started=6 libbeat.kafka.published_but_not_acked_events=108836 filebeat.harvester.running=6 libbeat.kafka.call_count.PublishEvents=676
2016-09-13T18:17:39Z WARN DEPRECATED: close_older is deprecated. Use close_inactive
2016-09-13T18:17:39Z INFO Harvester started for file: /var/log/rails/16040.log
2016-09-13T18:17:39Z WARN DEPRECATED: close_older is deprecated. Use close_inactive
2016-09-13T18:17:39Z INFO Harvester started for file: /var/log/rails/3167.log
2016-09-13T18:17:41Z WARN client/metadata fetching metadata for all topics from broker ip-172-31-90-220.us-west-2.compute.internal:9092

2016-09-13T18:17:49Z WARN DEPRECATED: close_older is deprecated. Use close_inactive
2016-09-13T18:17:49Z INFO Harvester started for file: /var/log/rails/3167.log
2016-09-13T18:17:49Z WARN DEPRECATED: close_older is deprecated. Use close_inactive
2016-09-13T18:17:49Z INFO Harvester started for file: /var/log/rails/16040.log
2016-09-13T18:17:59Z WARN DEPRECATED: close_older is deprecated. Use close_inactive
2016-09-13T18:17:59Z INFO Harvester started for file: /var/log/rails/3167.log
2016-09-13T18:17:59Z WARN DEPRECATED: close_older is deprecated. Use close_inactive
2016-09-13T18:17:59Z INFO Harvester started for file: /var/log/rails/16040.log
2016-09-13T18:18:05Z INFO Non-zero metrics in the last 30s: libbeat.kafka.published_but_not_acked_events=103523 filebeat.harvester.open_files=6 libbeat.kafka.call_count.PublishEvents=643 filebeat.harvester.running=6 filebeat.harvester.started=6
2016-09-13T18:18:09Z WARN DEPRECATED: close_older is deprecated. Use close_inactive
2016-09-13T18:18:09Z INFO Harvester started for file: /var/log/rails/16040.log
2016-09-13T18:18:09Z WARN DEPRECATED: close_older is deprecated. Use close_inactive
2016-09-13T18:18:09Z INFO Harvester started for file: /var/log/rails/3167.log
2016-09-13T18:18:19Z WARN DEPRECATED: close_older is deprecated. Use close_inactive
2016-09-13T18:18:19Z INFO Harvester started for file: /var/log/rails/16040.log
2016-09-13T18:18:19Z WARN DEPRECATED: close_older is deprecated. Use close_inactive

Hope this helps......

Today I created a PR to fix the issue: https://github.com/elastic/beats/pull/2541 It was merged a few hours ago and should now be available in the most recent snapshot: https://beats-nightlies.s3.amazonaws.com/index.html?prefix=filebeat/

Any chance one of you could try it and see if it solves the issue?

Glad to...

Do you mean 6.0.0-alpha1 ??

2016-09-14T16:11:14.000Z 7991463 filebeat-6.0.0-alpha1-SNAPSHOT-i686.rpm

Thanks for your help on this one... Greatly appreciated.

Deploying 6.0.0-alpha1 to my QA environment (compiled from the GitHub master branch source).

I will keep you posted.

NOT GOOD.
Getting garbage.

Still checking, but not good so far.

:frowning: Can you share some details on the garbage you get?

BTW: 6.0.0-alpha1 is currently the one built from master. We had to do some version changes as 5.0.0-beta1 is coming soon, but the fixes are definitely also going to be part of 5.0.