Duplicates events when file rotation occurs - Windows


(Bruno Lavoie) #1

Hello,

We're in some strange behavior with our Filebeat.

Here the facts:

  • filebeat 1.2.3
  • windows server (2012 r2 I think)
  • we use log4j2 for logging
  • current/active logging file is named like this: acces.log, audit.log, application.log
  • each file is rotated based on size (25mb) or day change
  • when rotated, file moved to subfolder named YYYY-MM, and filename prefixed with YYYY-MM-DD and appended %i where %i is a daily sequence number
  • each file is rotated when we cross the day at midnight, so each file contains only event from its tagged day in name

Last week we were hit by file modification date not changing for hours or at worst not changing from 00:00 (midnight).

Some search brought us to this known problem:

So we increased our close_older setting from 1h to 30h to easily cover our worst case of single daily moddate (00:00).

Also, we always ensured that our globs covers current files as rotated ones.

One of our prospector config:

filebeat:
  prospectors:

    -
      paths:
        # CURRENT FILES: d:\logs\[domain]\[managed server]\[application name]\
        - d:\logs\*\*\*\acces.log
        # ROTATED FILES: d:\logs\[domain]\[managed server]\[application name]\YYYY-MM\
        - d:\logs\*\*\*\*\*-acces-*.log

      document_type: acces

      encoding: iso8859-1
      input_type: log

      ignore_older: 168h
      close_older: 30h
      scan_frequency: 10s
      force_close_files: false

We though it will fix everything, but looking at logs surrounding midnight I can see a lot of event sent when rotation occurs. Our spool size (4096) is flushed many times, a sign that something odd is occuring.

Here's the filebeat log at midnight: https://gist.github.com/blavoie/9b060b918cb4bee8fe78c86df63a441d

When looking at Elasticsearch, I can see duplicates for the same events.

Easy to confirm the duplicates as:

  • we're on very low activity at midnight, and spool size never hit the day on our summer load
  • events are timestamped at milisecond level

So, is it a bug or misconfiguration on our side?

Thanks
Bruno


(ruflin) #2

Thanks for sharing all the details. I had a quick look at the log:

Not resuming rotated file:: This is good even though it is a little bit misleading. It means it found the same filename again, but it is a new file (which I think is the case.

What worries me are the following lines:

2016-06-20T00:00:06-04:00 INFO Harvester started for file: d:\logs\mpo-prod\mpo-web2-cluster-ms1\set\2016-06\2016-06-19-application-1.log
2016-06-20T00:00:06-04:00 INFO Harvester started for file: d:\logs\mpo-prod\mpo-web1-cluster-ms1\mpo-portail\2016-06\2016-06-19-application-1.log
2016-06-20T00:00:06-04:00 INFO Registry file updated. 592 states written.
2016-06-20T00:00:06-04:00 INFO Registry file updated. 593 states written.

First it starts a new harvester on a file it should already know and then it adds one more state to the registry file (instead of updating the old one). My initial guess here is that somehow the identifiers changed. How do you exactly move / rename the files? Can you check if the identifiers and the device id are identical after moving the files?


(Bruno Lavoie) #3

Thanks for your fast reply.

I asked operations to send me registry file this morning, and I'll do the same this afternoon and tomorrow morning to check files identifiers.

Current FileStateOS of my audit.log file:

      "FileStateOS": {
         "idxhi": 327680,
         "idxlo": 24804,
         "vol": 1549167686
      }

Is the "vol" value the most important one? (kind of inode under windows?)

I don't know exactly how rotation works, as it uses LOG4J2.

Rollover strategies in documentation https://logging.apache.org/log4j/2.x/manual/appenders.html#RolloverStrategies always talks about renaming...

I'm not developer on this but can ask for pointers...

I can't imagine that it does anything else than a rename, a copy incurs a lot of overhead.

Bruno


(ruflin) #4

Under windows the combination of idxhi, idxlo, vol must be unique (on linux it is inode, device). So in case one of the three changes, filebeat assumes it is a new. Based on your config I would assume the volume id should not change. Is d: a shared/mounted drive or a local disk?


(Bruno Lavoie) #5

Filebeat is deployed locally on each server, then D: is a local drive.

Remote mounts are hell..... :smiley:


(Bruno Lavoie) #6

Hello,

Updates from this morning, at first, everything seems ok.
I have compared registrar files from yesterday and this morning

From yesterday registry, current audit.log file:

   "d:\\logs\\mpo-prod\\mpo-web1-cluster-ms1\\mpo-audit\\audit.log": {
      "source": "d:\\logs\\mpo-prod\\mpo-web1-cluster-ms1\\mpo-audit\\audit.log",
      "offset": 485303,
      "FileStateOS": {
         "idxhi": 327680,
         "idxlo": 24804,
         "vol": 1549167686
      }
   },

From today registry, current audit.log file (state info changed):

   "d:\\logs\\mpo-prod\\mpo-web1-cluster-ms1\\mpo-audit\\audit.log": {
      "source": "d:\\logs\\mpo-prod\\mpo-web1-cluster-ms1\\mpo-audit\\audit.log",
      "offset": 22887,
      "FileStateOS": {
         "idxhi": 720896,
         "idxlo": 24914,
         "vol": 1549167686
      }
   },

From today registry, rotated file kept the same state informations (state info hasn't changed on rollover):

   "d:\\logs\\mpo-prod\\mpo-web1-cluster-ms1\\mpo-audit\\2016-06\\2016-06-20-audit-1.log": {
      "source": "d:\\logs\\mpo-prod\\mpo-web1-cluster-ms1\\mpo-audit\\2016-06\\2016-06-20-audit-1.log",
      "offset": 809742,
      "FileStateOS": {
         "idxhi": 327680,
         "idxlo": 24804,
         "vol": 1549167686
      }
   },

As always, we can see a lot of filebeat activity at midnight:

First: are my settings appears OK?
Second: should I enable debug logging? Which level? how?

Thanks
Bruno


(Bruno Lavoie) #7

Hello @ruflin, any new tips on this?


(ruflin) #8

@Bruno_Lavoie Sorry for the late reply. In general everything looks fine. To enable debug logging, you must start filebeat with -e -d "*".

It seems like for the file you posted above, also the harvesting didn't continue (no duplicated events). It would be interesting to compare the state for one that did, for example 2016-06-20-application-1.log.


(Bruno Lavoie) #9

Ok,

I activated debug level logging and this makes a lot of output because a lot of Publish output.
Any way to disable only publish events?

Also, we have multiple 2016-06-20-application-1.log files, from multiple applications.
They all seems to have the same problem.

Look for yesterday:


(ruflin) #10

There are different selectors which can be set for the debug output. I set all above as I'm not sure yet which one will be needed. Unfortunately it is only whitelisting and not blacklisting.

Any chance to share the ids for a file that has duplicates as you did previously above?

Alternatively you could try 5.0.0-alpha3 to see if there is still the same behaviour. Quite a few improvements were made in the registry management.


(Bruno Lavoie) #11

Hello,

Sorry for delay, my turn!

I've put in place scheduled jobs to capture registry and log files automatically before and after midnigth.

Also, I put my efforts on a acces.log for which I have a unique request identifier for each line. With this I can be out of doubt about duplicates.

When crawling logs, they're all duplicated, because it happens at each file rollover.
Yesterday afternoon:

Rollover from yesterday to today (midnight).

2016-06-29 23:59:45 Snapshot

Current

"d:\\depot_journal\\mpo-prod\\mpo-web1-cluster-ms3\\mpo-portail\\acces.log": {
    "FileStateOS": {
        "idxhi": 851968,
        "idxlo": 26050,
        "vol": 3995102152
    },
    "offset": 12258620,
    "source": "d:\\depot_journal\\mpo-prod\\mpo-web1-cluster-ms3\\mpo-portail\\acces.log"
},

Rotated

"d:\\depot_journal\\mpo-prod\\mpo-web1-cluster-ms3\\mpo-portail\\2016-06\\2016-06-29-acces-1.log": {
    "FileStateOS": {
        "idxhi": 65536,
        "idxlo": 26032,
        "vol": 3995102152
    },
    "offset": 26214552,
    "source": "d:\\depot_journal\\mpo-prod\\mpo-web1-cluster-ms3\\mpo-portail\\2016-06\\2016-06-29-acces-1.log"
},
"d:\\depot_journal\\mpo-prod\\mpo-web1-cluster-ms3\\mpo-portail\\2016-06\\2016-06-29-acces-2.log": {
    "FileStateOS": {
        "idxhi": 983040,
        "idxlo": 26044,
        "vol": 3995102152
    },
    "offset": 26214523,
    "source": "d:\\depot_journal\\mpo-prod\\mpo-web1-cluster-ms3\\mpo-portail\\2016-06\\2016-06-29-acces-2.log"
},

2016-06-30 00:00:15 Snapshot

Current file

"d:\\depot_journal\\mpo-prod\\mpo-web1-cluster-ms3\\mpo-portail\\acces.log": {
    "FileStateOS": {
        "idxhi": 720896,
        "idxlo": 26060,
        "vol": 3995102152
    },
    "offset": 5352,
    "source": "d:\\depot_journal\\mpo-prod\\mpo-web1-cluster-ms3\\mpo-portail\\acces.log"
},

Rotated

"d:\\depot_journal\\mpo-prod\\mpo-web1-cluster-ms3\\mpo-portail\\2016-06\\2016-06-29-acces-1.log": {
    "FileStateOS": {
        "idxhi": 65536,
        "idxlo": 26032,
        "vol": 3995102152
    },
    "offset": 26214552,
    "source": "d:\\depot_journal\\mpo-prod\\mpo-web1-cluster-ms3\\mpo-portail\\2016-06\\2016-06-29-acces-1.log"
},
"d:\\depot_journal\\mpo-prod\\mpo-web1-cluster-ms3\\mpo-portail\\2016-06\\2016-06-29-acces-2.log": {
    "FileStateOS": {
        "idxhi": 983040,
        "idxlo": 26044,
        "vol": 3995102152
    },
    "offset": 26214523,
    "source": "d:\\depot_journal\\mpo-prod\\mpo-web1-cluster-ms3\\mpo-portail\\2016-06\\2016-06-29-acces-2.log"
},
"d:\\depot_journal\\mpo-prod\\mpo-web1-cluster-ms3\\mpo-portail\\2016-06\\2016-06-29-acces-3.log": {
    "FileStateOS": {
        "idxhi": 851968,
        "idxlo": 26050,
        "vol": 3995102152
    },
    "offset": 90461,
    "source": "d:\\depot_journal\\mpo-prod\\mpo-web1-cluster-ms3\\mpo-portail\\2016-06\\2016-06-29-acces-3.log"
},

Then Info Log
https://gist.githubusercontent.com/blavoie/3a5eb40d0924b06d8a4d24e472475317/raw/ed39f52a6bd0ff87baf336c8e733a9367734f8c3/info-log

Am I dreaming that file state doesn't change on rotate?

I'm trying to reproduce with a tiny program using same logging library to see if I can reproduce on my workstation.
If not, then I'll try to convince OPs to put latest version.

Thanks, more on that later.
Bruno


(Bruno Lavoie) #12

Hello,

Good news, I've reproduced the behavior under Win7 with my little sample program.
Then, with 5.0.0-alpha4 this does't duplicate data on rollover.
:slight_smile:

If you want code to reproduce:

I will send the latest binary file to ops to make the update on each servers.

Just to be safe, as stated here:

And seen here:

The harvester doesn't monitor file based on moddate but now on size?
Is it safe to make close_older to something more appropriate? 30h --> 1h

Thanks
Bruno Lavoie


(ruflin) #13

Hi Bruno

I still have to take a closer look into the data above to better understand why this happened with alpha3 and not alpha4. I just wanted you to briefly notify about this issue here as I felt it could be related: https://github.com/elastic/beats/pull/1954 (not merged yet).

About your other questions will get back to you later / tomorrow.


(ruflin) #14

@Bruno_Lavoie I now had a closer look. I still don't understand to 100% what caused the issue above and how it was resolved in alpha4.

The size of the file is now used to decide if a harvester should continue harvesting a file or not. This was missing previously so I have the suspicion that this caused some double harvesting.

It is save to set close_older also to a much lower value, like 1 minute or 5 min. We are currently discussing it to set it a lower default value. The only side affect is that file which were closed and changed, will only be picked up again after scan_frequency. But normally log files which were closed, do not change over time and are only rotated.

Thanks for doing all the investigations. Please ping me if the problem occurs again.


(Bruno Lavoie) #15

Hello,

The bug is reroductible with FB 1.2.3, not with latest alpha.
Sorry for the confusion.

BL


(ruflin) #16

@Bruno_Lavoie It seems like your issue is the same that was fixed here: https://github.com/elastic/beats/pull/1954 It is not part of the nightlies: https://beats-nightlies.s3.amazonaws.com/index.html?prefix=filebeat/

The problem is that if a file rotation happens during a scan, there is some race between the file state and the file path. The probability that this happens on 1.2.3 is much higher then it was on 5.0.0-* because the state is read much more often. With the most recent fix it cannot happen anymore, because after opening the file, it is checked if the correct file was opened.


(Bruno Lavoie) #17

@ruflin
Such un nightmare for you all this logrotate stuff!

On the good news: using FB5-alpha4 doesn't send duplicate events on rollover anymore. :slight_smile:

IIRC, there's a chance of problems on rollover?
As you mentionned about https://github.com/elastic/beats/pull/1954, it's not in current alpha builds?

Target version for theses fixes?

Thanks again for you support


(ruflin) #18

Glad to hear it works now. There is still a chance for the race condition in alpha4, but to reproduce it I had to rotate 100 files multiple times in 1 second :slight_smile: It should completely fixed in the latest nightly build https://beats-nightlies.s3.amazonaws.com/index.html?prefix=filebeat/ Let me know in case you hit any issues with the nightly build.

This is planned to ship with the 5.0.0-beat1 release in a few weeks.


(system) #19

This topic was automatically closed after 21 days. New replies are no longer allowed.