Filebeat rc1 resends full data set data upon random restart (registry file not updated properly)?

shog · October 21, 2016, 8:00am

Hi,

Recently I've reported Filebeat beta1 resends random data upon every restart (registry file not updated properly?) where we didn't really reach a clear conclusion, but I was able to mitigate problem by removing clean_removed: true. But now it looks like there is another bug in a matter:

Repeatable scenario after upgrade to rc1:

several boxes, all with the same configuration deployed by configuration management,
filebeat is restarted at all of them by daily cron, at roughly the same time in the day,
files in prospector paths meet the same criteria at all servers (similar volume, similar number of files, no files older than X (where is X is <10 ) hours)
after restarts filebeat instance at RANDOM server might decide to rescan/resend entire data set from scratch. All other servers/filebeats behave properly, I see no key here.
there is nothing in configuration that could trigger rescan. there is only:

      clean_removed: false
      ignore_older: 25h

Few additional facts:

dataset processed by every filebeat instance is rather huge; few dozens of GBs collected from few thousands of files,
multiple prospectors in use, but not overlapping
daily cron can call "service filebeat restart" twice in very short time, one after another,
servers are rather busy,
just by hoping to them I've noticed that sometimes registry file is not updated for several minutes, despite thousands of events being processed by said node,

Any help appreciated!

tudor · October 21, 2016, 10:52am

It sounds like the registry file not being updated is a direct cause for duplicates on Filebeat restarts. Can you grep the logs for any "registrar" errors?

Do you have your instances in debug mode? If not, what would help is to add -d registrar which will enable debug mode for the registrar only, which might help us figure out what went wrong next time it happens.

How does the rest of the system looks like? I'm interested because there are scenarios where what you are seeing would be expected. For example: Filebeat writes to the registry file only after the next stage (usually Logstash) confirmed that the data was received. If for whatever reason these ACK are arriving very late, and Filebeat is restarted just before the ACK is received, it will re-send that part of the log files.

Sorry if you already provided this info in other tickets, i wasn't closely following all of them.

shog · October 21, 2016, 11:37am

hi @tudor,

It's truly multiple filebeats -> cluster of logstashes -> ES.

All filebeat nodes talk to the same set of logstashes (with loadbalance: true and publish_async: true). I have logstash,output.timeout in filebeat increased to 300s and congestion_threshold set to 90 sec in beats input on LS side.

Unfortunately logs itself very already rotated out, but I'll try to catch this problem tomorrow. But from what I was able to check when I spotted this issue there was completely no registrar logs -- so this is very similar to what I've reported in Filebeat beta1 resends random data upon every restart (registry file not updated properly?)

One ultra important thing however: as being said, it only happens on some (usually single) filebeat nodes and we're talking about entire data set being replayed (it's always full set) -- in case of today that was data from last 8hrs worth 30M of events, so I just have virtually no capacity to "buffer" that amount of events on logstash side without acking.

Also before filebeat node was restarted all events coming from it were processed and stored in ES without any lag (in less than 2s).

ruflin · October 24, 2016, 3:04pm

Is there a chance that you could copy the state of the registry file to a second file between the restarts? The above could be explained if the registry file would be corrupt (for whatever reason). It kind of reminds me of https://github.com/elastic/beats/pull/2546, but this is windows and winlogbeat.

We also found an issue in the migration of old states: https://github.com/elastic/beats/pull/2830 But this is more related to not clean states from the registry file.

What is the size in KB of your registry files? How many entries do you have inside? Any chance to share some log files (only INFO level is also fine) for example via email?

Out of curiosity: What is the reason you restart filebeat that frequently? If you want filebeat to shutdown nicely (not having duplicated events) you should also have a look at the shutdown_timeout config option.

shog · October 24, 2016, 3:35pm

Hi @ruflin

My registry files are actually quite huge as each filebeat operates on 2k new files created every day. The biggest registry file I see on rc1 cluster is almost 5MB big and it has a state fo 20k files inside.
Because of other issue I've encountered in Filebeat beta1 resends random data upon every restart (registry file not updated properly?) - #7 by shog I'm only using clean_inactive: 36h, with clean_removed: false,
As you can guess from my registry size I have tons of remarks of old files that are long gone, but filebeat is still complaining about them with "$file should have been dropped, but couldn't as state is not finished.". Is there any way to enforce cleanup of those?
I restart filebeat two times daily after my file rotation script finish and I do have shutdown_timeout: set to 60 already. Filebeat never requires that long to stop.
I will try to prepare data and share it over email.

I believe I have a nodes where I fully nuked registry file and above issue occured again.

ruflin · October 25, 2016, 7:40am

As you already use shutdown_timeout I would not expect filebeat to be interrupted in the middle of writing the registry file. So size should also not matter here. The issue I linked above should solve your "should have been dropped" issue. That will mean also your registry file will become smaller hopefully.

The part I still don't understand is why do you need to restart filebeat after file rotation?

For the cleaned registry files: Did you already test the most recent nightly build on it?

ruflin · October 25, 2016, 7:42am

Side note: Lets also dig deeper again into the clean_removed issue as this should definitively work and if not we must get it working / fixed.

shog · October 27, 2016, 11:30am

FYI this is happening with 5.0 GA too. I even started with clean registry on my test cluster and made a filebeat restart few hours later -- 40% of nodes started to rescan data from beginning.

shog · October 27, 2016, 11:51am

Actually situation looks just bad now. All from 7 test nodes running has stopped updating registry file completely shortly after they were restarted. All nodes operate completely fine except that, shipping millions of events, but I guess they gonna resend everything upon next restart.

Current time: 11:47 UTC,
node1: process restart at 05:54, last on-disc registry file update: 06:00
node2: process restart at 08:30, last on-disc registry file update: 08:35
node3: process restart at 05:05, last on-disc registry file update: 05:06
node4: process restart at 07:32, last on-disc registry file update: 07:32
node5: process restart at 11:20, last on-disc registry file update: 11:21
node6: process restart at 05:40, last on-disc registry file update: 05:50
node7: process restart at 04:43, last on-disc registry file update: 04:47

Could it be that previous instance of filebeat was still "locking" registry before it was fully stopped and new one couldn't access it properly?

ruflin · October 27, 2016, 1:01pm

That doesn't sound good

I just opened a PR a few hours ago to exit in case there are problem with loading the state: https://github.com/elastic/beats/pull/2868 Improved error handling for the registrar is already in the master branch but it didn't make it into 5.0.0 release.

Any chance you could test the current nightly builds? https://beats-nightlies.s3.amazonaws.com/index.html?prefix=filebeat/ The reason I ask is because this could give us some more indication on what actually goes wrong with the registry.

Is there a chance that more then once instance of filebeat are running at the same time on the system?

BTW: I still didn't understand why do you restart filebeat so often?

shog · October 27, 2016, 3:18pm

Not intentionally. I have no configuration to runmultiple filebeat instances.

But what I've observed is :

I have shutdown_timeout: 60s
"service filebeat restart" needs just a couple of seconds to finish (5-6sec?)
after "service filebeat restart" there is some time where two filebeat instance are visible running.
"The old" one is finally shutting down itself after few dozens of second (maybe longer sometimes).

also possible to repro with stop i.e.:

# service filebeat stop
Stopping filebeat:                                         [  OK  ]
# pgrep filebeat
9666

I assumed this is expected behaviour. But is it?

Well, I do that twice a day max (or upon config change, but that's quite rare). Anyway this shouldn't be a crime punished by entire dataset being re-imported

I will try with nightlies.

shog · October 27, 2016, 3:29pm

Also my registry file has all files with "ttl":-1000000000"
Is this correct?

shog · October 27, 2016, 5:55pm

Ok, this is getting really interesting. Installing nightly build resulted in insane memory consumption (several dozens of gigabytes) and no events being delivered at all. Pretty much similar to what I've seen here: Filebeat beta1: Insane memory footprint

Would you be so kind to verify the config file that I've sent you yesterday?
Maybe there is some value being misrepresented there?

ruflin · October 28, 2016, 7:43am

I had tonight an other idea what could cause these issues: What is the ulimit (max of open files) for the user that is running filebeat. Could it be that you hit here a limit?

Two filebeat instances: Is it really 2 filebeat instances or filebeat and filebeat-god? If two filebeat instances are running at the same time, this is definitively not good.

About the restart: Definitively it should work and not be punished I understand the restarts for config changes (especially at the moment with these issues) but it seemed like you automated the restarts for some reason? If there are no config changes, what is the benefit of the restart?

ttl:-1000000 is expected. It will be -1 and -2 in future release (there is PR to change that).

About the config file: One thing that kind of suprised me is max_rpocs: 19. Do you have that many CPU (if yes, cool ). You are using publish_async: true and loadbalance: true and quite a few logstash hosts. Based on your configuration which is optimized for "speed" and less memory usage I assume you expect quite a lot of events per second? How many log lines per second do you have (just a rough estimate).

What do you see in the logs of the latest nightly build?

shog · October 28, 2016, 12:31pm

There is ulimit -n 16384 injected in start() section of /etc/init.d/filebeat and comparing this to /proc/$PID/fd it doesn't seem like Filebeat is anywhere close to that limit. Yet, I do not have any real-time monitoring on this, but some ad-hoc checks don't show anything weird.
Instances: two filebeat instances are visible for some time after ~every restart. Like I said before service filebeat restart itself takes just few secs, despite having shutdown_timeout: 60s in the config. Shouldn't it wait until that instance old instance is fully stopped? I can repro this every time with both rc1 and GA.

# pgrep -l filebeat
29213 filebeat-god
29214 filebeat

# time service filebeat restart
Config OK
Stopping filebeat:                    [  OK  ]
Starting filebeat: Config OK          [  OK  ]

real	0m3.309s
user	0m0.067s
sys	0m0.054s

# pgrep -l filebeat
29214 filebeat
29563 filebeat-god
29564 filebeat

Note: anecdotal observation is that GA requires more time to stop this old instance than rc1.

max_procs -- I'm assuming this is logical CPUs, if so then yes.
The aim of this config was to sustain at least 1.5k events/sec per filebeat host, with some peaks on top.
I will give nightly another spin to grab some logs.

ruflin · October 31, 2016, 8:33am

Hitting the ulimit issue could have explained some wired behaviour with the registry file, but that doesn't seem to be the issue
About the restart: I need to check that in detail. There should not be two instances at the same time.
About restart time: Strangely there were no changes to this part of the code between RC1 and GA?
max_procs: Cool
1.5k events / sec should not be an issue as long as there are no complex multiline patterns
Looking forward to some more logs.

@shog I really appreciate the effort you put into this to find the issue. Big thank you.

ruflin · October 31, 2016, 9:04am

About the two processes: This is probably caused by the init script in combination with shutdown_timeout. It only kills filebeat-god and does not wait long enough for filebeat to also completely shutdown when shutdown_timeout is used. This leads to 2 instances writing / reading from the same registry file which is not good at all, especially on startup

Any chance you could test to restart filebeat without the init script and see if the behaviour persists? It would be nice if we could have something like TimeoutStopSec= in systemd for these init scripts, unfortunately each platform is a little bit different.

shog · November 2, 2016, 3:18pm

@ruflin

Yeah, it looks that's the case. I've modified restart to ensure that new instance won't start until old one finished and I haven't seen any occurrence of this problem anymore.

ruflin · November 3, 2016, 8:37am

@shog Thanks for testing and great to hear.

Can you open a Github issue for the overlap restart issue or should I open one?

shog · November 8, 2016, 4:17pm

Sorry, missed your comment. I just did: https://github.com/elastic/beats/issues/2961