File input plugin always leaving files behind

Hello there.

Frankly, I've given up on the S3 input plugin entirely. We are attempting to ingest ELB access logs via S3. They're always a little behind in AWS anyway - by about 5 minutes - so keeping them real-time isn't possible regardless. We had so many serious issues with the S3 input plugin that I figured I would just build a quasi-custom solution that drops the files into a local filesystem directory -- a plain old directory on an ext4 filesystem -- when they're ready to be processed, which is what I'm doing now.
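
For context, the sync side is nothing fancy -- conceptually it is just something like this (the bucket name below is a placeholder; the destination is the directory my file input watches):

  # Illustrative only: pull finished ELB access log objects from S3 into the
  # directory that the file input reads from. The bucket name is made up.
  aws s3 sync "s3://example-elb-logs-bucket/" \
    /home/admin/logstash/incoming-logs/elb-access/ \
    --exclude "*" --include "*.log.gz"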

However, it doesn't matter. There are always, ALWAYS files left behind after some period of running. At first I thought it was due to the number of workers I was using (-w 32), so I dropped it to just two workers after we'd processed several days of data, to keep up with the new data moving forward.

Still, files would be left behind. After debugging, I discovered these files would be listed in the sincedb but not processed. I am using grok plus the metadata provided by the file input plugin to dump the filename as a field, and sure enough, a skipped file would be in the sincedb, but nothing was output to Elasticsearch or anywhere else (and no errors from logstash or Elasticsearch).

Finally, I tried to make it as easy as possible for logstash to get it right, with the following config:

  file {
        type => "elb-logs"
        sincedb_path => "/dev/null"
        file_completed_action => "log_and_delete"
        file_completed_log_path => "/home/admin/logstash/x-ingested.log"
        file_sort_by => "path"
        mode => "read"
        path => "/home/admin/logstash/incoming-logs/**/*.gz"
        start_position => "beginning"
  }

Still, ... as I am writing this, logstash has again left behind 7 files. If I Ctrl+C and then restart logstash with the exact same configuration, it suddenly "sees" the previously missed files. It certainly isn't a glob/matching issue. The sincedb is disabled. What on earth could this be?

Small update: Just wanted to add that, as I mentioned, restarting logstash results in instant processing of the files. This happens with nearly every few "batches" of files, which arrive at roughly 5-minute intervals. All files are named uniquely (and if they weren't, through some issue on AWS's side, they would already be in the file_completed_log_path, which they are not).
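
For what it's worth, a check along these lines would catch any name reuse (the paths are the ones from my config above):

  # For each file still sitting on disk, see whether a file with the same
  # basename was already recorded in the completed-files log.
  find /home/admin/logstash/incoming-logs/ -name '*.gz' -printf '%f\n' | \
    while read -r name; do
      grep -Fq "$name" /home/admin/logstash/x-ingested.log \
        && echo "name already completed once: $name"
    done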

I attempted to touch the files to see if logstash would pick them up then. No go. I just changed the max_open_files value to a very low value of only 10, since it's only 20-30 files that need processing at a time for the moment. If that helps any, I'll update again... But this is becoming quite the pain.

What are the offsets in the sincedb entries? How do they compare to the size of the file?
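
For example, something along these lines would line the two up. It assumes the sincedb lines begin with "inode major minor offset" (which is how I read the format) and that the leftover files are still under the watched directory; point the last path at wherever your sincedb_path is set:

  # Rough sketch: for each sincedb entry, find the file that currently owns
  # that inode under the watched tree and print the offset next to the size.
  while read -r inode maj min offset _; do
    file=$(find /home/admin/logstash/incoming-logs/ -inum "$inode" 2>/dev/null | head -n1)
    [ -n "$file" ] && printf 'inode=%s offset=%s size=%s %s\n' \
      "$inode" "$offset" "$(stat -c %s "$file")" "$file"
  done < /path/to/your/sincedb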

I am no longer using the sincedb; I've set the sincedb option to /dev/null. I was simply stating that, at one point -- before I set it to /dev/null (which was only about an hour before posting this; previously I had it enabled) -- the skipped files were showing up in the sincedb despite not being read.

I don't see how it will help, but I'll re-enable it, I guess, and get the offsets... The change to max_open_files didn't help.

It shouldn't take long; in the ~2 hours since I posted this, it has accumulated 52 skipped files.

Whoa, here are some more interesting details!

One: moving them to a new directory while the current logstash instance is still running doesn't help at all. That is, the directory structure is such that all files/folders live under a single directory called elb-access/, which sits under the glob'd directory incoming-logs/. So, just to see, I did:

mkdir incoming-logs/reprocess/
mv incoming-logs/elb-access/ incoming-logs/reprocess/elb-access/

That was to see if it changed anything, and it did not. This leads me to believe the files are being excluded based on some stat()-related factor, such as ctime/mtime/atime.
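
If it is stat()-related, something like this would show exactly which fields an in-filesystem rename actually touches (throwaway paths, not my real data):

  # Illustrative: compare inode and timestamps before and after a rename
  # within the same filesystem. Typically only the ctime changes.
  mkdir -p /tmp/mv-demo/reprocess
  echo "test" > /tmp/mv-demo/sample.log.gz
  stat -c 'inode=%i atime=%x mtime=%y ctime=%z' /tmp/mv-demo/sample.log.gz
  sleep 2
  mv /tmp/mv-demo/sample.log.gz /tmp/mv-demo/reprocess/sample.log.gz
  stat -c 'inode=%i atime=%x mtime=%y ctime=%z' /tmp/mv-demo/reprocess/sample.log.gz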

For reference, here are some of the files after I had performed the above mv, with our AWS account ID blanked and some IDs replaced with similar text:

incoming-logs/reprocess/elb-access/the-elb-name/AWSLogs/123456789012/elasticloadbalancing/us-east-1/2019/06/24/123456789012_elasticloadbalancing_us-east-1_app.the-elb-name.elbhashid_20190624T2300Z_127.0.0.1_1ixew534.log.gz
incoming-logs/reprocess/elb-access/the-elb-name/AWSLogs/123456789012/elasticloadbalancing/us-east-1/2019/06/24/123456789012_elasticloadbalancing_us-east-1_app.the-elb-name.elbhashid_20190624T2225Z_127.0.0.1_512uh8x8.log.gz
incoming-logs/reprocess/elb-access/the-elb-name/AWSLogs/123456789012/elasticloadbalancing/us-east-1/2019/06/24/123456789012_elasticloadbalancing_us-east-1_app.the-elb-name.elbhashid_20190624T2210Z_127.0.0.2_1adjc3ex.log.gz
incoming-logs/reprocess/elb-access/the-elb-name/AWSLogs/123456789012/elasticloadbalancing/us-east-1/2019/06/24/123456789012_elasticloadbalancing_us-east-1_app.the-elb-name.elbhashid_20190624T2130Z_127.0.0.1_4r7h9dq6.log.gz

And a stat() of the first one in the list:

  File: incoming-logs/reprocess/elb-access/the-elb-name/AWSLogs/123456789012/elasticloadbalancing/us-east-1/2019/06/24/123456789012_elasticloadbalancing_us-east-1_app.the-elb-name.elbhashid_20190624T2300Z_127.0.0.1_1ixew534.log.gz
  Size: 886075          Blocks: 1736       IO Block: 4096   regular file
Device: ca01h/51713d    Inode: 1573688     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/tc-admin)   Gid: ( 1000/tc-admin)
Access: 2019-06-24 19:02:17.875367726 -0400
Modify: 2019-06-24 19:00:04.000000000 -0400
Change: 2019-06-24 19:02:17.875367726 -0400
 Birth: -

The current server date/time was 19:29:40 when I took this stat -- so nearly 30 minutes after the file had been dropped.

They are all *.gz, with sizes ranging from 287 bytes to 1,034,850 bytes in this particular circumstance.

It really makes no sense, since a touch + mv to a new directory should basically make it a "brand new file" in every respect, as far as I know. Yet even when the two are combined just moments apart, it doesn't seem to help. But simply restarting logstash makes it process them instantly.

How can I debug this? Something insane is happening here.

Even this fails to force processing of those files:

mv incoming-logs/ incoming-logs-tmp/
find incoming-logs-tmp/ -name '*.gz' -exec gunzip {} \;
find incoming-logs-tmp/ -name '*.log' -exec gzip {} \;
mv incoming-logs-tmp/ incoming-logs/

That should result in a totally new file hash and everything. If I didn't see logstash processing new files as they are dropped in, I would say that logstash had crashed or something.

No. The file input needs to track file rotations, which involve name changes, so it tracks inode+device. Neither of these is changed by touch+mv (unless you move it onto a different mount point, i.e. a new device).
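
A quick way to convince yourself of that (the paths here are throwaway):

  # Illustrative: the inode and device numbers that the file input keys on
  # survive both a touch and an mv within the same filesystem.
  mkdir -p /tmp/identity-demo/moved
  echo "test" > /tmp/identity-demo/a.log.gz
  stat -c 'inode=%i dev=%D' /tmp/identity-demo/a.log.gz
  touch /tmp/identity-demo/a.log.gz
  mv /tmp/identity-demo/a.log.gz /tmp/identity-demo/moved/a.log.gz
  stat -c 'inode=%i dev=%D' /tmp/identity-demo/moved/a.log.gz   # same two numbers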

This is the reason I asked about the sincedb. If the offset is equal to the file length, then it appears to have read the file but not processed the data. That's one kind of bug. If the offset is zero, then it is never reading the file, which is another kind of bug, possibly one in the transitions of the state machine.

If you set sincedb_path to /dev/null, then the in-memory sincedb is not persisted to disk, and that prevents us from ruling out some issues.

We may need to go all the way to '--log.level trace' (not debug), which results in voluminous output from filewatch as it spots new files, decides when to read them, updates the sincedb, etc.
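
Something like this keeps the volume manageable (adjust the logstash binary and config paths to your install; the grep just keeps the filewatch classes of interest):

  # Run with trace logging and keep only the filewatch discoverer/sincedb
  # lines, which is where the per-file decisions show up.
  bin/logstash -f /home/admin/logstash/pipeline.conf --log.level trace 2>&1 | \
    grep -E 'discoverer|sincedbcollection' > /tmp/filewatch-trace.log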

Are filenames ever re-used? I realize you have a timestamp in there, and a random-looking id, so (modulo DST changes) I would guess not, but I do not want to make that assumption.

Oh, actually I just thought of something. If your filesystem does not use inode version numbers (and who does?), then if you delete files after processing them, the inode will get re-used when a new file is written. The file input may think that new file is a renamed version of a file it has already processed. The clue that this is happening would be the offset recorded in the sincedb being larger than the size of the next file that uses the inode. For example, if you drop a 10 KB file into the directory and the sincedb says it has already read 100 KB from that inode, then the file would most certainly be ignored. In that case, setting the sincedb_path to /dev/null plus a restart would fix it (whilst also probably resulting in other files being processed twice). You might want to read through this github issue.
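
You can see the reuse behaviour outside of logstash entirely with something like this (ext4 tends to hand the freed inode straight back, although that is not guaranteed):

  # Illustrative: delete a file, create another, and compare inode numbers.
  # On ext4 the new file frequently receives the inode that was just freed.
  mkdir -p /tmp/reuse-demo
  echo "first file" > /tmp/reuse-demo/first.log.gz
  stat -c 'first:  inode=%i size=%s' /tmp/reuse-demo/first.log.gz
  rm /tmp/reuse-demo/first.log.gz
  echo "second" > /tmp/reuse-demo/second.log.gz
  stat -c 'second: inode=%i size=%s' /tmp/reuse-demo/second.log.gz   # often the same inode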

The inode thing has to be the issue.

Here are some log lines for skipped files, found via the trace:

[TRACE] 2019-06-25 14:43:24.495 [[main]<file] discoverer - discover_files handling: {"new discovery"=>false, "watched_file details"=>"<FileWatch::WatchedFile: @filename='123456789012_elasticloadbalancing_us-east-1_app.elb-name.7bdb0490xxxxxxxx_20190625T1835Z_127.0.0.1_2yudlb9g.log.gz', @state='ignored', @recent_states='[:watched, :watched]', @bytes_read='1104048', @bytes_unread='0', current_size='1040624', last_stat_size='1040624', file_open?='false', @initial=false, @sincedb_key='1573908 0 51713'>"}

The file size is 1040624 bytes as stated, and given that it says it has read 1104048, I am assuming this is due to inodes being reused?
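
To double-check which file currently owns that inode, I suppose something like this would do it:

  # Look up whichever file under the watched tree currently carries the inode
  # from the sincedb key '1573908 0 51713' and print its name and size.
  find /home/admin/logstash/incoming-logs/ -inum 1573908 \
    -exec stat -c '%n inode=%i size=%s' {} \;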

The filenames should NEVER be reused, ever.

Also, I do have my sincedb set to /dev/null (sincedb_path => "/dev/null", to be exact). So what am I missing here?

I really cannot stress enough how frequently files are skipped. At a minimum, 1 out of every 100 ends up skipped -- probably more like 1 out of every 20.

Also -- immediately after submitting this reply, I went ahead and grepped the log for that sincedb key of "1573908 0 51713". Here is the result:

[TRACE] 2019-06-25 14:31:52.199 [[main]<file] discoverer - discover_files handling: {"new discovery"=>true, "watched_file details"=>"<FileWatch::WatchedFile: @filename='123456789012_elasticloadbalancing_us-east-1_app.elb-name.7bdb0490xxxxxxxx_20190625T1830Z_52.0.160.145_5z5m3j1p.log.gz', @state='watched', @recent_states='[:watched]', @bytes_read='0', @bytes_unread='0', current_size='1104048', last_stat_size='1104048', file_open?='false', @initial=true, @sincedb_key='1573908 0 51713'>"}
[TRACE] 2019-06-25 14:36:23.900 [[main]<file] discoverer - discover_files handling: {"new discovery"=>true, "watched_file details"=>"<FileWatch::WatchedFile: @filename='123456789012_elasticloadbalancing_us-east-1_app.elb-name.7bdb0490xxxxxxxx_20190625T1835Z_34.206.171.194_2yudlb9g.log.gz', @state='watched', @recent_states='[:watched]', @bytes_read='0', @bytes_unread='0', current_size='1040624', last_stat_size='1040624', file_open?='false', @initial=true, @sincedb_key='1573908 0 51713'>"}
[TRACE] 2019-06-25 14:36:23.900 [[main]<file] sincedbcollection - handle_association fully read, ignoring..... {"watched file"=>"<FileWatch::WatchedFile: @filename='123456789012_elasticloadbalancing_us-east-1_app.elb-name.7bdb0490xxxxxxxx_20190625T1835Z_34.206.171.194_2yudlb9g.log.gz', @state='ignored', @recent_states='[:watched, :watched]', @bytes_read='1104048', @bytes_unread='0', current_size='1040624', last_stat_size='1040624', file_open?='false', @initial=false, @sincedb_key='1573908 0 51713'>", "sincedb value"=>#<FileWatch::SincedbValue:0x7ad390e2 @last_changed_at=1561487783.900475, @watched_file="<FileWatch::WatchedFile: @filename='678838948731_elasticloadbalancing_us-east-1_app.dealer-www.7bdb0490bfd23f0b_20190625T1835Z_34.206.171.194_2yudlb9g.log.gz', @state='ignored', @sincedb_key='1573908 0 51713, size=1040624>", @position=1104048>}

... Then the original log line above repeats many, many times. It appears that, yes, it discovers a filename and assigns it the sincedb key, and then the sincedb key effectively collides with a different filename.

I just don't understand -- it appears that setting the sincedb_path to /dev/null does not deactivate it at all, which is disappointing and quite misleading. I am going to set "sincedb_clean_after" to "1 second" and see if this will "turn it off" for me. Is this expected behaviour?

Yes.

Setting the sincedb_path to "/dev/null" prevents the in-memory sincedb from being persisted to disk across restarts. It does not disable the sincedb itself.

Specifying a small value for sincedb_clean_after can significantly reduce the window in which reuse can occur. I do not know the internals of the file input well enough to say whether it could lead to files being processed twice. For example, a file is created, and the file input notices it, so it adds a sincedb entry and puts the file in the queue to be read. If the sincedb entry is purged, is the file left in the queue? I do not know. If it is, then it might get queued twice if logstash gets backed up.

Well, I am using log_and_delete for the file_completed_action, so it really should never process anything twice. All the S3 bucket stuff is just handled by a bash script that runs alongside the instance.

Furthermore, we're using quite concise fingerprinting for _id, so if something does process twice, it's definitely not the end of the world.

I guess I'll deal with it by writing a custom way to push the data using my own scripts and a unix socket, or something plain like that. Otherwise, I suppose I should open an issue against this particular plugin and hope I'm not alone in the way we're using it.

@Badger, thank you for all your help. Really appreciate it!
