Logstash 7 (W7) issue - file input with sincedb reads a file only once


#1

Good morning,

I have a process on my server (where Logstash is installed) that generates data files in an output folder. I have configured my conf file so that Logstash 7 collects this data as soon as it is published and pushes it into an Elasticsearch server.

Below is a very simple implementation:

pipelines.yml:

  - pipeline.id: test
    path.config: "C:\\logstash-7.0.0\\config\\conf\\test.conf"

test.conf:

input {
    file {
        mode => read
        path => "C:/output/test.txt"
        file_completed_action => delete
        sincedb_path => "nul"
    }
}

output {
    stdout {}
}

When I start Logstash and then create a test.txt file, it is successfully read. But if I create it again, it is never read anymore.
I tried specifying a real sincedb path and then deleting the file's contents to force the process to read my file again, but that doesn't work either.

Logs:

[2019-04-12T09:46:14,623][DEBUG][logstash.javapipeline ] Starting pipeline {:pipeline_id=>"test"}

[2019-04-12T09:46:14,645][INFO ][logstash.javapipeline ] Starting pipeline {:pipeline_id=>"test", "pipeline.workers"=>4, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50, "pipeline.max_inflight"=>500, :thread=>"#<Thread:0x118fa46a run>"}

[2019-04-12T09:46:15,296][INFO ][logstash.javapipeline ] Pipeline started {"pipeline.id"=>"test"}

[2019-04-12T09:46:15,342][DEBUG][logstash.javapipeline ] Pipeline started successfully {:pipeline_id=>"test", :thread=>"#<Thread:0x118fa46a run>"}

[2019-04-12T09:46:15,343][DEBUG][org.logstash.execution.PeriodicFlush] Pushing flush onto pipeline.

[2019-04-12T09:46:15,453][INFO ][filewatch.observingread ] START, creating Discoverer, Watch with file and sincedb collections

[2019-04-12T09:46:15,550][INFO ][logstash.agent ] Pipelines running {:count=>1, :running_pipelines=>[:test], :non_running_pipelines=>[]}

[2019-04-12T09:46:15,596][DEBUG][logstash.agent ] Starting puma

[2019-04-12T09:46:15,649][DEBUG][logstash.agent ] Trying to start WebServer {:port=>9600}

[2019-04-12T09:46:14,743][DEBUG][org.logstash.config.ir.CompiledPipeline] Compiled output

P[output-stdout{}|[str]pipeline:11:5:```

stdout {}

```]

into

org.logstash.config.ir.compiler.ComputeStepSyntaxElement@c96a6eaa

[2019-04-12T09:46:14,728][DEBUG][org.logstash.config.ir.CompiledPipeline] Compiled output

P[output-stdout{}|[str]pipeline:11:5:```

stdout {}

```]

into

org.logstash.config.ir.compiler.ComputeStepSyntaxElement@c96a6eaa

[2019-04-12T09:46:15,873][DEBUG][logstash.api.service ] [api-service] start

[2019-04-12T09:46:16,306][INFO ][filewatch.readmode.handlers.readfile] buffer_extract: a delimiter can't be found in current chunk, maybe there are no more delimiters or the delimiter is incorrect or the text before the delimiter, a 'line', is very large, if this message is logged often try increasing the `file_chunk_size` setting. {"delimiter"=>"\n", "read_position"=>0, "bytes_read_count"=>12, "last_known_file_size"=>12, "file_path"=>"C:/output/test.txt"}

[2019-04-12T09:46:16,322][INFO ][logstash.agent ] Successfully started Logstash API endpoint {:port=>9600}

[2019-04-12T09:46:16,368][DEBUG][filewatch.sincedbcollection] writing sincedb (delta since last write = 1555055176)

[2019-04-12T09:46:16,415][DEBUG][logstash.inputs.file ] Received line {:path=>"C:/output/test.txt", :text=>"CONTENT TEST"}

Then it repeatedly shows these logs:

[2019-04-12T09:46:28,203][DEBUG][logstash.instrument.periodicpoller.cgroup] One or more required cgroup files or directories not found: /proc/self/cgroup, /sys/fs/cgroup/cpuacct, /sys/fs/cgroup/cpu

[2019-04-12T09:46:28,528][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ParNew"}

[2019-04-12T09:46:28,529][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ConcurrentMarkSweep"}

Thanks for your help. :slight_smile:


#2

A specific way I found to counter this weird behavior is to dynamically generate my data file names with a random hash, to ensure Logstash has never already read a same-named file.

For instance:

Generate the file "test-{md5_of_random_number}" or, even simpler, "test-{base64_of_timestamp}", then change the path parameter like this:

path => "C:/AnalysisProject/DataCollector/output/test-*.txt"

It works well because I'm able to generate the files with a process I've created, which is not necessarily the case for everyone.
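As a sketch of this workaround (the directory and helper name here are purely illustrative, not part of any real tool):

```python
import hashlib
import os

def unique_output_path(output_dir):
    """Build a name like 'test-<md5_of_random_bytes>.txt' that is
    effectively never reused, so the file input always sees a
    brand-new file identity."""
    token = hashlib.md5(os.urandom(16)).hexdigest()
    return os.path.join(output_dir, f"test-{token}.txt")

# Each call yields a distinct path, all matching the test-*.txt glob.
print(unique_output_path("output"))
print(unique_output_path("output"))
```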

If someone has the time, I'd like to know whether this issue occurs only on Windows, or whether it occurs because of the "nul" value for "sincedb_path".


(Guy Boertje) #3

I think you can use C:/nul to actually refer to the Windows nul device instead of letting JRuby/Java guess what you mean. OTOH I don't know about PowerShell and whether that applies to your case.


#4

But is this working as expected? With

file {
    mode => read
    path => "/tmp/test.txt"
    file_completed_action => delete
    sincedb_path => "/dev/null"
}

If I "echo Foo > /tmp/test.txt" then I get an event and the file is deleted. If I then "echo Foo > /tmp/test.txt" a second time then I see a

[TRACE][filewatch.discoverer     ] discover_files {"count"=>1}

but it does not re-read the file. I assume it is looking in the in-memory sincedb and deciding it has already read the file. Obviously if I restart logstash then it does read it again since the in-memory sincedb is not persisted.


(Guy Boertje) #5

True, but the "inode" would have to be the same, as that is what the sincedb key is in the in-memory db (it's just a KV dictionary). IOW it is not the file name that is used as the key.

On windows the "inode" should be different for each file created.
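A toy model of why keying on the inode (not the file name) makes a recreated same-named file invisible to the plugin. This is only an illustrative sketch with invented names, not the plugin's actual code:

```python
# Toy model of the in-memory sincedb: a dict keyed by inode,
# mapping each identity to the number of bytes already read.
sincedb = {}

def should_read(inode, size):
    """Return True if this identity still has unread bytes."""
    return sincedb.get(inode, 0) < size

def mark_read(inode, size):
    sincedb[inode] = size

# First test.txt, inode 104: read, then deleted by the plugin.
assert should_read(104, 12)
mark_read(104, 12)

# A recreated test.txt that reuses inode 104 is skipped, because
# the key says this content was already consumed.
assert not should_read(104, 12)

# A recreated file with a fresh inode (as on Windows) is read.
assert should_read(105, 12)
```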


#6

Thanks for your insights, it's interesting to get an overall idea of the plugin's mechanism. I suspected some kind of in-memory sincedb behaviour too; now I have confirmation that's the case.

From @Badger's answer, this issue seems to occur even on Linux (given the use of "/dev/null" in his example). Unfortunately I can't test this, as I don't have a Linux server at hand for the moment.

Otherwise, I tried something else:

input {
    file {
        mode => read
        path => "C:/output/test.txt"
        file_completed_action => log
        file_completed_log_path => "C:\output\log"
        sincedb_path => "C:\output\.test"
        sincedb_clean_after => "10 seconds"
    }
}

Here, the expected behaviour should be: read (from the beginning) the test.txt file every 10 seconds (approximately, I guess).

[2019-04-17T09:20:38,540][DEBUG][logstash.inputs.file     ] config LogStash::Inputs::File/@sincedb_clean_after = 10.0

But neither the sincedb file nor the in-memory sincedb is cleaned:

  • the sincedb file is never cleared
  • if I manually clear the sincedb file, test.txt is never read again (so the in-memory sincedb is not cleared either)

The last logs including the sincedb annotation are:

[2019-04-17T09:20:42,006][DEBUG][filewatch.sincedbcollection] writing sincedb (delta since last write = 1555485642)

[2019-04-17T09:20:42,013][TRACE][filewatch.sincedbcollection] sincedb_write: to: C:\output\.test

There is no trace of any cleaning step.

All in all, sincedb_clean_after should reset both the in-memory and the file sincedb to make this final workaround possible:

input {
    file {
        mode => read
        path => "{path to resource}"
        file_completed_action => delete
        sincedb_path => "{path or null equivalent for the corresponding OS}"
        sincedb_clean_after => "{reasonably short value}"
    }
}

(Guy Boertje) #7

The in-memory db is filled from the sincedb file once - at startup. It does not track the file's ongoing state; while LS is running it is write-only.

As far as the sincedb_clean_after setting is concerned, the in-memory db is a Key/Value store; the value is an object with state, made up of A) the bytes-read position, B) the timestamp of when the position last changed (i.e. the last known point at which data was successfully sent to the pipeline) and C) a pointer to the watched_file instance as discovered. The value is considered "expired" when the timestamp + sincedb_clean_after evaluates as earlier than the (each) time that the sincedb is serialised to the file. Or in English: when I come to persist this record, is it still inside the retention period?
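If I read this description correctly, the expiry check performed at serialisation time could be sketched like this (an illustrative model; the names are mine, not the plugin's):

```python
import time

class SincedbRecord:
    """Toy model of one sincedb value: position plus the timestamp
    of the last position change."""
    def __init__(self, position, last_changed_at):
        self.position = position
        self.last_changed_at = last_changed_at

def is_expired(record, clean_after_seconds, now=None):
    """A record has expired when, at the moment we try to persist
    it, the last position change is older than sincedb_clean_after."""
    now = time.time() if now is None else now
    return record.last_changed_at + clean_after_seconds < now

# Position last moved 30 s ago, clean_after = 10 s: expired.
old = SincedbRecord(position=12, last_changed_at=time.time() - 30)
assert is_expired(old, 10)

# Position moved just now: still inside the retention period.
fresh = SincedbRecord(position=12, last_changed_at=time.time())
assert not is_expired(fresh, 10)
```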

The next question then is: when does the sincedb actually get serialised? As the file input is single threaded, persistence is attempted at a certain point in the read loop - at the end of reading a "chunk", where a chunk is a block of bytes set via file_chunk_size (defaults to 32K), or less if the file has fewer than 32K bytes yet to read. This "yet to read" is important - if LS starts up and all bytes have been read (or all discovered files have start_position => "end" and no new data is written) then the sincedb is not persisted, as no changes have taken place. Note, there is a one-off sincedb write when the file input shuts down. Also note that the sincedb_collection object (the in-memory db implementation) tracks, via a boolean, whether any record in the collection has changed - this boolean controls whether the collection (db) is persisted as a whole. Lastly there is the sincedb_write_interval setting, which sets the minimum time between sincedb writes.
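The persistence gating described above (changed-flag plus minimum write interval) can be modelled roughly like this; again the class and method names are invented for illustration:

```python
class SincedbCollection:
    """Toy model: the db is persisted only when something changed
    AND the minimum write interval has elapsed."""
    def __init__(self, write_interval):
        self.write_interval = write_interval
        self.changed = False
        self.last_write = 0.0
        self.writes = 0

    def record_change(self):
        self.changed = True

    def maybe_write(self, now):
        if not self.changed:
            return False  # nothing to persist
        if now - self.last_write < self.write_interval:
            return False  # inside the minimum write interval
        self.writes += 1
        self.last_write = now
        self.changed = False
        return True

db = SincedbCollection(write_interval=15)
assert not db.maybe_write(now=100)  # no change -> no write
db.record_change()
assert db.maybe_write(now=100)      # change -> persisted
db.record_change()
assert not db.maybe_write(now=105)  # too soon after last write
assert db.maybe_write(now=120)      # interval elapsed -> persisted
```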

The file input is designed and tries very hard to read and submit content only once.


#8

Thank you @guyboertje for this non-trivial but really accurate explanation. I understand that a single plugin can't serve every use case. A good feature for LS could be a plugin able to read files linearly (matching a filename pattern, as is the case with this plugin), without any db storage.

I would like to focus on one last thing to properly understand.
You said previously that the Windows inode should be different for each newly created file. Does that mean the inode has to be different, or that the inode merely can be different? It's a little bit confusing, so let me explain.

In this context, we consider there is no sincedb (sincedb_path => "{null path}", or we manually remove the file each time a new file is read by Logstash). When a file is read by LS, it is removed (file_completed_action => delete).
If the inode is used as the key in the memory storage, it has to be different to make sure every newly created "test.txt" file (with a unique inode) will be properly read. So, if my path is "test*.txt", LS will correctly read (only once) "test.txt", "test2.txt", and so on, because they have different inodes.
Is that correct?

So, I made some tests to understand how Windows and Linux manage their inodes.

On Windows: (using fsutil file queryfileid {file_name})

  • create "test.txt" file gives the ID "0x00e5000000041095"
  • remove the file then create it once again gives another ID

Conclusion: a same-named file always has a different "inode".

On Linux: (using stat {file_name})

  • create "test.txt" file gives the inode "104"
  • remove the file then create it once again gives the inode "104"
  • remove the file, then create "hop.txt" then "test.txt" gives:
    • the inode "104" for "hop.txt"
    • the inode "106" for "test.txt"

Conclusion: a same-named file does not necessarily have the same inode.

So, I should be able to read a same-named file with a different inode on Windows, am I wrong? And it could also work on Linux in some cases.
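These checks can also be reproduced with Python's `os.stat`: `st_ino` is the POSIX inode number, and on Windows it should correspond to the NTFS file ID that `fsutil file queryfileid` reports. Whether the identifier is reused after delete-and-recreate depends on the filesystem, so no equality is asserted here:

```python
import os
import tempfile

def identity(path):
    """The identifier the file input would key on: the inode number
    (or the file ID on Windows)."""
    return os.stat(path).st_ino

with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "test.txt")

    with open(p, "w") as f:
        f.write("first\n")
    first = identity(p)

    os.remove(p)
    with open(p, "w") as f:
        f.write("second\n")
    second = identity(p)

# Same name, but possibly the same or a different identity: on many
# Linux filesystems the freed inode is reused immediately, while on
# Windows the recreated file typically gets a new file ID.
print(first, second)
```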


(Guy Boertje) #9

I'll try to explain.

The file input (and Filebeat) is really concerned with reading file-based content (preferably only once). As humans we can (easily) distinguish one piece of content from another, but not so the code.
In order to identify this content, to track whether we have seen it before or not, we have to pick some identifier. We know that file names can change (via rotation, for example), so we can't use the filename. We know that an inode is associated with a collection of file segments and, once created, can't be changed without deletion and recreation. So while a file remains in the file system, the inode can be used as an identifier. However, all is not perfect, because inodes can be reused once they are freed up through a file deletion; so if we hold on to sincedb entries indefinitely, eventually we will see the same inode being used again for new/different content, but the file input will think it is the same content and not re-read it.

Based on this description, you are correct that there is a flaw in the sincedb retention behaviour when file_completed_action => delete is specified. In fact there is an issue on this in the file input repo; I just have not had time to fix it.


#10

Thanks a lot for all these clarifications, it was interesting to get more information about the plugin's behaviour. I hope a fix will be available in a future version. :slight_smile: