In what order does Logstash read files?

Hi All,

I would like to know in what order the logstash file input reads files from a directory.

I have files which are created based on a size limit, and each new log file has its creation date appended to its name.

Ex:
file-01012021-000000
file-01012021-100000 <----- this is created when the above file reaches 10 MB.
file-01012021-150515

And so on.....

Is there a way I can make logstash read the files in order? I am using --pipeline.workers 1 to get ordered data, and I am not sure whether the files will be read in order if I point the input file path to "/home/logs/file*".

Please advise.

Regards.

See the file_sort_by and file_sort_direction options.


file_sort_direction => "asc"
file_sort_by => "path"

Both of them worked as expected.

With pipeline.workers 1,

sincedb_path => "/some/dir/logdb"
start_position => "end"
file_sort_direction => "asc"

this setup will always read one file completely, in order of name (asc order), and then move on to the next file.
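
Putting those options together, a minimal sketch of the whole input as I understand it (the path and sincedb location are just the placeholders from above):

input {
  file {
    path => "/home/logs/file*"
    sincedb_path => "/some/dir/logdb"
    start_position => "end"
    file_sort_by => "path"
    file_sort_direction => "asc"
  }
}

run with --pipeline.workers 1.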

Am I getting this correct, @Badger?

Sounds right. The default behaviour is to read up to 4,611,686,018,427,387,903 chunks of 32 KB from a file before it moves on to the next one. So it does indeed read each file completely.
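
In file input terms, those defaults are the file_chunk_size and file_chunk_count options. Spelled out explicitly (the values shown are the documented defaults, so you would not normally need to set them):

file {
  path => "/home/logs/file*"                  # placeholder path
  file_chunk_size => 32768                    # bytes per read, i.e. 32 KB
  file_chunk_count => 4611686018427387903     # effectively unlimited, so each file is read to completion
}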

Do you really want start_position => end?

I have two stages in mind.

First: I need to index all the existing logs (in order of file name, as mentioned above) with start_position set to beginning.
Second: once the indexing is completed, I will change start_position to end and continue from where I left off (as the sincedb is already tracking the position).

As new logs are created, I should be able to continue indexing each new log file until it ends, and the cycle will continue.

What are your thoughts on this?

I would use beginning and just let the sincedb track things.
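
That is, in the input sketched above, just change one line and leave everything else as-is:

start_position => "beginning"    # the sincedb remembers how far each file has been read across restarts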


Will give it a try. I think this removes the need to run the config twice.

On another note,

fingerprint {
  key => "myrandomkey"
  method => "SHA1"
  source => "message"
  target => "[fingerprint]"
  base64encode => true
}

In source => "message", can the value be any field? Like "host", or any other field which is present throughout the index?

The source option of a fingerprint filter takes an array of field names, so you can do something like

source => [ "[host][name]", "message" ]

You will want to use the concatenate_sources option if you are fingerprinting multiple fields, otherwise you just get a fingerprint of the last source.
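
For example, a sketch combining both options (SHA256 picked arbitrarily here; the field names are from the example above):

fingerprint {
  method => "SHA256"
  source => [ "[host][name]", "message" ]
  concatenate_sources => true
  target => "[fingerprint]"
}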

Let me share an example, @Badger

input { }

filter {

  fingerprint {
    key => "number13"
    method => "SHA1"
    source => "host"
    target => "[fingerprint]"
    base64encode => true
  }

  if [path] =~ "dev1" {
    mutate {
      replace => { "host" => "dev1" }
    }
  } else if [path] =~ "dev2" {
    mutate {
      replace => { "host" => "dev2" }
    }
  }

  # some grokking here

  mutate {
    remove_field => [ "info", "@version", "message" ]
  }

}

output {
  stdout { codec => rubydebug }
  elasticsearch {
    hosts => "localhost:9200"
    index => "devs"
    document_id => "%{[fingerprint]}"
  }
}

When I do this, only one hit is recorded in ES, and each time it gets replaced by the next updated message.

However, if I keep source => "message" and DO NOT INCLUDE "message" in mutate-remove_field, then everything is indexed.

The reason I want to remove "message" is to stay within the size limitations of the HDD.

Should the fingerprint that hashes [host] come after the conditionals that set it?

If I am following correctly, are you suggesting that I should apply the fingerprint AFTER the conditionals that set "host"?

If you want the fingerprint to be the digest of dev1 or dev2 then yes, it must be after the conditionals. All the events probably have an id based on the digest of the name of the host that logstash is running on.

BTW, key is not required. It was at one time, when the filter always created keyed digests. Now, if you omit the key option it will just create a plain hash of the source field.

Also I would recommend using SHA256. SHA1 is no good for anything these days.
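
Putting those suggestions together, a sketch of the reordered filter section (only the relevant parts, grok omitted, field values as in your example):

filter {
  if [path] =~ "dev1" {
    mutate { replace => { "host" => "dev1" } }
  } else if [path] =~ "dev2" {
    mutate { replace => { "host" => "dev2" } }
  }

  # fingerprint after the conditionals, so it digests "dev1" or "dev2"
  fingerprint {
    method => "SHA256"         # no key, so this is a plain hash rather than a keyed digest
    source => "host"
    target => "[fingerprint]"
    base64encode => true
  }
}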

I tried it, and now it gives me 3 docs in ES.

And about the host: I have logs distributed into different folders named dev1....dev24. Based on the path,

ex: /path/to/dev1/logs*
/path/to/dev2/logs*
/path/to/dev3/logs*
/path/to/dev4/logs*
and so on...

if the path has "dev1" in it, the host value will be changed to dev1, and later this will be used when adding fields, like add_field => { "%{host}_temp" => "%{tempvalue}" }
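
(With 24 folders I may also replace the long conditional chain with a single grok against the path; a hypothetical sketch:

grok {
  # extract dev1..dev24 from paths like /path/to/dev7/logs-xyz
  match => { "path" => "/(?<host>dev\d+)/" }
  overwrite => [ "host" ]
}

but the conditionals above work for now.)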

Hi @Badger

The sorting of files is creating an issue for me.

I have temp.log, temp.log.0, temp.log.1 . . . . . . temp.log.10

While applying these in file { } with (-w 1):

file_sort_by => "path"
file_sort_direction => "desc"

The parsing skips from temp.log.1 to temp.log.10, and this creates a wrong time duration. See the stdout below.

{
          "time" => "0.013000011444091797",
          "path" => "/home/temp.log.1",
           "dur" => 0.013000011444091797,
    "@timestamp" => 2021-05-31T06:50:53.378Z,
          "host" => "dev2",
         "temp2" => 20.25,
         "delta" => 0.013000011444091797
}
{
          "time" => "-3028117.7120001316",
          "path" => "/home/temp.log.10",
           "dur" => 0.013000011444091797,
    "@timestamp" => 2021-04-26T05:42:15.666Z,
          "host" => "dev2",
         "temp1" => 29.15,
         "delta" => -3028117.7120001316
}

What could be wrong here?

If you create a file containing

log.0
log.1
log.2
log.10

and sort it using the UNIX sort command you will get

log.0
log.1
log.10
log.2

It looks like whatever sort function the file filter uses (probably the Ruby spaceship operator) agrees with that order.

That's a catch!

If this is the case, then I cannot use the ruby code, as the time duration values go negative ( -3000000 seconds ).

The elapsed filter would have helped if I could tag the start and end, but with values changing all the time, I cannot use it.

What do you think the way around this would be? Please advise.

I do not think there is a solution for this problem in logstash.

Based on your earlier solution for getting the time duration of each value, can we add another check for a change of path or filename too?

That code of yours checks for a change of value and subtracts timestamps to get durations. Could it also check for a change of path, and treat each new file as a separate file?

Let me know what you think.

I am a bit lost as to what the question is now.

Apologies, I was jumping from that post to this one to combine the solutions!

I tried using exclude => "temp.log.10" in the file input and it seems to work properly. I guess I will have to write another input in the same config to parse that file.
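
Something like this is what I have in mind, with two inputs in the same config (paths illustrative):

input {
  file {
    path => "/home/temp.log*"
    exclude => "temp.log.10"             # keep the badly-sorting file out of this input
    file_sort_by => "path"
    file_sort_direction => "desc"
  }
  file {
    path => "/home/temp.log.10"          # read the excluded file separately
  }
}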