Logstash read file in what order?

theirfan · May 31, 2021, 9:19pm

Hi All,

Would like to know how logstash "input" reads files from a directory.

I have files which are created based on size limit and each new log creation is appended with the creation date in it's name.

Ex:
file-01012021-000000
file-01012021-100000 <----- this is created when above file reaches 10Mb.
file-01012021-150515

And so on.....

Is there a way I can make logstash read the files in order? I am using --pipeline.workers 1 to grab ordered data and i am not sure if the files will be read in order if I point the input file path to "/home/logs/file*"

Please advise.

Regards.

Badger · May 31, 2021, 10:04pm

See the file_sort_by and file_sort_order options.

theirfan · May 31, 2021, 10:17pm

file_sort_direction => "asc"
file_sort_by => "path"

Both of them worked as expected.

With pipeline.workers 1,
sincedb_path => "/some/dir/logdb" ,
start_position => "end"
file_sort_direction => "asc"

this setup will always read one file completely -> in order of name ( asc order ) -> and then move to the next file.

Am i getting this correct ? @Badger

Badger · May 31, 2021, 10:28pm

Sounds right. The default behaviour is to read up to 4,611,686,018,427,387,903 chunks of 32 KB from a file before it moves on to the next one. so it does indeed read each file completely.

Do you really want start_position => end?

theirfan · May 31, 2021, 10:34pm

I have two instances.

First : i need to index all the logs ( as mentioned above in order of file names ) with start_position to beginning
Second : Once the indexing is completed, i will change the start_position to end and continue from where i left ( as sincedb is already tracking the position of it ).

As new logs are created, i should be able to continue indexing new log file till it ends and the cycle will continue.

What are your thoughts on this?

Badger · May 31, 2021, 10:38pm

I would use beginning and just let the sincedb track things.

theirfan · May 31, 2021, 10:45pm

Will give it a try. I think this will discard the need to run the conf twice.

On another note,

fingerprint
{
key => "myrandomkey"
method => "SHA1"
source => "message"
target => "[fingerprint]"
base64encode => true
}

In source => "message" , "message" value can be any field? Like "host" or any other field which is present throughout the index?

Badger · May 31, 2021, 10:53pm

The source option of a fingerprint filter takes an array of field names, so you can do something like

source => [ "[host][name]", "message" ]

You will want to use the concatenate_sources option if you are fingerprinting multiple fields, otherwise you just get a fingerprint of the last source.

theirfan · May 31, 2021, 11:31pm

Let me share an example, @Badger

input { }

filter {

fingerprint
{
key => "number13"
method => "SHA1"
source => "host"
target => "[fingerprint]"
base64encode => true
}

if [path] =~ "dev1"
{
mutate
{
replace => { "host" => "dev1" }
}
else if [path] =~ "dev2"
{
mutate
{
replace => { "host" => "dev2" }
}
}

{
//some grokking//
}

mutate
{
remove_field => [ "info" , "@version" , "message" ]
}

}

output
{
stdout { codec => rubydebug }
elasticsearch
{
hosts => "localhost:9200"
index => "devs"
document_id => "%{[fingerprint]}"
}
}

When i do this, only one hit is recorded in ES, and each time, it is getting replaced by another updated message.

However, if i keep source => "message" and DO NOT INCLUDE "message" in mutate-remove_field, then everything is indexed.

The reason why i want to remove "message" is to fulfil the size limitations on the HDD.

Badger · June 1, 2021, 12:16am

Should the fingerprint that hashes [host] come after the conditionals that set it?

theirfan · June 1, 2021, 12:44am

If I am following correctly, are you suggesting that I should use fingerprint AFTER defining conditions for "hosts" ?

Badger · June 1, 2021, 1:10am

If you want the fingerprint to be the digest of dev1 or dev2 then yes, it must be after the conditionals. All the events probably have an id based on the digest of the name of the host that logstash is running on.

BTW, key is not required. It was at one time, and the filter was creating digests. Now, if you omit the key option it will just create a hash of the source field.

Also I would recommend using SHA256. SHA1 is no good for anything these days.

theirfan · June 1, 2021, 2:33am

I tried it and now, gives me 3 docs in ES.

And about the host, i have logs distributed into different folders named as dev1....dev24. Based on the path,

ex: /path/to/dev1/logs*
/path/to/dev2/logs*
/path/to/dev3/logs*
/path/to/dev4/logs*
and so on...

if path has "dev1" in it, the host value will be changed to dev1 and later, this will be used in adding fields like add_field => { "%{host}_temp" => "%{tempvalue}" }

theirfan · June 2, 2021, 11:18pm

Hi @Badger

The sorting of file is creating an issue for me.

i have temp.log, temp.log.0, temp.log.1 . . . . . .temp.log.10

While applying in file { } with (-w 1)

file_sort_by => "path"
file_sort_direction => "desc"

The parsing is skips from temp.log.1 to temp.log.10, and this creates a wrong time duration. See below stdout.

{
             "time" => "0.013000011444091797",
             "path" => "/home/temp.log.1",
    	     "dur" => 0.013000011444091797,
       "@timestamp" => 2021-05-31T06:50:53.378Z,
             "host" => "dev2",
             "temp2" => 20.25,
            "delta" => 0.013000011444091797
}
{
             "time" => "-3028117.7120001316",
             "path" => "/home/temp.log.10",
    	     "dur" => 0.013000011444091797,
       "@timestamp" => 2021-04-26T05:42:15.666Z,
             "host" => "dev2",
             "temp1" => 29.15,
            "delta" => -3028117.7120001316
}

What could be wrong here?

Badger · June 2, 2021, 11:44pm

If you create a file containing

log.0
log.1
log.2
log.10

and sort it using the UNIX sort command you will get

log.0
log.1
log.10
log.2

It looks like whatever sort function the file filter uses (probably the Ruby spaceship operator) agrees with that order.

theirfan · June 2, 2021, 11:54pm

thats a catch!

If this is the case, then i cannot use the ruby code as time duration values goes in negatives ( -3000000 ) seconds.

elapsed filter would have helped if i could tag the start and end, but with values ranging all the time, i cannot use it.

what do you think the way around this would be? Please advise

Badger · June 3, 2021, 12:24am

I do not think there is a solution for this problem in logstash.

theirfan · June 3, 2021, 2:08am

based on your earlier solution for getting time durations of each value, can we add another parameter to also check the change of path or filename too?

Like, this code of yours is checking of change of value and subtracting timestamps to get durations. If this code can also check change of path, and parse new file as separate file.

Let me know about it?

Badger · June 3, 2021, 2:16am

I am a bit lost as to what the question is now.

theirfan · June 3, 2021, 2:20am

Apologies, i was jumping from that post to this for combining the solutions!

i tried to use exclude => "temp.log.10" in the file input and seems to work properly, guess i have to make a way around to write another input in same config and parse that file.

Topic		Replies	Views
Logstash file input read order Logstash	6	1860	July 31, 2017
Analyse file in specific order Logstash	3	474	September 23, 2017
How logstash read input file? Logstash	10	20395	July 5, 2017
Sort order of files in s3 input Logstash	1	407	July 6, 2017
Logstash file input not working Logstash	8	8109	May 19, 2017

Logstash read file in what order?

Related topics