Multiple file inputs cause re-reading of file

henning_l · August 21, 2024, 12:14pm

I am running an ELK stack and tailoring the Logstash configuration to ingest data from several different logs. Because the logs are formatted differently, I need to create multiple file inputs. The following is the current configuration:

input {
  file {
    mode => "read"
    path => "/usr/share/logstash/ingest_data/**/*"
    file_completed_action => "log"
    file_completed_log_path => "/usr/share/logstash/finalised_data/logstash_completed.log"
    exclude => [ "/usr/share/logstash/ingest_data/app/zsh/*" ]
  }
  file {
    mode => "read"
    path => "/usr/share/logstash/ingest_data/app/zsh/*"
    file_completed_action => "log"
    file_completed_log_path => "/usr/share/logstash/finalised_data/logstash_completed.log"
    codec => multiline {
      pattern => "\\$"
      what => "next"
    }
  }
}

However, with this configuration Logstash reads files in the directory /usr/share/logstash/ingest_data/app/zsh/ twice, even though I explicitly configured the first file input to exclude files in that directory. (I checked the file /usr/share/logstash/finalised_data/logstash_completed.log, and two entries are created when I add a single file.)

I cannot find an explanation for this other than unintentional behaviour when having multiple file inputs.

To debug the issue, I tried using only one of the file inputs at a time. With only the first input, a zsh log is not read, and with only the second input, a zsh log is read just once, as intended.

I am running a docker compose stack as described here with the version 8.7.1.

Wolfram_Haussig · August 22, 2024, 5:16am

Hello Henning,

When having multiple file inputs, you need to configure sincedb_path explicitly.

Path of the sincedb database file (keeps track of the current position of monitored log files) that will be written to disk. The default will write sincedb files to <path.data>/plugins/inputs/file NOTE: it must be a file path and not a directory path

If not set, the values of both file inputs will overwrite each other. Maybe this is the reason for your issues.

Best regards
Wolfram

henning_l · August 22, 2024, 5:58am

Hi Wolfram

Thanks for your answer. I have followed your suggestion and set sincedb_path explicitly (the field file_input is just there to make debugging easier for me):

input {
  file {
    mode => "read"
    path => "/usr/share/logstash/ingest_data/**/*"
    file_completed_action => "log"
    file_completed_log_path => "/usr/share/logstash/finalised_data/logstash_completed.log"
    exclude => [ "/usr/share/logstash/ingest_data/app/zsh/*" ]
    sincedb_path => "/usr/share/logstash/file_all"
    add_field => { "file_input" => "all" }
  }
  file {
    mode => "read"
    path => "/usr/share/logstash/ingest_data/app/zsh/*"
    file_completed_action => "log"
    file_completed_log_path => "/usr/share/logstash/finalised_data/logstash_completed.log"
    codec => multiline {
      pattern => "\\$"
      what => "next"
    }
    sincedb_path => "/usr/share/logstash/file_zsh"
    add_field => { "file_input" => "zsh" }
  }
}

However, this renders the same output. Any suggestions?

Wolfram_Haussig · August 28, 2024, 4:49am

Hello Henning,

I am not sure, but I think your exclude may be wrong. According to the docs:

Exclusions (matched against the filename, not full path). Filename patterns are valid here, too. For example, if you have

path => "/var/log/*"

In Tail mode, you might want to exclude gzipped files:

exclude => "*.gz"

This could explain why it doesn't exclude the zsh logs from the file input.

How many directories do you have under /usr/share/logstash/ingest_data? Would it be possible to list them separately under path, e.g.:

file {
  mode => "read"
  path => [
    "/usr/share/logstash/ingest_data/dir1/**/*",
    "/usr/share/logstash/ingest_data/dir2/**/*",
    "/usr/share/logstash/ingest_data/dir3/**/*",
    "/usr/share/logstash/ingest_data/app/not_zsh/**/*"
  ]
  file_completed_action => "log"
  file_completed_log_path => "/usr/share/logstash/finalised_data/logstash_completed.log"
  exclude => [ "/usr/share/logstash/ingest_data/app/zsh/*" ]
}

Best regards
Wolfram

henning_l · August 28, 2024, 1:44pm

Hi Wolfram

After some additional testing, I have realised that the issue is not with multiple file inputs - my bad. When using only the initial file input, logs in the excluded directory are still processed in the pipeline even though they are not supposed to be.

With that out of the way (and maybe I should create a new topic, since the title does not describe the true issue), your suggestion of splitting the path into several paths is doable, but I am planning my project to be widely extendable, i.e. I want to have many directories and subdirectories. Thus, it is not a suitable solution for me.

In my understanding of the docs, it is possible to exclude all files in a subdirectory even though its parent directory is part of the path option.

I have stripped my Logstash configuration down as far as I could, and it still reads files located in /usr/share/logstash/ingest_data/app/zsh/:

input {
  file {
    mode => "read"
    path => [ "/usr/share/logstash/ingest_data/**/*"]
    exclude => ["/usr/share/logstash/ingest_data/app/zsh/*"]
    sincedb_path => "/usr/share/logstash/file_all"
  }
}

filter { }

output {
  elasticsearch {
    index => "logstash-%{+YYYY.MM.dd}"
    hosts => "${ELASTIC_HOSTS}"
    user => "${ELASTIC_USER}"
    password => "${ELASTIC_PASSWORD}"
    cacert => "certs/ca/ca.crt"
  }
  stdout { codec => rubydebug }
}

Adding a log file to the aforementioned directory sends it through the pipeline and prints the following to standard output:

...
{
  "message" => ": 1724069235:0;tail -n650 ~/.zsh_history >> material/.zsh_history",
  "@version" => "1",
  "host" => {
    "name" => "a480e196997b"
  },
  "log" => {
    "file" => {
      "path" => "/usr/share/logstash/ingest_data/app/zsh/zsh_history"
    }
  },
  "@timestamp" => 2024-08-28T13:27:26.167928882Z,
  "event" => {
    "original" => ": 1724069235:0;tail -n650 ~/.zsh_history >> material/.zsh_history"
  }
}
...

I do not know whether this helps in finding a solution?

Cheers, Henning

Badger · August 28, 2024, 3:21pm

I do not expect that to work. As @Wolfram_Haussig said, the exclude option takes a filename pattern. If you look at the source code you will see that it is calling basename so all of the directory names are discarded before the comparison is made.

My understanding is that fnmatch? requires the whole pattern to match. Thus basename will reduce /usr/share/logstash/ingest_data/app/zsh/foo.txt to foo.txt, and foo.txt does not match /usr/share/logstash/ingest_data/app/zsh/*. In terms of the source, watched_file.pathname would match, but watched_file.pathname.basename does not.
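
The comparison described above can be reproduced in plain Ruby. This is only a minimal sketch: the helper name `excluded?` is hypothetical and is not the plugin's code; it assumes, as described in this thread, that each exclude pattern is tested against the basename of the file with `fnmatch?`.

```ruby
require "pathname"

# Hypothetical helper modelling the behaviour described above:
# only the basename of the file is tested against each exclude pattern.
def excluded?(path, patterns)
  name = Pathname.new(path).basename
  patterns.any? { |pattern| name.fnmatch?(pattern) }
end

file = "/usr/share/logstash/ingest_data/app/zsh/foo.txt"
dir_pattern = "/usr/share/logstash/ingest_data/app/zsh/*"

puts Pathname.new(file).fnmatch?(dir_pattern)  # full path vs. pattern  => true
puts excluded?(file, [dir_pattern])            # basename vs. pattern   => false
puts excluded?(file, ["*.txt"])                # basename vs. "*.txt"   => true
```

Under the same assumption, a filename pattern such as "zsh_history" would match the file shown in the output earlier in the thread, but it would also match a file of that name in any other directory covered by path.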

henning_l · August 29, 2024, 11:18am

Finally, I understand how it works. Thank you both, @Wolfram_Haussig and @Badger. The capital 'I' in "In Tail mode..." in the docs misled me into believing it described two different examples, but now I understand it is a single example with a path (path => "/var/log/*") and a pattern (exclude => "*.gz") for excluding gzipped files.
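
The docs example (a path glob combined with a filename exclude pattern) can be modelled in two steps: expand the glob, then drop every file whose basename matches an exclude pattern. The Ruby below is a simplified model under that assumption, not the plugin's implementation; the function name `discover` and the file names are made up for the demonstration, and a temporary directory stands in for /var/log.

```ruby
require "pathname"
require "tmpdir"
require "fileutils"

# Simplified model of discovery (an assumption, not the plugin's code):
# expand the path glob, then reject files whose basename matches
# any of the exclude patterns.
def discover(path_glob, exclude_patterns)
  Dir.glob(path_glob).select { |p| File.file?(p) }.reject do |p|
    name = Pathname.new(p).basename
    exclude_patterns.any? { |pattern| name.fnmatch?(pattern) }
  end
end

Dir.mktmpdir do |root|
  # Hypothetical file names, created only for this demonstration.
  %w[syslog auth.log syslog.1.gz].each { |f| FileUtils.touch(File.join(root, f)) }

  kept = discover(File.join(root, "*"), ["*.gz"])
  puts kept.map { |p| File.basename(p) }.sort.inspect
  # => ["auth.log", "syslog"]
end
```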

I marked @Wolfram_Haussig's answer as the solution, as it best described a solution to my question.
