Recommended configuration for write-once files

Hi all,

I am looking for the most efficient configuration for our use case.

We have almost 2,000 files to index each day, coming from several directories.

Each file needs to be read only once as it is never updated.

I wonder how to configure ignore_older, clean_inactive and close_eof to handle this scenario with the best performance.

Another question: how should these properties be set when re-indexing all the files (several months' worth)?

Thank you.

What happens to the old files of the previous day?

For the old files, it seems that ignore_older will become tricky. One thing you could try for the old files is the -once param: run filebeat just once until it has read everything to completion.
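
Roughly like this (a minimal sketch; the config file name and the path are placeholders):

# Start with: filebeat -once -c filebeat-backfill.yml
# -once makes filebeat exit once all harvesters reach EOF.
filebeat.prospectors:
- input_type: log
  paths:
    - C:\path\to\old\*.csv   # placeholder path to the old files
  close_eof: true            # so each harvester stops at end of file and -once can finish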

Old files of previous days are left in the same directory as new files.

I didn't know about the -once param!

I think this is a good option for me: index the old files first, then start filebeat in a "regular flow" way.

Just to be sure about the proper way to use the -once option:

  • In my prospectors configuration, I don't set the ignore_older property and I set close_eof to true
  • I start filebeat with the -once param and let it ingest all the files
  • Once done, I update the configuration to properly set the ignore_older and clean_inactive properties (leaving close_eof set to true)
  • I start filebeat without the -once param (see the sketch after this list)
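
Concretely, here is roughly what the prospector would look like in each phase (just a sketch; the path is a placeholder and the durations are example values):

# Phase 1 (backfill, started with -once): no ignore_older, so every existing file is read
- input_type: log
  paths:
    - U:\foo\bar\*.csv   # placeholder path
  close_eof: true

# Phase 2 (live, started without -once): skip old files and keep the registry small
- input_type: log
  paths:
    - U:\foo\bar\*.csv
  close_eof: true
  ignore_older: 10m      # example value
  clean_inactive: 15m    # example value; must stay above ignore_older + scan_frequency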

Is that the right way to do it?

The only issue I see is that old files are in the same directories as new ones.

Maybe I need to move the old files into dedicated directories and then re-point filebeat to the "live" directories once the -once run is done.

Any advice ?

I would also recommend reading the old files from another directory. You could potentially even use 2 different filebeat config files for this.
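
For example (a sketch; the file and directory names are hypothetical):

# filebeat-old.yml, run once over the archive:  filebeat -once -c filebeat-old.yml
# filebeat-live.yml, run continuously:          filebeat -c filebeat-live.yml
# Both configs are identical except for the paths they watch:
filebeat.prospectors:
- input_type: log
  paths:
    - U:\foo\archive\*.csv   # filebeat-old.yml would point here...
    # - U:\foo\live\*.csv    # ...and filebeat-live.yml here instead
  close_eof: true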

@ruflin Unfortunately I cannot change the files' location. I have to index them without moving or deleting them.

I had created another topic about a different issue with the same use case, described here: https://discuss.elastic.co/t/very-long-to-load-prospectors-and-start-harvesting/86625/8

I have run several tests and would rather consolidate everything in this topic.

On a test server, I indexed all the "old" files using the -once option and the following config (the ignore_older property is set so that only files modified after a given date are indexed):

filebeat.prospectors:
- input_type: log
  paths:
    - U:\foo\bar\*_TO_TARGET\*.csv
    - U:\foo\bar\directory1\export\*.CSV
    - U:\foo\bar\directory2\export\*.csv
  encoding: utf-8
  document_type: type_A
  scan_frequency: 30s
  ignore_older: 4080h
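  # 4080h is about 170 days, i.e. the several months of history to backfill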
  close_eof: true

- input_type: log
  paths:
    - U:\foo\bar\directory3\export\*.xml
  encoding: utf-8
  document_type: type_A
  scan_frequency: 30s
  ignore_older: 4080h
  close_eof: true
  exclude_lines: [ '^<\?xml', '^<Document' ]
  multiline:
    pattern: '^[[:space:]]*<Node'
    negate: true
    match: after
    max_lines: 5000

- input_type: log
  paths:
    - U:\foo\bar\TARGET_TO_*\*.csv
  encoding: utf-8
  document_type: type_B
  scan_frequency: 30s
  ignore_older: 4080h
  close_eof: true

- input_type: log
  paths:
    - T:\bar\foo\*.qid
  encoding: utf-8
  document_type: type_C
  scan_frequency: 30s
  ignore_older: 4080h
  close_eof: true
  multiline:
    pattern: ^EXENAME
    negate: true
    match: after

output.logstash:
  hosts: ["localhost:5044"]

All the source directories are Windows mapped drives pointing to shared directories on two distinct servers.

After indexing the "old" files, I started filebeat without the -once option and with almost the same config as above, except that this:

  ignore_older: 4080h

was replaced by this (clean_inactive must stay greater than ignore_older plus scan_frequency, which holds here: 15m > 10m + 30s):

  ignore_older: 10m
  clean_inactive: 15m

The registry file size is now 40MB.

I no longer hit the "very long prospector loading" issue described in the other topic when restarting filebeat.

But I sometimes see quite a long delay between the time a new file is copied into a watched directory and the time it is received by logstash (which is installed on the same server as filebeat for testing purposes). The delay can be up to 15 minutes.

How can I reduce this delay?

Normally the main reason for delays is scan_frequency. But in your case it is only 30s.

  • Are the delayed files multiline files with just one multiline event inside?
  • Do you see anything special in the log file?

The files with the biggest delays are not multiline. There is only one CSV line inside, and sometimes a few lines.

I don't see anything special in the log files. I think the main problem is the number of files in the directories that filebeat has to monitor.
Since filebeat is monitoring files on two shared directories, I wanted to change the architecture and install filebeat directly on the servers that host the shares. But those servers are NAS devices and I can't install filebeat on them.

I wonder whether it would be a good idea to have multiple filebeat instances that each monitor a few directories, instead of one instance that monitors many directories. This could decrease the registry size (which is currently almost 50MB). I will test it.
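
If I try that, I suppose each instance will need its own registry file so that they don't overwrite each other's state (a sketch; the config file names and the split of paths are hypothetical):

# instance-1.yml: watches only the U:\ directories
filebeat.registry_file: registry-instance1   # one registry per instance
filebeat.prospectors:
- input_type: log
  paths:
    - U:\foo\bar\*_TO_TARGET\*.csv
  close_eof: true

# instance-2.yml: watches only the T:\ directory
filebeat.registry_file: registry-instance2
filebeat.prospectors:
- input_type: log
  paths:
    - T:\bar\foo\*.qid
  close_eof: true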
