Recommended configuration for write once files

ogauchard · May 23, 2017, 9:03am

Hi all,

I am looking for the most efficient configuration for our use case.

We have almost 2000 files to index each day coming from several directories.

Each file needs to be read only once as it is never updated.

I wonder how to configure ignore_older, clean_inactive and close_eof to handle this scenario with the best performance.

Another question: how should I set these properties when re-indexing all the files (several months' worth)?

Thank you.

ruflin · May 23, 2017, 1:52pm

What happens to the old files of the previous day?

For the old files, it seems that ignore_older will become tricky. One thing you could try for the old files is using the -once param and running filebeat only once until completion.

ogauchard · May 23, 2017, 1:59pm

Old files of previous days are left in the same directory as new files.

I didn't know about the -once param!

I think this is a good option for me: index the old files first, then start filebeat in a "regular flow" way.

ogauchard · May 23, 2017, 2:52pm

Just to be sure about how to properly use the -once option:

In my prospector configuration, I don't set the ignore_older property and I set close_eof to true
I start filebeat with the -once param and let it ingest all the files
Once done, I update the configuration to set the ignore_older and clean_inactive properties (leaving close_eof set to true)
I start filebeat without the -once param
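
The steps above might look roughly like this for a single prospector. This is only a sketch: the path is a placeholder, the config file name filebeat.yml is assumed, and the ignore_older / clean_inactive values shown for the second step are illustrative rather than recommended.

```yaml
# Step 1: backfill. Run with: filebeat -c filebeat.yml -once
# (with -once, filebeat runs each prospector until its harvesters are closed, then exits)
filebeat.prospectors:
- input_type: log
  paths:
    - C:\placeholder\export\*.csv   # hypothetical path
  close_eof: true                   # close each file as soon as its end is reached
  # ignore_older is deliberately not set, so files of any age are read

# Step 2: regular flow. Add the two settings below to the prospector above,
# then start filebeat again without -once:
#   ignore_older: 24h      # illustrative value
#   clean_inactive: 25h    # illustrative; must be greater than ignore_older + scan_frequency
```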

Is that the right way?

The only issue I see is that old files are in the same directories as new ones.

Maybe I need to move the old files into dedicated directories and point filebeat back at the "live" directories once the -once execution is done.

Any advice?

ruflin · May 26, 2017, 5:52am

I would also recommend reading the old files from another directory. Potentially you could even use 2 different filebeat config files for this.

ogauchard · May 29, 2017, 7:26am

@ruflin Unfortunately I cannot change the location of the files. I have to index them without moving or deleting them.

ogauchard · May 29, 2017, 8:05am

I created another topic about a different issue with the same use case described here: https://discuss.elastic.co/t/very-long-to-load-prospectors-and-start-harvesting/86625/8

I have done several tests and I prefer to merge everything into this topic.

On a test server, I indexed all the "old" files using the -once option and the following config (the ignore_older property is set so that only files newer than a given date are indexed):

filebeat.prospectors:

- input_type: log
  paths:
    - U:\foo\bar\*_TO_TARGET\*.csv
    - U:\foo\bar\directory1\export\*.CSV
    - U:\foo\bar\directory2\export\*.csv
  encoding: utf-8
  document_type: type_A
  scan_frequency: 30s
  ignore_older: 4080h
  close_eof: true

- input_type: log
  paths:
    - U:\foo\bar\directory3\export\*.xml
  encoding: utf-8
  document_type: type_A
  scan_frequency: 30s
  ignore_older: 4080h
  close_eof: true
  exclude_lines: [ '^<\?xml', '^<Document' ]
  multiline:
    pattern: '^[[:space:]]*<Node'
    negate: true
    match: after
    max_lines: 5000

- input_type: log
  paths:
    - U:\foo\bar\TARGET_TO_*\*.csv
  encoding: utf-8
  document_type: type_B
  scan_frequency: 30s
  ignore_older: 4080h
  close_eof: true

- input_type: log
  paths:
    - T:\bar\foo\*.qid
  encoding: utf-8
  document_type: type_C
  scan_frequency: 30s
  ignore_older: 4080h
  close_eof: true
  multiline:
    pattern: ^EXENAME
    negate: true
    match: after

output.logstash:
  hosts: ["localhost:5044"]

All the source directories are Windows mapped drives pointing to shared directories on two distinct servers.

After indexing the "old" files, I started filebeat without the -once option and with almost the same config as the previous one, except for this:

  ignore_older: 4080h

replaced by this:

  ignore_older: 10m
  clean_inactive: 15m
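
For reference, the Filebeat documentation states that clean_inactive must be greater than ignore_older + scan_frequency, and that a file's state is only removed once the file is already being ignored. With the values above, 10m + 30s = 10m30s, which is below 15m, so the constraint holds. A sketch of one live prospector with that check written out (the path is a placeholder, not one of the real directories):

```yaml
filebeat.prospectors:
- input_type: log
  paths:
    - C:\placeholder\export\*.csv   # hypothetical path
  scan_frequency: 30s
  close_eof: true
  ignore_older: 10m      # files last modified more than 10 minutes ago are skipped
  clean_inactive: 15m    # registry state is removed after 15 minutes of inactivity
  # check: clean_inactive (15m) > ignore_older (10m) + scan_frequency (30s) = 10m30s
```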

The registry file size is now 40MB.

I do not encounter the "very long prospector loading" issue when restarting filebeat as described in the other topic.

But sometimes there is quite a big delay between the time a new file is copied to a watched directory and the time it is received by logstash (which is installed on the same server as filebeat for testing purposes). The delay can be up to 15 minutes.

How can I reduce this delay?

ruflin · May 31, 2017, 11:50pm

Normally the main reason for delays is scan_frequency. But in your case it is only 30s.

Are the files with the delay multiline with just one multiline event inside?
Do you see anything special in the log file?

ogauchard · June 2, 2017, 7:19am

The files with the biggest delays are not multiline. There is only one csv line inside, and sometimes a few lines.

I don't see anything special in the log files. I think the main problem is the number of files in the directories that filebeat has to monitor.
As filebeat is monitoring files on two shared directories, I wanted to change the architecture and install filebeat directly on the servers that host the shared directories. But these servers are NAS devices and I can't install filebeat on them.

I wonder if it would be a good idea to have multiple filebeat instances that each monitor a few directories instead of one instance that monitors many directories. This could decrease the registry size (which is currently almost 50MB). I will test it.
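
One way to try the multi-instance idea is to give each instance its own config file and its own data directory, so that each one keeps a separate, smaller registry. A minimal sketch, assuming Filebeat 5.x settings: the config file names and the path.data directory are hypothetical, and the prospector paths are copied from the config earlier in this topic.

```yaml
# filebeat-u-drive.yml (hypothetical file name): instance for some of the U: paths only
# Run with: filebeat -c filebeat-u-drive.yml
path.data: C:\filebeat\data-u-drive   # hypothetical; by default the registry is stored under path.data

filebeat.prospectors:
- input_type: log
  paths:
    - U:\foo\bar\directory1\export\*.CSV
    - U:\foo\bar\directory2\export\*.csv
  encoding: utf-8
  document_type: type_A
  scan_frequency: 30s
  ignore_older: 10m
  clean_inactive: 15m
  close_eof: true

output.logstash:
  hosts: ["localhost:5044"]

# A second instance would get its own config file, its own path.data, and a
# different set of prospectors (for example only the T:\bar\foo\*.qid one).
```

Whether splitting the work this way actually shortens the delay has not been tested here; it only ensures that each instance scans fewer directories and tracks fewer files in its registry.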

system · June 30, 2017, 7:19am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
