Recommendations for parsing thousands of ~10MB files to backfill Elasticsearch

I'm currently using Filebeat -> Logstash -> Elasticsearch to backfill Elasticsearch with exit codes from several thousand text output files, each tens of MB in size and >10k lines, ~200GB in total. The server can push 5GB/s read/write, so disk I/O isn't the bottleneck.

I'm configuring Filebeat to search for specific keywords via include_lines: ['keyword1', ..., 'keywordN'], where the number of keywords could be as high as 20, but at present the performance problems show up with even a single keyword. I also want to use exclude_lines at some point.

Performance is incredibly slow; any recommendations for improving it?

I suspect parsing the files is one aspect; how exactly do include_lines/exclude_lines work?

But possibly also the number of harvesters started in parallel? I have tried limiting them by setting harvester_limit to the number of cores (a variant showing this appears after the config below).

filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /pathto/symlinksdir/*
    symlinks: true
    tags: ["some_value"]
    fields: {log_type: "some_value2"}
    include_lines: ['keyword']
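
For illustration, a sketch of how the same input could look with multiple keywords, an exclude_lines filter, and a harvester_limit cap. The keyword names, the 'DEBUG' pattern, and the limit of 8 are placeholders, not values from the original setup:

filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /pathto/symlinksdir/*
    symlinks: true
    tags: ["some_value"]
    fields: {log_type: "some_value2"}
    # Each entry is a regular expression; only lines matching at least one are shipped.
    include_lines: ['keyword1', 'keyword2', 'keywordN']
    # Applied after include_lines: drop any matching lines from the result.
    exclude_lines: ['DEBUG']
    # Limit the number of files harvested in parallel for this input (default 0 = no limit).
    harvester_limit: 8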

Thanks

I'm a bit surprised that you hit a bottleneck on the Filebeat side. Could you share a bit more about what your include_lines statement looks like? Regexp?
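
For reference, include_lines entries are treated as regular expressions, so a bare token like 'keyword' is an unanchored match that the regex engine has to try at every position of every line. A small sketch of the difference, assuming the keyword is known to appear at the start of a line (placeholder pattern, not from the original post):

filebeat.inputs:
  - type: log
    paths:
      - /pathto/symlinksdir/*
    # Unanchored: scans each full line for the substring.
    # include_lines: ['keyword']
    # Anchored: can reject non-matching lines faster, but only works
    # if the keyword really appears at the beginning of the line.
    include_lines: ['^keyword']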

What results do you see if you just ship everything? Much higher throughput?

What is the ratio between lines published and lines filtered out? The registry file keeps track of the file offset, but needs some I/O to be written. If the ratio is somewhat 'bad', the registry writes will slow down Filebeat, since writing the registry also requires an fsync. Setting filebeat.registry_flush: 1s helps in this case (see the registry_flush docs).
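
A minimal sketch of that setting in filebeat.yml, using the 1s value suggested above (the exact key name can differ between Filebeat versions, so check the registry docs for your release):

# filebeat.yml
# Flush registry updates at most once per second instead of after every published batch.
filebeat.registry_flush: 1s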
