Recommendations for parsing thousands of ~10MB files to backfill Elasticsearch

I'm currently using Filebeat -> Logstash -> Elasticsearch to backfill Elasticsearch with exit codes from several thousand text output files, each tens of MB in size and >10k lines, ~200GB in total. The server can push 5GB/s read/write, so disk I/O isn't the bottleneck.

I'm configuring Filebeat to search for specific keywords via include_lines: ['keyword1', ..., 'keywordN'], where the number of keywords could be as high as 20, but at present the performance problems show up with even a single keyword. I also want to use exclude_lines at some point.

Performance is incredibly slow; any recommendations for improving it?

I suspect parsing the files is one aspect; how exactly do include_lines/exclude_lines work?

But possibly also the number of harvesters started in parallel? I have tried limiting them by setting harvester_limit to the number of cores (a variant showing this appears after the config below).

filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /pathto/symlinksdir/*
    symlinks: true
    tags: ["some_value"]
    fields: {log_type: "some_value2"}
    include_lines: ['keyword']
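
For illustration, a sketch of how the same input could look with multiple keywords, an exclude_lines filter, and a harvester_limit cap. The keyword names, the 'DEBUG' pattern, and the limit of 8 are placeholders, not values from the original setup:

filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /pathto/symlinksdir/*
    symlinks: true
    tags: ["some_value"]
    fields: {log_type: "some_value2"}
    # Each entry is a regular expression; only lines matching at least one are shipped.
    include_lines: ['keyword1', 'keyword2', 'keywordN']
    # Applied after include_lines: drop any matching lines from the result.
    exclude_lines: ['DEBUG']
    # Limit the number of files harvested in parallel for this input (default 0 = no limit).
    harvester_limit: 8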

Thanks

I'm a bit surprised that you hit a bottleneck on the Filebeat side. Could you share a bit more about what your include_lines statement looks like? Regexp?
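
For reference, include_lines entries are treated as regular expressions, so a bare token like 'keyword' is an unanchored match that the regex engine has to try at every position of every line. A small sketch of the difference, assuming the keyword is known to appear at the start of a line (placeholder pattern, not from the original post):

filebeat.inputs:
  - type: log
    paths:
      - /pathto/symlinksdir/*
    # Unanchored: scans each full line for the substring.
    # include_lines: ['keyword']
    # Anchored: can reject non-matching lines faster, but only works
    # if the keyword really appears at the beginning of the line.
    include_lines: ['^keyword']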

What results do you see if you just ship everything? Much higher throughput?

What is the ratio between lines published and lines filtered out? The registry file keeps track of the file offset, but needs some I/O to be written. If the ratio is somewhat 'bad', the registry writes will slow down Filebeat, since writing the registry also requires an fsync. Setting filebeat.registry_flush: 1s helps in this case (see the registry_flush docs).
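
A minimal sketch of that setting in filebeat.yml, using the 1s value suggested above (the exact key name can differ between Filebeat versions, so check the registry docs for your release):

# filebeat.yml
# Flush registry updates at most once per second instead of after every published batch.
filebeat.registry_flush: 1s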
