Slow performance on big unstructured files

I have a couple of million small files, but a few of them are extremely big (compared to the small ones), around 200-400 GB. The pipeline is set up to fetch data from each file (1 file -> 1 event in Elasticsearch). It works fine, but when it comes to a large file it takes around 2-3 hours, sometimes even more, to process it.

We use the multiline codec and some ruby filter code to extract what's needed. I tried removing all filters and the processing time did not change at all, so I guess that's not the problem here. I also tried playing with max_lines, max_open_files, auto_flush_interval etc., but without any luck. From a hardware perspective I have a 32 GB heap (total memory is 250 GB) and SSD disks (iostat looks fine). I don't know what else to try; is it even possible to speed this up?

input {
    file {
        path => "/opt/logstash/logs/*"
        codec => multiline {
            pattern => "^}"
            negate => true
            what => next
        }
        start_position => "beginning"
        sincedb_path => "/dev/null"
    }
}

If you are trying to process each file as a single event and you have a 200 GB file then you will require at least 200 GB of memory to hold the event. There is no way it will work with a 32 GB heap. I suggest you enable GC logging to see if your JVM is stuck doing endless full GC cycles.
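
For reference, GC logging in Logstash is configured through config/jvm.options. On JDK 9+ the unified logging flag looks roughly like the sketch below; the log path and rotation settings are placeholders, not defaults:

## GC logging (JDK 9+ unified logging; adjust path and sizes to taste)
-Xlog:gc*:file=/var/log/logstash/gc.log:uptime,level,tags:filecount=8,filesize=64m

The uptime, level and tags decorators produce lines in the same [1536.104s][info][gc,start] format that shows up in the output quoted below.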

[2021-09-19T18:31:17,745][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ParNew"}
[2021-09-19T18:31:17,745][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ConcurrentMarkSweep"}
[2021-09-19T18:31:18,676][DEBUG][org.logstash.execution.PeriodicFlush] Pushing flush onto pipeline.
[1536.104s][info   ][gc,start     ] GC(4) Pause Young (Allocation Failure)
[1536.104s][info   ][gc,task      ] GC(4) Using 23 workers of 23 for evacuation
[1536.567s][info   ][gc,heap      ] GC(4) ParNew: 1741737K->138340K(1763584K)
[1536.567s][info   ][gc,heap      ] GC(4) CMS: 0K->96496K(14817728K)
[1536.567s][info   ][gc,metaspace ] GC(4) Metaspace: 92055K->92055K(1140736K)
[1536.567s][info   ][gc           ] GC(4) Pause Young (Allocation Failure) 1700M->229M(16192M) 463.309ms
[1536.567s][info   ][gc,cpu       ] GC(4) User=2.46s Sys=0.67s Real=0.47s
[2021-09-19T18:31:22,748][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ParNew"}
[2021-09-19T18:31:22,748][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ConcurrentMarkSweep"}
[2021-09-19T18:31:23,676][DEBUG][org.logstash.execution.PeriodicFlush] Pushing flush onto pipeline.

These are the GC "logs" in Logstash, and then just these "flush" logs for 45 minutes..

That looks like it is not doing anything. You could enable log.level trace so that filewatch tells you each time it reads something from a file. It is possible you should be enabling auto-flush on the codec.
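
Trace logging can be enabled either persistently in logstash.yml or for a one-off run on the command line; a minimal sketch (the path below is the default location on a package install):

# /etc/logstash/logstash.yml
log.level: trace

or

bin/logstash --log.level=trace -f /path/to/pipeline.conf

Expect very verbose output; the filewatch lines are the ones of interest here.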

I guess I mistyped, the files are 200-400 MB, not GB.. If I leave auto_flush at the default, it just puts plenty of documents in the index instead of one, probably because it flushes before it reaches the end of the file..

[user@host]$ wc -l file1.error.0
5648964 file1.error.0

[user@host]$ ls -l
total 434616
-rwxr-xr-x. 1 user group 222522407 Sep 18 00:41 file1.error.0
-rwxr-xr-x. 1 user group 222521798 Sep 18 00:41 file2.error.0

I increased auto-flush to 20, but it still runs "forever", probably because of the more than 5 million lines..
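
(For context, an auto-flush setting on this codec would look roughly like the sketch below; this is reconstructed, not quoted from the thread. Note that max_lines still defaults to 500, so with a 5.6-million-line file the codec cuts the buffered event long before the end of the file, whatever the auto-flush setting.)

codec => multiline {
    pattern => "^}"
    negate => true
    what => next
    auto_flush_interval => 20    # flush the buffered event after 20s with no new data
    # max_lines is still at its default of 500 here
}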

[2021-09-19T19:51:12,539][TRACE][filewatch.tailmode.processor] Watched processing
[2021-09-19T19:51:12,540][TRACE][filewatch.tailmode.processor] Active - no change {"watched_file"=>"<FileWatch::WatchedFile: @filename='file1.error.0', @state='active', @recent_states='[:watched, :watched]', @bytes_read='222521798', @bytes_unread='0', current_size='222521798', last_stat_size='222521798', file_open?='true', @initial=false, @sincedb_key='8915917 0 64768'>"}
[709.571s][info   ][gc,start     ] GC(3735) Pause Young (Allocation Failure)
[709.572s][info   ][gc,task      ] GC(3735) Using 23 workers of 23 for evacuation
[709.586s][info   ][gc,heap      ] GC(3735) ParNew: 1735294K->170249K(1763584K)
[709.586s][info   ][gc,heap      ] GC(3735) CMS: 2404834K->2404834K(14817728K)
[709.586s][info   ][gc,metaspace ] GC(3735) Metaspace: 93132K->93132K(1140736K)
[709.586s][info   ][gc           ] GC(3735) Pause Young (Allocation Failure) 4043M->2514M(16192M) 14.566ms
[709.586s][info   ][gc,cpu       ] GC(3735) User=0.29s Sys=0.01s Real=0.01s
[709.700s][info   ][gc,start     ] GC(3736) Pause Young (Allocation Failure)
[709.700s][info   ][gc,task      ] GC(3736) Using 23 workers of 23 for evacuation
[709.715s][info   ][gc,heap      ] GC(3736) ParNew: 1734238K->165368K(1763584K)
[709.715s][info   ][gc,heap      ] GC(3736) CMS: 2404834K->2404836K(14817728K)
[709.715s][info   ][gc,metaspace ] GC(3736) Metaspace: 93132K->93132K(1140736K)
[709.715s][info   ][gc           ] GC(3736) Pause Young (Allocation Failure) 4042M->2509M(16192M) 14.642ms
[709.715s][info   ][gc,cpu       ] GC(3736) User=0.29s Sys=0.00s Real=0.02s
[709.829s][info   ][gc,start     ] GC(3737) Pause Young (Allocation Failure)
[709.829s][info   ][gc,task      ] GC(3737) Using 23 workers of 23 for evacuation
[709.851s][info   ][gc,heap      ] GC(3737) ParNew: 1730631K->168343K(1763584K)
[709.851s][info   ][gc,heap      ] GC(3737) CMS: 2404836K->2404837K(14817728K)
[709.851s][info   ][gc,metaspace ] GC(3737) Metaspace: 93132K->93132K(1140736K)
[709.851s][info   ][gc           ] GC(3737) Pause Young (Allocation Failure) 4038M->2512M(16192M) 21.583ms
[709.851s][info   ][gc,cpu       ] GC(3737) User=0.26s Sys=0.10s Real=0.02s
[709.967s][info   ][gc,start     ] GC(3738) Pause Young (Allocation Failure)
[709.967s][info   ][gc,task      ] GC(3738) Using 23 workers of 23 for evacuation
[709.982s][info   ][gc,heap      ] GC(3738) ParNew: 1734121K->169792K(1763584K)
[709.982s][info   ][gc,heap      ] GC(3738) CMS: 2404837K->2404837K(14817728K)
[709.982s][info   ][gc,metaspace ] GC(3738) Metaspace: 93132K->93132K(1140736K)
[709.982s][info   ][gc           ] GC(3738) Pause Young (Allocation Failure) 4041M->2514M(16192M) 15.107ms
[709.982s][info   ][gc,cpu       ] GC(3738) User=0.27s Sys=0.04s Real=0.02s
[710.099s][info   ][gc,start     ] GC(3739) Pause Young (Allocation Failure)
[710.099s][info   ][gc,task      ] GC(3739) Using 23 workers of 23 for evacuation
[710.115s][info   ][gc,heap      ] GC(3739) ParNew: 1736629K->161248K(1763584K)
[710.115s][info   ][gc,heap      ] GC(3739) CMS: 2404837K->2404837K(14817728K)
[710.115s][info   ][gc,metaspace ] GC(3739) Metaspace: 93132K->93132K(1140736K)
[710.115s][info   ][gc           ] GC(3739) Pause Young (Allocation Failure) 4044M->2505M(16192M) 16.103ms
[710.115s][info   ][gc,cpu       ] GC(3739) User=0.28s Sys=0.01s Real=0.02s

Some more GC logs:

[709.586s][info ][gc ] GC(3735) Pause Young (Allocation Failure) 4043M->2514M(16192M) 14.566ms
[709.715s][info ][gc ] GC(3736) Pause Young (Allocation Failure) 4042M->2509M(16192M) 14.642ms
[709.851s][info ][gc ] GC(3737) Pause Young (Allocation Failure) 4038M->2512M(16192M) 21.583ms
[709.982s][info ][gc ] GC(3738) Pause Young (Allocation Failure) 4041M->2514M(16192M) 15.107ms
[710.115s][info ][gc ] GC(3739) Pause Young (Allocation Failure) 4044M->2505M(16192M) 16.103ms

Between 709.586 s and 710.115 s, a span of 529 ms, the JVM accumulated about 82 ms of GC pauses (14.566 + 14.642 + 21.583 + 15.107 + 16.103 ms). Spending 15% of the time paused in GC is a big number, but it still leaves 85% of the time for the JVM to do something. I could write more about these GC logs, but the short version is that you do not have a GC problem.

The filewatch log

@filename='file1.error.0', ...@bytes_read='222521798', @bytes_unread='0', current_size='222521798'

indicates it has read all of that file. Are the events from that file in elasticsearch?
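
(A quick way to check is a count query against the target index; the index name below is a placeholder:

GET your-index/_count

If the count stays at zero while filewatch reports the file fully read, the events are probably still buffered somewhere in the pipeline, e.g. in the multiline codec waiting to be flushed.)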

Thanks. No, the events are not being ingested, and after some time (roughly 45 minutes) I get an index created with 23 documents in it (it should be only 2), and I can't search for them; a GET index/_search takes 10 seconds and shows only 1 document, with a single field which is empty..

If I leave max_lines at the default, then it produces a bunch of documents, but most of the fields are empty, because I guess it does a bulk before it has read the complete document. It looks like it takes N lines, does a bulk, and so on.. I need to set max_lines to a bigger number, but is it supposed to be 6 million? That sounds a little bit weird.. For smaller files an auto_flush of 5 works fine.

Is there any option to tell Logstash that one file should be one document?

Use a multiline codec. An example is here.
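
The linked example isn't reproduced in the thread, but the usual approach for turning a whole file into a single event is roughly the sketch below: a pattern that never matches, so every line gets appended to the previous one, with max_lines and max_bytes raised above the size of the largest file and auto_flush_interval set so the buffered event is emitted once the file stops producing new lines. All values are illustrative.

input {
    file {
        path => "/opt/logstash/logs/*"
        start_position => "beginning"
        sincedb_path => "/dev/null"
        codec => multiline {
            pattern => "^THIS_WILL_NEVER_MATCH"   # placeholder string that never occurs in the data
            negate => true
            what => "previous"                    # append every line to the running event
            max_lines => 10000000                 # must exceed the line count of the biggest file
            max_bytes => "500 MiB"                # must exceed the size of the biggest file
            auto_flush_interval => 5              # emit the event after 5s with no new data
        }
    }
}

If max_lines or max_bytes is too small, the codec will still cut the event early and tag it (e.g. multiline_codec_max_lines_reached), which matches the partial documents described above.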
