Slow performance on big unstructured files

I have a couple of million small files, but a few of them are extremely big (compared to the small ones), around 200-400 GB. The pipeline is set up to fetch data from each file (1 file -> 1 event in Elasticsearch). It works fine, but when it comes to a large file it takes around 2-3 hours, sometimes even more, to process it.

We use the multiline codec and some ruby filter code to extract what's needed. I tried removing all filters and the processing time did not change at all, so I guess that's not the problem here. I also tried playing with max_lines, max_open_files, auto_flush_interval etc., but without any luck. From a hardware perspective I have a 32 GB heap (total memory is 250 GB) and SSD disks (iostat looks fine). I don't know what else to try; is it even possible to speed this up?

input {
    file {
        path => "/opt/logstash/logs/*"
        codec => multiline {
            pattern => "^}"
            negate => true
            what => next
        }
        start_position => "beginning"
        sincedb_path => "/dev/null"
    }
}

If you are trying to process each file as a single event and you have a 200 GB file then you will require at least 200 GB of memory to hold the event. There is no way it will work with a 32 GB heap. I suggest you enable GC logging to see if your JVM is stuck doing endless full GC cycles.
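
For reference, GC logging in Logstash is configured through config/jvm.options. On JDK 9+ the unified logging flag looks roughly like the sketch below; the log path and rotation settings are placeholders, not defaults:

## GC logging (JDK 9+ unified logging; adjust path and sizes to taste)
-Xlog:gc*:file=/var/log/logstash/gc.log:uptime,level,tags:filecount=8,filesize=64m

The uptime, level and tags decorators produce lines in the same [1536.104s][info][gc,start] format that shows up in the output quoted below.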

[2021-09-19T18:31:17,745][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ParNew"}
[2021-09-19T18:31:17,745][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ConcurrentMarkSweep"}
[2021-09-19T18:31:18,676][DEBUG][org.logstash.execution.PeriodicFlush] Pushing flush onto pipeline.
[1536.104s][info   ][gc,start     ] GC(4) Pause Young (Allocation Failure)
[1536.104s][info   ][gc,task      ] GC(4) Using 23 workers of 23 for evacuation
[1536.567s][info   ][gc,heap      ] GC(4) ParNew: 1741737K->138340K(1763584K)
[1536.567s][info   ][gc,heap      ] GC(4) CMS: 0K->96496K(14817728K)
[1536.567s][info   ][gc,metaspace ] GC(4) Metaspace: 92055K->92055K(1140736K)
[1536.567s][info   ][gc           ] GC(4) Pause Young (Allocation Failure) 1700M->229M(16192M) 463.309ms
[1536.567s][info   ][gc,cpu       ] GC(4) User=2.46s Sys=0.67s Real=0.47s
[2021-09-19T18:31:22,748][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ParNew"}
[2021-09-19T18:31:22,748][DEBUG][logstash.instrument.periodicpoller.jvm] collector name {:name=>"ConcurrentMarkSweep"}
[2021-09-19T18:31:23,676][DEBUG][org.logstash.execution.PeriodicFlush] Pushing flush onto pipeline.

These are the GC "logs" in Logstash, and then just these "flush" logs for 45 minutes..

That looks like it is not doing anything. You could enable log.level trace so that filewatch tells you each time it reads something from a file. It is possible you should be enabling auto-flush on the codec.
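
Trace logging can be enabled either persistently in logstash.yml or for a one-off run on the command line; a minimal sketch (the path below is the default location on a package install):

# /etc/logstash/logstash.yml
log.level: trace

or

bin/logstash --log.level=trace -f /path/to/pipeline.conf

Expect very verbose output; the filewatch lines are the ones of interest here.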

I guess I mistyped, the files are 200-400 MB, not GB.. If I leave auto_flush at the default, it just puts plenty of documents in the index instead of one, probably because it flushes before it reaches the end of the file..

[user@host]$ wc -l file1.error.0
5648964 file1.error.0

[user@host]$ ls -l
total 434616
-rwxr-xr-x. 1 user group 222522407 Sep 18 00:41 file1.error.0
-rwxr-xr-x. 1 user group 222521798 Sep 18 00:41 file2.error.0

I increased auto-flush to 20, but it still runs "forever", probably because of the more than 5 million lines..
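
(For context, an auto-flush setting on this codec would look roughly like the sketch below; this is reconstructed, not quoted from the thread. Note that max_lines still defaults to 500, so with a 5.6-million-line file the codec cuts the buffered event long before the end of the file, whatever the auto-flush setting.)

codec => multiline {
    pattern => "^}"
    negate => true
    what => next
    auto_flush_interval => 20    # flush the buffered event after 20s with no new data
    # max_lines is still at its default of 500 here
}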

[2021-09-19T19:51:12,539][TRACE][filewatch.tailmode.processor] Watched processing
[2021-09-19T19:51:12,540][TRACE][filewatch.tailmode.processor] Active - no change {"watched_file"=>"<FileWatch::WatchedFile: @filename='file1.error.0', @state='active', @recent_states='[:watched, :watched]', @bytes_read='222521798', @bytes_unread='0', current_size='222521798', last_stat_size='222521798', file_open?='true', @initial=false, @sincedb_key='8915917 0 64768'>"}
[709.571s][info   ][gc,start     ] GC(3735) Pause Young (Allocation Failure)
[709.572s][info   ][gc,task      ] GC(3735) Using 23 workers of 23 for evacuation
[709.586s][info   ][gc,heap      ] GC(3735) ParNew: 1735294K->170249K(1763584K)
[709.586s][info   ][gc,heap      ] GC(3735) CMS: 2404834K->2404834K(14817728K)
[709.586s][info   ][gc,metaspace ] GC(3735) Metaspace: 93132K->93132K(1140736K)
[709.586s][info   ][gc           ] GC(3735) Pause Young (Allocation Failure) 4043M->2514M(16192M) 14.566ms
[709.586s][info   ][gc,cpu       ] GC(3735) User=0.29s Sys=0.01s Real=0.01s
[709.700s][info   ][gc,start     ] GC(3736) Pause Young (Allocation Failure)
[709.700s][info   ][gc,task      ] GC(3736) Using 23 workers of 23 for evacuation
[709.715s][info   ][gc,heap      ] GC(3736) ParNew: 1734238K->165368K(1763584K)
[709.715s][info   ][gc,heap      ] GC(3736) CMS: 2404834K->2404836K(14817728K)
[709.715s][info   ][gc,metaspace ] GC(3736) Metaspace: 93132K->93132K(1140736K)
[709.715s][info   ][gc           ] GC(3736) Pause Young (Allocation Failure) 4042M->2509M(16192M) 14.642ms
[709.715s][info   ][gc,cpu       ] GC(3736) User=0.29s Sys=0.00s Real=0.02s
[709.829s][info   ][gc,start     ] GC(3737) Pause Young (Allocation Failure)
[709.829s][info   ][gc,task      ] GC(3737) Using 23 workers of 23 for evacuation
[709.851s][info   ][gc,heap      ] GC(3737) ParNew: 1730631K->168343K(1763584K)
[709.851s][info   ][gc,heap      ] GC(3737) CMS: 2404836K->2404837K(14817728K)
[709.851s][info   ][gc,metaspace ] GC(3737) Metaspace: 93132K->93132K(1140736K)
[709.851s][info   ][gc           ] GC(3737) Pause Young (Allocation Failure) 4038M->2512M(16192M) 21.583ms
[709.851s][info   ][gc,cpu       ] GC(3737) User=0.26s Sys=0.10s Real=0.02s
[709.967s][info   ][gc,start     ] GC(3738) Pause Young (Allocation Failure)
[709.967s][info   ][gc,task      ] GC(3738) Using 23 workers of 23 for evacuation
[709.982s][info   ][gc,heap      ] GC(3738) ParNew: 1734121K->169792K(1763584K)
[709.982s][info   ][gc,heap      ] GC(3738) CMS: 2404837K->2404837K(14817728K)
[709.982s][info   ][gc,metaspace ] GC(3738) Metaspace: 93132K->93132K(1140736K)
[709.982s][info   ][gc           ] GC(3738) Pause Young (Allocation Failure) 4041M->2514M(16192M) 15.107ms
[709.982s][info   ][gc,cpu       ] GC(3738) User=0.27s Sys=0.04s Real=0.02s
[710.099s][info   ][gc,start     ] GC(3739) Pause Young (Allocation Failure)
[710.099s][info   ][gc,task      ] GC(3739) Using 23 workers of 23 for evacuation
[710.115s][info   ][gc,heap      ] GC(3739) ParNew: 1736629K->161248K(1763584K)
[710.115s][info   ][gc,heap      ] GC(3739) CMS: 2404837K->2404837K(14817728K)
[710.115s][info   ][gc,metaspace ] GC(3739) Metaspace: 93132K->93132K(1140736K)
[710.115s][info   ][gc           ] GC(3739) Pause Young (Allocation Failure) 4044M->2505M(16192M) 16.103ms
[710.115s][info   ][gc,cpu       ] GC(3739) User=0.28s Sys=0.01s Real=0.02s

Some more GC logs:

[709.586s][info ][gc ] GC(3735) Pause Young (Allocation Failure) 4043M->2514M(16192M) 14.566ms
[709.715s][info ][gc ] GC(3736) Pause Young (Allocation Failure) 4042M->2509M(16192M) 14.642ms
[709.851s][info ][gc ] GC(3737) Pause Young (Allocation Failure) 4038M->2512M(16192M) 21.583ms
[709.982s][info ][gc ] GC(3738) Pause Young (Allocation Failure) 4041M->2514M(16192M) 15.107ms
[710.115s][info ][gc ] GC(3739) Pause Young (Allocation Failure) 4044M->2505M(16192M) 16.103ms

Between 709.586 s and 710.115 s, a span of 529 ms, the JVM accumulated about 82 ms of GC pauses (14.566 + 14.642 + 21.583 + 15.107 + 16.103 ms). Spending 15% of the time paused in GC is a big number, but it still leaves 85% of the time for the JVM to do something. I could write more about these GC logs, but the short version is that you do not have a GC problem.

The filewatch log

@filename='file1.error.0', ...@bytes_read='222521798', @bytes_unread='0', current_size='222521798'

indicates it has read all of that file. Are the events from that file in elasticsearch?
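
(A quick way to check is a count query against the target index; the index name below is a placeholder:

GET your-index/_count

If the count stays at zero while filewatch reports the file fully read, the events are probably still buffered somewhere in the pipeline, e.g. in the multiline codec waiting to be flushed.)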

Thanks. No, the events are not being ingested, and after some time (roughly 45 minutes) I get an index created with 23 documents in it (it should be only 2), and I can't search for them; a GET index/_search takes 10 seconds and shows only 1 document, with a single field which is empty..

If I leave max_lines at the default, then it produces a bunch of documents, but most of the fields are empty, because I guess it does a bulk before it has read the complete document. It looks like it takes N lines, does a bulk, and so on.. I need to set max_lines to a bigger number, but is it supposed to be 6 million? That sounds a little bit weird.. For smaller files an auto_flush of 5 works fine.

Is there any option to tell Logstash that one file should be one document?

Use a multiline codec. An example is here.
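
The linked example isn't reproduced in the thread, but the usual approach for turning a whole file into a single event is roughly the sketch below: a pattern that never matches, so every line gets appended to the previous one, with max_lines and max_bytes raised above the size of the largest file and auto_flush_interval set so the buffered event is emitted once the file stops producing new lines. All values are illustrative.

input {
    file {
        path => "/opt/logstash/logs/*"
        start_position => "beginning"
        sincedb_path => "/dev/null"
        codec => multiline {
            pattern => "^THIS_WILL_NEVER_MATCH"   # placeholder string that never occurs in the data
            negate => true
            what => "previous"                    # append every line to the running event
            max_lines => 10000000                 # must exceed the line count of the biggest file
            max_bytes => "500 MiB"                # must exceed the size of the biggest file
            auto_flush_interval => 5              # emit the event after 5s with no new data
        }
    }
}

If max_lines or max_bytes is too small, the codec will still cut the event early and tag it (e.g. multiline_codec_max_lines_reached), which matches the partial documents described above.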
